
I want to read a large file, check the value in one column of each line, and store the lines whose column value matches. Sample lines:

7774777761 72288833         2015/03/20     23:59:37       26       26   38 
  99944524 09671017         2015/03/20     23:59:44       18        1    8

I did it in Python this way:

import sys
if __name__=="__main__":
    # argv[4] is read below, so at least five entries are required
    if (len(sys.argv)<5):
        sys.stderr.write('Usage: trk finame fout column value\n')
        sys.exit(1)
    finame=open(sys.argv[1],'r')
    result=open(sys.argv[2],'w')
    nos=open("nos.txt",'w')
    col=int(sys.argv[3])
    val=sys.argv[4]
    for l in finame:
        llist=l.split()
        try:
            if llist[col]==val:
                result.write(l)
        except IndexError:  # the line has too few columns
            nos.write(l)
    result.close()
    nos.close()

I then tried to do the same in C++ using a regex:

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <regex>
#include <cstdlib> // atoi, exit
using namespace std;

int main(int argc, char* argv[])
{
  ifstream fstr;
  ofstream ofstr;
  string istr,result;
  int col;
  string val;
  if(argc<5){
    cout<<"you must enter right arguments"<<endl;
    cout<<"colgrab inputfile outputfile desired_col desired_val"<<endl;
    cout<<"for example :"<<endl;
    cout<<"colgrab TrkTicket.txt INCOM_HWI.txt 6 1"<<endl;
    return 1; // nothing to process without the arguments
  }else{
    fstr.open(argv[1]);
    ofstr.open(argv[2]);
    col=atoi(argv[3]);
    val=argv[4];
    if(!fstr)
    {
      cerr << "File could not be opened" << endl;
      exit( 1 );
    }

    if(!ofstr)
    {
      cerr << "File could not be opened" << endl;
      exit( 1 );
    }
  }

  while(getline(fstr,istr)){
    //  cout<<istr<<endl;
    try {
      regex re(R"XXX( *(\d+) +(\d+) +(\d+/\d+/\d+) +(\d+:\d+:\d+) +(\d+) +(\d+) +(\d+).)XXX");
      std::smatch match;
      //cout<<istr<<endl;
      if (regex_search(istr, match, re) && match.size() > 1) {
        result = match.str(col);

        if(val==result){
          ofstr<<istr<<endl;
        }
        //cout<<result<<endl;
      } else {
        //result = std::string("No match found");
        //cout<<result<<endl;

      }
    } catch (std::regex_error& e) {
      // Syntax error in the regular expression
      //cerr<<"Syntax error in the regular expression "<<endl;
    }
  }


  return 0;
}

My purpose in doing this was speed. What surprised me was that the Python version did the job in less than 10 seconds for a 270 MB file, while the C++ version had not finished after 10 minutes.

How can I fix the C++ version so that it does the job in less time?


Python version: Python 3.2

C++ compiler: GCC (g++) 4.9.1


Edit 1

I tried all the proposed approaches, and with MikeMB's approach the two versions are almost even:

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <regex>
#include <cstdlib> // atoi, exit
using namespace std;

int main(int argc, char* argv[])
{
    ifstream fstr;
    ofstream ofstr;
    string istr,result;
    int col;
    string val;
    if(argc<5){
        cout<<"you must enter right arguments"<<endl;
        cout<<"colgrab inputfile outputfile desired_col desired_val"<<endl;
        cout<<"for example :"<<endl;
        cout<<"colgrab TrkTicket.txt INCOM_HWI.txt 6 1"<<endl;
        return 1; // nothing to process without the arguments
    }else{
    fstr.open(argv[1]);
    ofstr.open(argv[2]);
    col=atoi(argv[3]);
    val=argv[4];
    if(!fstr)
       {
          cerr << "File could not be opened" << endl;
          exit( 1 );
       }

    if(!ofstr)
       {
          cerr << "File could not be opened" << endl;
          exit( 1 );
       }
    }

    while(getline(fstr,istr)){
        stringstream sstr(istr);
        int i = 0;
        while (sstr >> result) {
           // col is 1-based on the command line, i is 0-based
           if (i == col-1 && result == val) {
               ofstr << istr << "\n";
               break;
           }
           i++;
        }
    }

    return 0;
}

Is there a way to improve performance further?

  • Well, the regex matching is somewhat more complex than the l.split() you let Python do, but I suspect that the main culprit is I/O. Replace the endl in ofstr<<istr<<endl with '\n' to avoid flushing all the time. Does that change things? Commented Apr 21, 2015 at 20:18
  • First thing you can try is to move the regex re(...); part outside of the while loop. Also: did you compile with optimizations? Commented Apr 21, 2015 at 20:18
  • I tried both, and the time came down to about 60 seconds, but that is still about 6 times the Python version. It is very interesting that replacing endl with "\n" has that much effect. I did compile with -O3 optimization. Commented Apr 21, 2015 at 20:31
  • What if you get rid of regex entirely? For example by using operator>> from ifstream? Or by using getline and splitting it using stringstream? I strongly suspect that using regular expressions is the issue here. Commented Apr 21, 2015 at 20:45
  • Can you explain more about each method? Commented Apr 21, 2015 at 20:56

3 Answers


The answer to Edit 1:

  • Add std::ios::sync_with_stdio(false); as the first line of main() for a nice speed boost.

  • You don't need to read a line and then convert it to a stringstream: you can read values directly from fstr to avoid copying.

  • To get your data in milliseconds, you can use an indexed data format: for example, import your data into an SQLite database, index the columns, and use database queries to extract it.


4 Comments

How can I read from the ifstream directly, line by line, without a stringstream?
I mean, you said: "You don't need to read a line and then convert it to stringstream - you can read values directly from fstr to avoid copying." How can I do this?
@pumper something like vector<string> values(7); while (fstr.good()) { for (size_t i=0; i<values.size(); i++) { fstr>>values[i]; } fstr.ignore(std::numeric_limits<std::streamsize>::max(), '\n'); /* .... */ }
Hi, using this method I get a time almost half that of the stringstream version. Thanks, I think I got what I wanted. Thanks, all.

In order to remove the regex entirely, you can try this (only the while loop is modified):

while(getline(fstr,istr)){
    stringstream sstr(istr);
    string field[7];
    for(int i = 0; i < 7; i++)
        sstr >> field[i];

    // do whatever you want with read values
}

This should work assuming that each line you read has 7 columns and that the values in the columns do not contain whitespace.


Constructing a regex is quite expensive as it involves constructing a state machine, so, as mentioned in the comments, you should move the regex construction out of the loop, so you have to pay that price only once.
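As a rough sketch of that change, the loop from the question could be restructured as below. The function name `grep_column` is hypothetical, and two details differ from the question's code and are my assumptions: the trailing `.` of the pattern is dropped (it would reject a line that ends right after the last digit), and the `std::regex::optimize` flag is added as a hint that may or may not help on a given implementation.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <regex>
#include <string>

// Hypothetical helper: writes every line whose capture group `col`
// (1-based, as in the question) equals `val`; returns the match count.
std::size_t grep_column(std::istream& in, std::ostream& out,
                        std::size_t col, const std::string& val)
{
    // Built once, before the loop, instead of once per line.
    const std::regex re(
        R"XXX( *(\d+) +(\d+) +(\d+/\d+/\d+) +(\d+:\d+:\d+) +(\d+) +(\d+) +(\d+))XXX",
        std::regex::optimize);
    std::smatch match;
    std::string line;
    std::size_t matches = 0;
    while (std::getline(in, line)) {
        if (std::regex_search(line, match, re) && col < match.size()
            && match.str(col) == val) {
            out << line << '\n';  // '\n' rather than endl: no flush per line
            ++matches;
        }
    }
    return matches;
}
```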

However, for such a simple case you probably don't need a regex at all. I'm not sure whether this is actually faster, but you could try the following:

while (getline(fstr, istr)){        
    std::stringstream ss(istr);
    int i = 0;      
    while (ss >> result) {              
        if (i == col && result == val) {
            ofstr << istr << "\n";
            break;
        }
        i++;
    }
}
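If the per-line stringstream turns out to be the remaining cost, one possible variation (not something proposed in this thread, and not measured here) is to scan the line in place with `find_first_of` and `find_first_not_of`, so no tokens are copied. The helper name `field_equals` is hypothetical, and `col` is 0-based here, like `i` in the loop above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true when the 0-based whitespace-separated
// field number `col` of `line` is exactly `val`. Nothing is copied; the
// comparison is done on the characters of `line` itself.
bool field_equals(const std::string& line, std::size_t col,
                  const std::string& val)
{
    const char* ws = " \t\r";
    std::size_t pos = line.find_first_not_of(ws);  // start of first field
    for (std::size_t i = 0; pos != std::string::npos; ++i) {
        std::size_t end = line.find_first_of(ws, pos);  // end of this field
        if (end == std::string::npos) end = line.size();
        if (i == col)
            return line.compare(pos, end - pos, val) == 0;
        pos = line.find_first_not_of(ws, end);  // start of next field
    }
    return false;  // the line has fewer than col + 1 fields
}
```

Inside the `getline` loop this would replace the inner `while` with a single test such as `if (field_equals(istr, col, val)) ofstr << istr << "\n";`.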

