
I have a file with the following data:

<<row>>12|xyz|abc|2.34<</row>>
<<eof>>

The file may contain several rows like this. I am trying to design a parser that reads every row in the file and returns them all as an array. What would be the best way to do this? The code has to be written in Python. It should either skip lines that do not start with <<row>> or raise an error.

=======> UPDATE <========

I just found that a single <<row>> can span multiple lines, so neither my code nor the code in the answers below works anymore. Can someone please suggest an efficient solution?

The data files can contain hundreds to several thousands of rows.

  • Looks like a pretty straightforward task. Where are you having problems? Commented May 27, 2013 at 19:04
  • It is a simple task, I know, but I want to know how a different programmer would solve it. Commented May 27, 2013 at 19:14
  • Post the solution you already have. You will get advice on how to improve it. Commented May 27, 2013 at 19:18
  • While working with the code, I found that rows in the data files are not restricted to one line, so a single <<row>> can span multiple lines. My code no longer works, and neither do the answers below. Can you please help? Should I re-post this as a new question, or edit this one? Commented May 27, 2013 at 22:05

3 Answers


A simple way without regular expressions:

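output = []
with open('input.txt', 'r') as f:
    for line in f:
        # Strip the trailing newline so the comparisons below can match.
        line = line.strip()
        if line == '<<eof>>':
            break
        if not line.startswith('<<row>>'):
            continue
        # Drop '<<row>>' (7 chars) and '<</row>>' (8 chars), then split the fields.
        output.append(line[7:-8].split('|'))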

This collects every line starting with <<row>> until it reaches a line containing only <<eof>>.
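
The question also asks for the option of raising an error on lines that do not start with <<row>>. A minimal sketch of such a stricter variant follows. It assumes every non-blank line before <<eof>> must be a complete single-line row; the name parse_rows_strict, the module-level constants, and the choice of ValueError are illustrative assumptions, not part of the answer above.

```python
ROW_START = '<<row>>'
ROW_END = '<</row>>'
EOF_MARK = '<<eof>>'


def parse_rows_strict(lines):
    """Return a list of field lists; raise ValueError on a malformed line.

    `lines` is any iterable of strings, such as an open file object.
    """
    output = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if line == EOF_MARK:
            break
        if not line:
            continue  # assumption: blank lines are harmless and skipped
        if not (line.startswith(ROW_START) and line.endswith(ROW_END)):
            raise ValueError('line %d is not a valid row: %r' % (number, line))
        # Slice off both tags, then split the remaining text into fields.
        output.append(line[len(ROW_START):-len(ROW_END)].split('|'))
    return output
```

It could be called as parse_rows_strict(f) inside a with open('input.txt') as f: block. Rows that span several lines would be rejected by this sketch.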

import re

# Compiled once: an integer, two words, and a decimal number between the row tags.
ROW_RE = re.compile(r'<<row>>(\d+)\|(\w+)\|(\w+)\|([\d.]+)<</row>>$')

def parseFile(fileName):
  with open(fileName) as f:
    # Try every line; keep the captured groups of those that match.
    matches = (ROW_RE.match(line) for line in f)
    return [m.groups() for m in matches if m]

And? Am I different? ;-)

5 Comments

I guess. But doing it without regular expressions is better, I believe.
How come you believe this? Regexps are more general. Using split etc. always amounts to writing a special version for a special case. If a slightly modified version of the format pops up in the future, adjusting the regexp is a cinch, whereas a version built on simpler parsing mechanisms can quickly become unable to cope with the task.
The string library is faster, so it makes more sense for me to do it without regex, as these files are going to contain thousands of rows. The files contain data that we are buying from a data provider, so I have no choice about the input format.
And to handle format changes, I'm putting the whole parser into a class with constants that store the beginning and ending sequences, so later I can just change the values. That is why I was looking for a different solution: I thought I might be building a whole class unnecessarily when the task is very simple.
Ah, »The file may have several rows like this« does not really sound like thousands ;-) In this case I'd propose to use a generator to produce the output (using yield). Neither TobiMarg's nor my solution then is appropriate.
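
The generator idea from the last comment could look roughly like the sketch below. It handles single-line rows only, and the name iter_rows and its keyword parameters are hypothetical, chosen here so the start and end sequences stay configurable as discussed in the comments.

```python
def iter_rows(lines, start='<<row>>', end='<</row>>', eof='<<eof>>'):
    """Lazily yield one list of fields per complete single-line row."""
    for raw in lines:
        line = raw.strip()
        if line == eof:
            return  # stop reading at the end-of-file marker
        if line.startswith(start) and line.endswith(end):
            # Slice off both tags, then split the remaining text into fields.
            yield line[len(start):-len(end)].split('|')
```

Because rows are produced one at a time, a caller can loop over iter_rows(f) on an open file without holding every row in memory, or wrap it in list() when an array is wanted.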

A good practice is to test for the unwanted cases first and skip them. Once you are sure that you have a compliant line, you process it. Note that the actual processing is not nested inside an if statement. When no row is split across several lines, you need only two tests:

rows = []
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        if not line.startswith('<<row>>'):
            continue
        if not line.endswith('<</row>>'):
            continue
        row = line[7:-8]
        rows.append(row)

When rows are split across several lines, you also need to keep the unfinished part of a row until its closing tag arrives:

rows = []
prev = None  # text of a row whose closing tag has not been seen yet
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        # A continuation line: glue it onto the unfinished row.
        if not line.startswith('<<row>>') and prev is not None:
            line = prev + line
        if not line.startswith('<<row>>'):
            continue
        if not line.endswith('<</row>>'):
            prev = line
            continue
        row = line[7:-8]
        rows.append(row)
        prev = None

If needed, you can split columns with: cols = row.split('|')
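
For files of a few thousand rows, a different technique from the line-by-line loop above is to read the whole text and let a regular expression find each row, including rows that wrap across lines. The sketch below is an assumption-laden alternative, not this answer's method: parse_text and ROW_RE are hypothetical names, the pieces of a wrapped row are joined without any separator, and stray text between rows is silently ignored rather than reported.

```python
import re

# Non-greedy match so each row ends at the first closing tag;
# re.DOTALL lets '.' cross line breaks for rows spanning several lines.
ROW_RE = re.compile(r'<<row>>(.*?)<</row>>', re.DOTALL)


def parse_text(text):
    """Return a list of field lists for every row found before <<eof>>."""
    body, _, _ = text.partition('<<eof>>')  # ignore anything after the marker
    rows = []
    for match in ROW_RE.finditer(body):
        # Join the pieces of a row that was wrapped over several lines.
        joined = ''.join(piece.strip() for piece in match.group(1).splitlines())
        rows.append(joined.split('|'))
    return rows
```

A call such as parse_text(open('newfile.txt').read()) would return the columns of every row already split.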
