Parsing a string pattern (Python)

Question

I have a file with the following data:

<<row>>12|xyz|abc|2.34<</row>>
<<eof>>

The file may have several rows like this. I am trying to design a parser that parses each row in this file and returns an array of all rows. What would be the best way of doing it? The code has to be written in Python. It should either skip rows that do not start with <<row>> or raise an error.

=======> UPDATE <========

I just found that a particular <<row>> can span multiple lines. So my code and the code present below aren't working anymore. Can someone please suggest an efficient solution?

The data files can contain hundreds to several thousands of rows.

Looks like a pretty straightforward task. Where are you having problems?
– Cristian Lupascu, Commented May 27, 2013 at 19:04
It is a simple task, I know, but I want to know how a different programmer would solve it.
– user2426021, Commented May 27, 2013 at 19:14
Post the solution you already have. You will get advice on how to improve it.
– Mike Müller, Commented May 27, 2013 at 19:18
While working with the code, I found that rows in the data files are not restricted to one line. A particular <<row>> can span multiple lines, so my code isn't working anymore, and neither are the answers below. Can you please help? Should I re-post this as a new question, or edit the question?
– user2426021, Commented May 27, 2013 at 22:05

TobiMarg · Accepted Answer · 2013-05-27 19:18:39Z

1

A simple way without regular expressions:

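output = []
with open('input.txt', 'r') as f:
    for line in f:
        # strip the trailing newline first, otherwise '<<eof>>' never compares equal
        line = line.strip()
        if line == '<<eof>>':
            break
        elif not line.startswith('<<row>>'):
            continue
        else:
            # cut off '<<row>>' (7 chars) and '<</row>>' (8 chars), then split the fields
            output.append(line[7:-8].split('|'))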

This uses every line starting with <<row>> until it reaches a line that contains only <<eof>>.
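The question also allows raising an error instead of skipping bad lines. A minimal sketch of that stricter variant follows. The function name parse_rows_strict, the choice of ValueError, and the decision to skip blank lines are assumptions for illustration, not part of the answer above. It takes any iterable of lines, so an open file object can be passed directly.

```python
def parse_rows_strict(lines):
    """Parse single-line rows; raise ValueError on any malformed line."""
    START, END, EOF = '<<row>>', '<</row>>', '<<eof>>'
    rows = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if line == EOF:
            break
        if not line:
            continue  # assumption: blank lines are harmless and skipped
        if not (line.startswith(START) and line.endswith(END)):
            raise ValueError('line %d is not a complete row: %r' % (number, line))
        # slice off the opening and closing tags, then split the fields
        rows.append(line[len(START):-len(END)].split('|'))
    return rows
```

With a file this would be called as parse_rows_strict(f) inside the with block.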


Alfe · Accepted Answer · 2013-05-27 19:17:06Z

1

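import re

def parseFile(fileName):
  with open(fileName) as f:

    def parseLine(line):
      # '$' also matches just before a trailing newline, so no strip is needed
      m = re.match(r'<<row>>(\d+)\|(\w+)\|(\w+)\|([\d\.]+)<</row>>$', line)
      if m:
        return m.groups()

    # keep only lines that start with <<row>> and match the full pattern
    return [ values for values in (
      parseLine(line)
        for line in f
        if line.startswith('<<row>>')) if values ]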

And? Am I different? ;-)


5 Comments

user2426021 Over a year ago

I guess. But doing it without using regular expressions is better I believe.

Alfe Over a year ago

How come you believe this? A regexp is more general. Using split etc. is always a special-purpose solution for a special case. If a slightly modified version of the format pops up in the future, adjusting the regexp is a cinch, whereas a version built on simpler parsing mechanisms quickly becomes unable to cope with the task.

user2426021 Over a year ago

The string library is faster. So it would make more sense for me to do it without regex, as these files are going to contain thousands of rows. The files contain data that we are buying from a data provider, so I have no choice in terms of input data format.

user2426021 Over a year ago

And to solve the issue of version updates, I'm putting the whole parser into a class with constants that store the beginning and ending sequences, so later I can just change the values. That is why I was looking for a different solution: I thought I might be unnecessarily making a whole class when the task is very simple.

Alfe Over a year ago

Ah, »The file may have several rows like this« does not really sound like thousands ;-) In this case I'd propose to use a generator to produce the output (using yield). Neither TobiMarg's nor my solution then is appropriate.
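A rough sketch of the generator idea from the last comment, for single-line rows only. The name iter_rows is hypothetical, and the slicing is borrowed from TobiMarg's answer rather than from any code Alfe posted. It accepts any iterable of lines, so an open file can be passed in, and it yields one row at a time instead of building the whole list in memory.

```python
def iter_rows(lines):
    """Yield each single-line row as a list of fields, one at a time."""
    for raw in lines:
        line = raw.strip()
        if line == '<<eof>>':
            return  # stop at the end-of-file marker
        if line.startswith('<<row>>') and line.endswith('<</row>>'):
            # 7 and 8 are the lengths of '<<row>>' and '<</row>>'
            yield line[7:-8].split('|')
```

Typical use would be a plain loop such as "for row in iter_rows(f):" inside a with open(...) block, so each row is processed as soon as it is read.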

Futal · Accepted Answer · 2020-05-22 19:39:27Z

A good practice is to test for unwanted cases and ignore them. Once you are sure that you have a compliant line, you process it. Note that the actual processing is not in an if statement. Without rows split across several lines, you need only two tests:

rows = list()
with open('newfile.txt') as file:
    for line in file.readlines():
        line = line.strip()
        if not line.startswith('<<row>>'):
            continue
        if not line[-8:] == '<</row>>':
            continue
        row = line[7:-8]
        rows.append(row)

With rows split across several lines, you need to save the previous line in some situations:

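rows = list()
prev = None  # start of a row whose closing tag has not been seen yet
with open('newfile.txt') as file:
    for line in file.readlines():
        line = line.strip()
        # continuation line: glue it onto the pending partial row
        if not line.startswith('<<row>>') and prev is not None:
            line = prev + line
        if not line.startswith('<<row>>'):
            continue
        # row is still open: remember it and wait for the next line
        if not line[-8:] == '<</row>>':
            prev = line
            continue
        row = line[7:-8]
        rows.append(row)
        prev = None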

If needed, you can split columns with: cols = row.split('|')
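Since the files hold at most a few thousand rows, another option is to read the whole text and scan it with str.find. This is only a sketch under some assumptions: parse_text and its parameters are hypothetical names, the file is assumed to fit in memory, rows are matched anywhere in the text (not only at the start of a line), <<eof>> is ignored, and an unterminated row raises ValueError. Continuation lines are stripped and concatenated, as in the loop above.

```python
def parse_text(text, start='<<row>>', end='<</row>>'):
    """Return every row in text as a list of fields; rows may span lines."""
    rows = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no more opening tags
        stop = text.find(end, begin + len(start))
        if stop == -1:
            raise ValueError('unterminated row starting at offset %d' % begin)
        body = text[begin + len(start):stop]
        # join the pieces of a multi-line row: strip each piece, then concatenate
        body = ''.join(piece.strip() for piece in body.splitlines())
        rows.append(body.split('|'))
        pos = stop + len(end)
    return rows
```

With a file this would be called as parse_text(file.read()).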
