Tricky string parsing with Python

Question

I have a text file like this:

ID = 31
Ne = 5122
============
List of 104 four tuples:
1    2    12    40
2    3    4     21
.
.
51   21   41    42   

ID = 34
Ne = 5122
============
List of 104 four tuples:
3    2    12    40
4    3    4     21
.
.

The four-tuples are tab delimited.

For each ID, I'm trying to make a dictionary with the ID being the key and the four-tuples (in list/tuple form) as elements of that key.

 dict = {31: [(1,2,12,40), (2,3,4,21), ...], 34: [(3,2,12,40), (4,3,4,21), ...]}

My string parsing knowledge is limited to reading the lines with file.readlines() and using str.replace() and str.split() on 'ID = '. But there has to be a better way. Here are some beginnings of what I have.

with open('text.txt', 'r') as file:
    fp = file.readlines()
B = []
for x in fp:
    # str.replace returns a new string, so the result has to be reassigned
    x = x.replace('\t', ',')
    x = x.replace('\n', ')')
    B.append(x)

You could try to write a grammar using a lib like pyparsing or ply.
– bufh, Commented Jul 22, 2015 at 20:37
If you're still around, could you mark one of the answers to this question as correct?
– Engineero, Commented May 24, 2017 at 16:45

gkusner · Accepted Answer · 2015-07-22 20:03:19Z

2

Something like this:

ll = []
for line in fp:
    tt = tuple(int(x) for x in line.split())
    ll.append(tt)

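That will produce a list of tuples to assign to the key in your dictionary. It assumes fp holds only the numeric rows; a header line such as "ID = 31" would make int() raise a ValueError, so skip or handle those lines first.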
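A fuller sketch of the same idea, assuming the file layout shown in the question. The helper name `parse_blocks` and the inline sample lines are made up for illustration and are not part of the original answer; rows that fail integer conversion are simply skipped.

```python
from collections import defaultdict

def parse_blocks(lines):
    """Group integer rows under the most recent 'ID = N' line (hypothetical helper)."""
    data = defaultdict(list)
    current_id = None
    for line in lines:
        line = line.strip()
        if line.startswith("ID = "):
            current_id = int(line[len("ID = "):])
            continue
        try:
            tt = tuple(int(x) for x in line.split())
        except ValueError:
            continue  # "Ne = ...", "====", "List of ..." and "." rows
        if tt and current_id is not None:  # ignore blank lines and rows before any ID
            data[current_id].append(tt)
    return dict(data)

# Made-up sample mirroring the question's layout
sample = [
    "ID = 31", "Ne = 5122", "============", "List of 104 four tuples:",
    "1\t2\t12\t40", "2\t3\t4\t21", "",
    "ID = 34", "Ne = 5122", "============", "List of 104 four tuples:",
    "3\t2\t12\t40", "4\t3\t4\t21",
]
result = parse_blocks(sample)
print(result)
```

With a real file you would pass the open file object (or the fp list from the question) instead of `sample`.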


bbill · Accepted Answer · 2015-07-23 18:57:51Z

2

Python's great for this stuff, why not write up a 5-10 liner for it? It's kind of what the language is meant to excel at.

$ cat test
ID = 31
Ne = 5122
============
List of 104 four tuples:
1       2       12      40
2       3       4       21

ID = 34
Ne = 5122
============
List of 104 four tuples:
3       2       12      40
4       3       4       21


data = {}
with open('test') as f:
    text = f.read()
for block in text.split('ID = '):
    if not block.strip():
        continue  # nothing before the first 'ID = '
    lines = block.split('\n')
    ID = int(lines[0])
    # lines[1:4] are the 'Ne', '====' and 'List of ...' headers; data starts at lines[4]
    tups = [list(map(int, filter(None, line.split('\t')))) for line in lines[4:]]
    data[ID] = tuple(filter(None, tups))  # drop the empty rows left by blank lines
print(data)

# {31: ([1, 2, 12, 40], [2, 3, 4, 21]), 34: ([3, 2, 12, 40], [4, 3, 4, 21])}

The only annoying thing is all the filters - sorry, they're just there to drop the empty strings and empty rows that come from extra tabs and newlines. For a one-off little script, it's no biggie.
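If the filtering bothers you, one alternative (a sketch, not what the code above does) is to let a regular expression pick out only the rows made of four integers, so blank lines and headers never need to be filtered. The names `ROW` and `parse_text` and the sample string are invented for this example, and it assumes every data row has exactly four non-negative integers.

```python
import re

# One line holding exactly four integers separated by spaces or tabs
ROW = re.compile(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", re.MULTILINE)

def parse_text(text):
    """Map each ID to its list of four-tuples (hypothetical helper)."""
    data = {}
    # With a capturing group, re.split returns [prefix, id1, body1, id2, body2, ...]
    parts = re.split(r"^ID = (\d+)[ \t]*$", text, flags=re.MULTILINE)
    for id_str, body in zip(parts[1::2], parts[2::2]):
        data[int(id_str)] = [tuple(map(int, m.groups())) for m in ROW.finditer(body)]
    return data

# Made-up sample mirroring the test file above
sample_text = (
    "ID = 31\nNe = 5122\n============\nList of 104 four tuples:\n"
    "1\t2\t12\t40\n2\t3\t4\t21\n\n"
    "ID = 34\nNe = 5122\n============\nList of 104 four tuples:\n"
    "3\t2\t12\t40\n4\t3\t4\t21\n"
)
parsed = parse_text(sample_text)
print(parsed)
```

The trade-off is that the pattern hard-codes the four-column shape, whereas the split-based version accepts rows of any width.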

edited Jul 23, 2015 at 18:57

answered Jul 22, 2015 at 20:38


2 Comments

Niles Bernoulli Over a year ago

hey this worked excellently. i needed the tuples as ints, so i made a quick lambda func: lambda A: [int(x) for x in A] and it looked great. thank you for this!

bbill Over a year ago

Cool, forgot about that detail. I added the map in there.
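For reference, the conversion from the comment can be sketched as below. The lambda body is the one quoted in the comment; the name `to_ints` and the sample rows are made up for illustration.

```python
# The lambda quoted in the comment, bound to a name so it can be reused
to_ints = lambda A: [int(x) for x in A]

# Made-up rows of strings, as str.split() would return them
rows = [["1", "2", "12", "40"], ["2", "3", "4", "21"]]
converted = [to_ints(row) for row in rows]
print(converted)
```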

Engineero · Accepted Answer · 2015-07-22 21:35:51Z

I think this will do the trick for you:

def parse_file(filename):
  """
  Parses an input data file containing tags of the form "ID = ##" (where ## is a
  number) followed by rows of data. Returns a dictionary where the ID numbers
  are the keys and all of the rows of data are stored as a list of tuples
  associated with the key.

  Args:
    filename (string): name of the file you want to parse

  Returns:
    my_dict (dictionary): dictionary of data with ID numbers as keys

  """
  my_dict = {}
  with open(filename, "r") as my_file:  # handles opening and closing file
    for row in my_file:
      if "ID = " in row:
        my_key = int(row.split("ID = ")[1])  # grab the ID number
        my_list = []  # initialize a new data list for a new ID
        my_dict[my_key] = my_list  # store it now; rows are appended below
      elif row != "\n":  # skip rows that only have newline char
        try:  # if this fails, we don't have a valid data line
          my_list.append(tuple(int(x) for x in row.split()))
        except ValueError:
          continue  # header rows such as "Ne = 5122" are skipped
  return my_dict

I made it a function so that you can call it from anywhere, just passing the filename. It makes assumptions about the file format, but if the file format is always what you showed us here, it should work for you. You would call it on your data.txt file like:

a_dictionary = parse_file("data.txt")

I tested it on the data that you gave us and it seems to work just fine after deleting the "..." rows.

Edit: I noticed one small bug. As written, it will add an empty tuple in place of a new line character ("\n") wherever that appears alone on a line. To fix this, put the try: and except: clauses inside of this:

elif row != "\n":  # skips rows that only contain newline char

I added this to the full code above as well.
