Parsing a structured text file in Python (pyparsing)

Question

For reasons I really do not understand, a REST API I'm using outputs a peculiar structured text format instead of JSON or XML. In its simplest form, it looks like this:

SECTION_NAME    entry  other qualifying bits of the entry
                entry2 other qualifying bits
                ...

The fields are not tab-delimited, as the layout might suggest, but space-delimited, and the qualifying bits may themselves contain spaces. The gap between SECTION_NAME and the entries is also variable, ranging from one to several (6 or more) spaces.

Also, one part of the format contains entries in the form

SECTION_NAME entry
  SUB_SECTION more information
  SUB_SECTION2 more information

For reference, an extract of real data (some sections omitted), which shows the use of the structure:

ENTRY       hsa04064                    Pathway
NAME        NF-kappa B signaling pathway - Homo sapiens (human)
DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
            D09692  Veliparib (USAN/INN)
            D09730  Olaparib (JAN/INN)
            D09913  Iniparib (USAN/INN)
REFERENCE   PMID:21772278
  AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
  TITLE     Crosstalk in NF-kappaB signaling pathways.
  JOURNAL   Nat Immunol 12:695-708 (2011)

I'm trying to parse this weird format into something saner (a dictionary which can then be converted to JSON), but I'm unsure how to proceed: splitting blindly on spaces makes a mess (it also breaks up values that contain spaces), and I don't know how to tell where a section starts. Is plain text manipulation enough for the job, or should I use more sophisticated methods?

EDIT:

I started using pyparsing for the job, but multi-line records baffle me. Here's an example with DRUG:

 from pyparsing import *
 punctuation = ",.'`&-"
 special_chars = r"\()[]"  # raw string, so the backslash is not read as an escape

 drug = Keyword("DRUG")
 drug_content = Word(alphanums) + originalTextFor(OneOrMore(Word(
      alphanums + special_chars))) + ZeroOrMore(LineEnd())
 drug_lines = OneOrMore(drug_content)
 drug_parser = drug + drug_lines

When applied to the first 3 lines of DRUG in the example, I get the wrong result (\n converted to actual line breaks for readability):

 ['DRUG', ['D09347', 'Fostamatinib (USAN)
        D09348  Fostamatinib disodium      (USAN)
        D09692  Veliparib (USAN']]

As you can see, the subsequent entries get lumped all together, while I'd expect:

 ['DRUG', [['D09347', 'Fostamatinib (USAN)'], ["D09348", "Fostamatinib disodium (USAN)"],
           ['D09692', ' Veliparib (USAN)']]]

Have you tried splitting on whitespace, but limiting the number of splits?
– inspectorG4dget, Commented Jul 4, 2012 at 8:28
@inspectorG4dget: I thought about it, but the single entries have variable-space requirements (so probably each section would require its specific number of splits)
– Einar, Commented Jul 4, 2012 at 8:41
You asked this on the pyparsing wiki too, see my response: pyparsing.wikispaces.com/message/view/home/55280466
– PaulMcG, Commented Jul 4, 2012 at 18:58

Mihai Maruseac · Accepted Answer · 2012-07-04 08:38:46Z

4

I'd recommend you use a parser-based approach. For example, Python PLY can be used for the task at hand.

3 Comments

Einar Over a year ago

I'll be testing this and Antonio Beamud's solutions shortly. Thanks.

Jon Clements Over a year ago

@Einar Another parsing option would be pyparsing (pyparsing.wikispaces.com), which has reasonable documentation and plenty of examples

Einar Over a year ago

Added a try with pyparsing, getting there for single-line, but still having trouble with multi-line.

Antonio Beamud · Accepted Answer · 2012-07-04 08:40:54Z

1

The best approach is to use regular expressions, like:

import re

entry_re = re.compile(r'^ENTRY\s+(.*)$')
m = entry_re.search(line)  # line is the current line of the response
if m:
    value = m.group(1).strip()

For lines without a section name, you should reuse the last section name you detected.
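
A minimal sketch of that idea, under a few assumptions that the question's sample suggests but does not guarantee: section names start in the first column and consist of capital letters, digits and underscores, while continuation lines start with whitespace. The name `parse_sections` is made up for illustration. Indented sub-sections such as AUTHORS are not treated specially here; they are simply kept as text under the enclosing section.

```python
import re

# A section line starts in column 0 with an upper-case name, optionally
# followed by whitespace and the rest of the line (assumed layout).
SECTION_RE = re.compile(r'^([A-Z][A-Z0-9_]*)(?:\s+(.*))?$')


def parse_sections(text):
    """Map each section name to the list of its entry lines (hypothetical helper)."""
    sections = {}
    current = None  # last section name seen
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        m = SECTION_RE.match(line)
        if m:
            current = m.group(1)
            sections.setdefault(current, [])
            value = (m.group(2) or "").strip()
        else:
            value = line.strip()  # continuation line: reuse the last section
        if current is not None and value:
            sections[current].append(value)
    return sections


sample = """\
ENTRY       hsa04064                    Pathway
NAME        NF-kappa B signaling pathway - Homo sapiens (human)
DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
"""
result = parse_sections(sample)
print(result["DRUG"])
```

The resulting dictionary can be passed to `json.dumps` as is; splitting each entry further (for example into an identifier and a name) is left to a per-section step.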

A simpler approach is to split on the section name, for example:

vals = line.split('DRUG')
if len(vals) > 1:
    drug_field = vals[1].strip()
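
Note that `line.split('DRUG')` also splits on any later occurrence of the text DRUG within the line. A hedged alternative, along the lines of the limited-split suggestion in the comments, is to check the prefix with `startswith` and then split each entry only once on whitespace. The helper name `split_drug_lines` is made up for illustration, and the sketch assumes every DRUG entry is an identifier followed by a free-text name.

```python
def split_drug_lines(lines):
    """Return [id, name] pairs for the DRUG section (hypothetical helper)."""
    entries = []
    in_drug = False
    for line in lines:
        if line.startswith("DRUG"):
            in_drug = True
            line = line[len("DRUG"):]  # drop the section name
        elif line[:1].isspace():
            pass  # continuation line: stay in the current section
        else:
            in_drug = False  # any other section name ends DRUG
        if in_drug and line.strip():
            # maxsplit=1 keeps spaces inside the drug name intact
            drug_id, name = line.split(None, 1)
            entries.append([drug_id, name.strip()])
    return entries


lines = [
    "NAME        NF-kappa B signaling pathway - Homo sapiens (human)",
    "DRUG        D09347  Fostamatinib (USAN)",
    "            D09348  Fostamatinib disodium (USAN)",
    "            D09692  Veliparib (USAN/INN)",
    "REFERENCE   PMID:21772278",
]
pairs = split_drug_lines(lines)
print(pairs)
```

Because the split is limited to one, multi-word names such as "Fostamatinib disodium (USAN)" stay in one piece, which is the nested shape the question asks for.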
