
For reasons I do not understand, a REST API I'm using outputs a peculiar structured text format instead of JSON or XML. In its simplest form:

SECTION_NAME    entry  other qualifying bits of the entry
                entry2 other qualifying bits
                ...

The fields are not tab-delimited, as the layout might suggest, but space-delimited, and the qualifying bits may themselves contain spaces. The gap between SECTION_NAME and the entries is also variable, ranging from 1 to several (6 or more) spaces.

Also, one part of the format contains entries in the form

SECTION_NAME entry
  SUB_SECTION more information
  SUB_SECTION2 more information

For reference, an extract of real data (some sections omitted), which shows the use of the structure:

ENTRY       hsa04064                    Pathway
NAME        NF-kappa B signaling pathway - Homo sapiens (human)
DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
            D09692  Veliparib (USAN/INN)
            D09730  Olaparib (JAN/INN)
            D09913  Iniparib (USAN/INN)
REFERENCE   PMID:21772278
  AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
  TITLE     Crosstalk in NF-kappaB signaling pathways.
  JOURNAL   Nat Immunol 12:695-708 (2011)

I'm trying to parse this weird format into something saner (a dictionary which can then be converted to JSON), but I'm unsure how to proceed: splitting blindly on spaces makes a mess (it also breaks up values that contain spaces), and I'm not sure how to detect where a section starts. Is plain text manipulation enough for the job, or should I use more sophisticated methods?

EDIT:

I started using pyparsing for the job, but multi-line records baffle me. Here's an example with DRUG:

 from pyparsing import *
 punctuation = ",.'`&-"
 special_chars = "\()[]"

 drug = Keyword("DRUG")
 drug_content = Word(alphanums) + originalTextFor(OneOrMore(Word(
      alphanums + special_chars))) + ZeroOrMore(LineEnd())
 drug_lines = OneOrMore(drug_content)
 drug_parser = drug + drug_lines

When applied to the first 3 lines of DRUG in the example, I get a wrong result (\n converted to actual line breaks for readability):

 ['DRUG', ['D09347', 'Fostamatinib (USAN)
        D09348  Fostamatinib disodium      (USAN)
        D09692  Veliparib (USAN']]

As you can see, the subsequent entries get lumped all together, while I'd expect:

 ['DRUG', [['D09347', 'Fostamatinib (USAN)'], ["D09348", "Fostamatinib disodium (USAN)"],
           ['D09692', ' Veliparib (USAN)']]]
  • Have you tried splitting on whitespace, but limiting the number of splits? Commented Jul 4, 2012 at 8:28
  • @inspectorG4dget: I thought about it, but the single entries have variable-space requirements (so probably each section would require its specific number of splits) Commented Jul 4, 2012 at 8:41
  • Aha! Then perhaps re.split would be a better choice Commented Jul 4, 2012 at 8:44
  • You asked this on the pyparsing wiki too, see my response: pyparsing.wikispaces.com/message/view/home/55280466 Commented Jul 4, 2012 at 18:58

2 Answers


I'd recommend you use a parser-based approach. For example, Python PLY can be used for the task at hand.


3 Comments

I'll be testing this and Antonio Beamud's solutions shortly. Thanks.
@Einar Another parsing option would be pyparsing.wikispaces.com which has reasonable documentation and plenty of examples
Added a try with pyparsing, getting there for single-line, but still having trouble with multi-line.

The best approach is to use regular expressions, like:

import re

entry_re = re.compile(r'^ENTRY\s+(.*)$')
m = entry_re.search(line)  # `line` is one line of the API response
if m:
    entry = m.group(1).strip()

For lines that do not start with a section name, use the last section name you detected.
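To make the "remember the last section" idea concrete, here is a minimal sketch using only the standard library. It assumes, based on the sample data in the question, that a section name is an upper-case word starting in the first column and that every indented line belongs to the most recent section. The name `parse_sections` is hypothetical, and sub-sections such as AUTHORS are simply kept as raw continuation lines of their parent section.

```python
import re

# Assumption: section names start in column 0 and consist of capitals,
# digits and underscores, followed by one or more spaces and a value.
SECTION_RE = re.compile(r'^([A-Z][A-Z0-9_]*)\s+(.*)$')


def parse_sections(text):
    """Group each line's value under the most recently seen section name."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = SECTION_RE.match(line)
        if match:
            # A new section starts here; remember its name.
            current = match.group(1)
            sections.setdefault(current, []).append(match.group(2).strip())
        elif current is not None:
            # Indented line: it belongs to the last section detected.
            sections[current].append(line.strip())
    return sections


sample = (
    "ENTRY       hsa04064                    Pathway\n"
    "DRUG        D09347  Fostamatinib (USAN)\n"
    "            D09348  Fostamatinib disodium (USAN)\n"
)
sections = parse_sections(sample)
```

With the sample above, `sections["DRUG"]` holds the two drug lines as separate strings, and the resulting dictionary can be passed to `json.dumps` directly.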

A simpler approach is to split on the section name, for example:

vals = line.split('DRUG')
if len(vals) > 1:
    # Everything after the section name is the field value
    drug_field = vals[1].strip()
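For multi-line sections such as DRUG, the same splitting idea can be combined with a limit on the number of splits, as suggested in the comments on the question, so that descriptions containing spaces stay in one piece. The sketch below is only an illustration: `parse_drug_lines` is a hypothetical helper, and it assumes each entry is an identifier followed by a free-text description.

```python
def parse_drug_lines(lines):
    """Turn the lines of a DRUG section into [identifier, description] pairs."""
    pairs = []
    for line in lines:
        # Drop the section name if it is present at the start of the line.
        if line.startswith("DRUG"):
            line = line[len("DRUG"):]
        body = line.strip()
        if body:
            # Split once on any run of whitespace: the identifier comes
            # first, and the rest of the line is the description.
            pairs.append(body.split(None, 1))
    return pairs


drug_lines = [
    "DRUG        D09347  Fostamatinib (USAN)",
    "            D09348  Fostamatinib disodium (USAN)",
    "            D09692  Veliparib (USAN/INN)",
]
drugs = parse_drug_lines(drug_lines)
```

Because the split is limited to one, the variable amount of space between the identifier and the description does not matter, and spaces inside the description are preserved.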
