9

I have the following lisp file, which is from the UCI machine learning database. I would like to convert it into a flat text file using python. A typical line looks like this:

(1 ((st 8) (pitch 67) (dur 4) (keysig 1) (timesig 12) (fermata 0))((st 12) (pitch 67) (dur 8) (keysig 1) (timesig 12) (fermata 0)))

I would like to parse this into a text file like:

time pitch duration keysig timesig fermata
8    67    4        1      12      0
12   67    8        1      12      0

Is there a python module to intelligently parse this? This is my first time seeing lisp.

3
  • Does Parsing S-Expressions in Python help? Commented Dec 27, 2012 at 17:55
  • 1
    Why not use lisp to convert it to another format? Commented Dec 27, 2012 at 18:02
  • 3
    What's the learning curve involved in learning enough lisp to do that? Commented Dec 27, 2012 at 18:06

5 Answers

24

As shown in this answer, pyparsing appears to be the right tool for that:

inputdata = '(1 ((st 8) (pitch 67) (dur 4) (keysig 1) (timesig 12) (fermata 0))((st 12) (pitch 67) (dur 8) (keysig 1) (timesig 12) (fermata 0)))'

from pyparsing import OneOrMore, nestedExpr

data = OneOrMore(nestedExpr()).parseString(inputdata)
print(data)

# [['1', [['st', '8'], ['pitch', '67'], ['dur', '4'], ['keysig', '1'], ['timesig', '12'], ['fermata', '0']], [['st', '12'], ['pitch', '67'], ['dur', '8'], ['keysig', '1'], ['timesig', '12'], ['fermata', '0']]]]

For completeness' sake, this is how to format the results (using texttable):

from texttable import Texttable

tab = Texttable()
for row in data.asList()[0][1:]:
    row = dict(row)
    tab.header(row.keys())
    tab.add_row(row.values())
print(tab.draw())
+---------+--------+----+-------+-----+---------+
| timesig | keysig | st | pitch | dur | fermata |
+=========+========+====+=======+=====+=========+
| 12      | 1      | 8  | 67    | 4   | 0       |
+---------+--------+----+-------+-----+---------+
| 12      | 1      | 12 | 67    | 8   | 0       |
+---------+--------+----+-------+-----+---------+

To convert that data back to the lisp notation:

def lisp(x):
    return '(%s)' % ' '.join(lisp(y) for y in x) if isinstance(x, list) else x

d = lisp(data.asList()[0])  # asList() yields plain lists, which lisp() checks for
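
To get the flat layout the question asks for, rather than a bordered table, the nested list can be walked directly. This is only a minimal sketch: it assumes the parsed structure has the shape shown above (the chorale number followed by one list of pairs per note), and the names `to_flat_lines` and `COLUMNS` are made up for illustration, not part of pyparsing.

```python
# Nested list in the shape shown in the parse output above.
parsed = ['1',
          [['st', '8'], ['pitch', '67'], ['dur', '4'],
           ['keysig', '1'], ['timesig', '12'], ['fermata', '0']],
          [['st', '12'], ['pitch', '67'], ['dur', '8'],
           ['keysig', '1'], ['timesig', '12'], ['fermata', '0']]]

# Hypothetical mapping from lisp keys to the column titles used in the question.
COLUMNS = [('st', 'time'), ('pitch', 'pitch'), ('dur', 'duration'),
           ('keysig', 'keysig'), ('timesig', 'timesig'), ('fermata', 'fermata')]

def to_flat_lines(chorale):
    """Return a header line plus one space-separated line per note."""
    lines = [' '.join(title for _, title in COLUMNS)]
    for note in chorale[1:]:          # chorale[0] is the chorale number
        fields = dict(note)           # [['st', '8'], ...] -> {'st': '8', ...}
        lines.append(' '.join(fields[key] for key, _ in COLUMNS))
    return lines

flat_lines = to_flat_lines(parsed)
print('\n'.join(flat_lines))
```

Looking fields up by key means the column order does not depend on the order of the pairs inside each note.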

1 Comment

This is definitely the answer since the Op asked for "a python module to intelligently parse this"
2

If you know that the data is correct and the format is uniform (it seems so at first sight), and you need just this data rather than a solution to the general problem, then why not just replace every non-numeric character with a space and then split?

import re
data = open("chorales.lisp").read().split("\n")
data = [re.sub("[^-0-9]+", " ", x) for x in data]
for L in data:
    L = list(map(int, L.split()))  # list() so the result can be sliced on Python 3
    i = 1  # first element is chorale number
    while i < len(L):
        st, pitch, dur, keysig, timesig, fermata = L[i:i+6]
        i += 6
        # ... your processing goes here ...
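
This shortcut depends on every note carrying exactly six numbers, so it may be worth checking that before trusting the result. A small sketch of the same idea with that check added; the function name `notes_from_line` is hypothetical, and the sample line is the one from the question.

```python
import re

def notes_from_line(line):
    """Return a list of (st, pitch, dur, keysig, timesig, fermata) tuples."""
    # Replace everything except digits and minus signs with spaces, then split.
    numbers = [int(tok) for tok in re.sub(r"[^-0-9]+", " ", line).split()]
    body = numbers[1:]                       # drop the leading chorale number
    if len(body) % 6:
        raise ValueError("expected six values per note, got %d numbers" % len(body))
    return [tuple(body[i:i + 6]) for i in range(0, len(body), 6)]

sample = ('(1 ((st 8) (pitch 67) (dur 4) (keysig 1) (timesig 12) (fermata 0))'
          '((st 12) (pitch 67) (dur 8) (keysig 1) (timesig 12) (fermata 0)))')
rows = notes_from_line(sample)
for row in rows:
    print(" ".join(str(value) for value in row))
```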


1

Separate it into pairs with a regular expression:

In [1]: import re

In [2]: txt = '(((st 8) (pitch 67) (dur 4) (keysig 1) (timesig 12) (fermata 0))((st 12) (pitch 67) (dur 8) (keysig 1) (timesig 12) (fermata 0)))'

In [3]: [p.split() for p in re.findall(r'\w+\s+\d+', txt)]
Out[3]: [['st', '8'], ['pitch', '67'], ['dur', '4'], ['keysig', '1'], ['timesig', '12'], ['fermata', '0'], ['st', '12'], ['pitch', '67'], ['dur', '8'], ['keysig', '1'], ['timesig', '12'], ['fermata', '0']]

Then make it into a dictionary:

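data = [p.split() for p in re.findall(r'\w+\s+\d+', txt)]  # the pairs from above
dct = {}
for key, value in data:
    # collect every value seen for a key, in order of appearance
    dct.setdefault(key, []).append(value)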

The result:

In [10]: dct
Out[10]: {'timesig': ['12', '12'], 'keysig': ['1', '1'], 'st': ['8', '12'], 'pitch': ['67', '67'], 'dur': ['4', '8'], 'fermata': ['0', '0']}

Printing:

print('time pitch duration keysig timesig fermata')
for t in range(len(dct['st'])):
    print(dct['st'][t], dct['pitch'][t], dct['dur'][t],
          dct['keysig'][t], dct['timesig'][t], dct['fermata'][t])

Proper formatting is left as an exercise for the reader...
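
One possible take on that exercise, as a sketch only: pad each cell to the width of its column title with `str.ljust`. It assumes no value is wider than its title, which holds for the sample data; the names `keys`, `titles` and `format_row` are made up for this example.

```python
# The dictionary built above, repeated here so the snippet runs on its own.
dct = {'st': ['8', '12'], 'pitch': ['67', '67'], 'dur': ['4', '8'],
       'keysig': ['1', '1'], 'timesig': ['12', '12'], 'fermata': ['0', '0']}

keys = ['st', 'pitch', 'dur', 'keysig', 'timesig', 'fermata']
titles = ['time', 'pitch', 'duration', 'keysig', 'timesig', 'fermata']

def format_row(cells):
    """Left-justify each cell to its column title's width and join with spaces."""
    padded = (cell.ljust(len(title)) for cell, title in zip(cells, titles))
    return ' '.join(padded).rstrip()   # drop the padding after the last column

table = [format_row(titles)]
for t in range(len(dct['st'])):
    table.append(format_row([dct[k][t] for k in keys]))
print('\n'.join(table))
```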


0

Since the data is already in Lisp, use Lisp itself to convert it into a well-known format like CSV or TSV:

    (let ((input '(1 ((ST 8) (PITCH 67) (DUR 4) (KEYSIG 1) (TIMESIG 12) (FERMATA 0))
                    ((ST 12) (PITCH 67) (DUR 8) (KEYSIG 1) (TIMESIG 12) (FERMATA 0)))))
               (let*
                   ((headers (mapcar #'first (cadr input)))
                    (rows (cdr input))
                    (row-data (mapcar (lambda (row) (mapcar #'second row)) rows))
                    (csv (cons headers row-data)))
                 (format t "~{~{~A~^,~}~^~%~}" csv)))
ST,PITCH,DUR,KEYSIG,TIMESIG,FERMATA
8,67,4,1,12,0
12,67,8,1,12,0


0

Don't use pyparsing; it's horribly slow. Roland's suggestion to use re makes a lot more sense, though his answer works only on specific forms of lisp input. Given the question's title, I imagine that a lot of folks come here looking to parse more general structures (in my case I was trying to parse a KiCad file). The following function handles arbitrarily nested lists and quoted strings, and is 50X faster than pyparsing for moderate-sized input:

import re
TOK = re.compile('[()]|[^()" \t\n]+|("([^\\\\\"]|\\\\.)*")')

def Parse(inp):
    p     = 0
    stack = []
    while True:
        m = TOK.search(inp, p)
        g = m.group(0)
        p = m.end()
        #print(len(stack), g)
        if g == '(':
            stack.append([])
        elif g == ')':
            e = stack.pop()
            if not stack:
                return e
            stack[-1].append(e)
        else:
            stack[-1].append(g)

p = Parse('(1 ((st 8) (pitch 67) (dur 4) (keysig 1) (timesig 12) (fermata 0))((st 12) (pitch 67) (dur 8) (keysig 1) (timesig 12) (fermata 0)))')
from pprint import pprint
pprint(p)

The result is a raw parse tree:


['1',
 [['st', '8'],
  ['pitch', '67'],
  ['dur', '4'],
  ['keysig', '1'],
  ['timesig', '12'],
  ['fermata', '0']],
 [['st', '12'],
  ['pitch', '67'],
  ['dur', '8'],
  ['keysig', '1'],
  ['timesig', '12'],
  ['fermata', '0']]]

Then you can put each element in a dictionary and deal with it according to the application. We could probably improve the error handling a bit (it should stop if the space between tokens is not whitespace; it's probably an unterminated string literal).
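
A rough sketch of what that stricter error handling might look like, not a tested replacement: it anchors each token with `match` instead of `search`, so anything other than whitespace between tokens raises an error, and it also reports unbalanced parentheses. The names `parse_sexpr`, `TOKEN` and `WS` are invented for this example.

```python
import re

# The token pattern from above, written as a raw string without capture groups.
TOKEN = re.compile(r'[()]|[^()" \t\n]+|"(?:[^\\"]|\\.)*"')
WS = re.compile(r'\s*')

def parse_sexpr(text):
    """Parse one parenthesised expression into nested lists of strings."""
    pos = 0
    stack = []
    while True:
        pos = WS.match(text, pos).end()     # skip whitespace between tokens
        m = TOKEN.match(text, pos)          # the next token must start exactly here
        if m is None:
            # End of input with open lists, or e.g. an unterminated string literal.
            raise ValueError('unexpected input at offset %d' % pos)
        tok = m.group(0)
        pos = m.end()
        if tok == '(':
            stack.append([])
        elif not stack:
            raise ValueError('expected "(" at offset %d' % m.start())
        elif tok == ')':
            done = stack.pop()
            if not stack:
                return done                 # outermost list is complete
            stack[-1].append(done)
        else:
            stack[-1].append(tok)

tree = parse_sexpr('(1 ((st 8) (pitch 67)))')
print(tree)
```

Like the original function, this returns as soon as the first top-level list closes and ignores anything after it.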

In the case of my 8,000 line KiCad file, here is a time comparison:

  • 2.1 seconds to parse using pyparsing
  • 0.04 seconds to parse using re

Always be skeptical when anyone advises that something is "the right tool" for a job, unless you get paid by the hour. (Definitely get paid by the hour if you're required to work with hyped-up software packages.)

