I am trying to parse an English verb lexicon in order to build an NLP application in Python, so I need to combine it with my NLTK scripts. The lexicon is a Lisp-readable file of property lists, but I need it in an easier format, such as a JSON file or a pandas DataFrame.

An example from the lexicon:

;; Grid: 51.2#1#_th,src#abandon#abandon#abandon#abandon+ingly#(1.5,01269572,01188040,01269413,00345378)(1.6,01524319,01421290,01524047,00415625)###AD

(
 :DEF_WORD "abandon"
 :CLASS "51.2"
 :WN_SENSE (("1.5" 01269572 01188040 01269413 00345378)
            ("1.6" 01524319 01421290 01524047 00415625))
 :PROPBANK ("arg1 arg2")
 :THETA_ROLES ((1 "_th,src"))
 :LCS (go loc (* thing 2)
          (away_from loc (thing 2) (at loc (thing 2) (* thing 4)))
          (abandon+ingly 26))
 :VAR_SPEC ((4 :optional) (2 (animate +)))
)

;; Grid: 45.4.a#1#_ag_th,instr(with)#abase#abase#abase#abase+ed#(1.5,01024949)(1.6,01228249)###AD

(
 :DEF_WORD "abase"
 :CLASS "45.4.a"
 :WN_SENSE (("1.5" 01024949)
            ("1.6" 01228249))
 :PROPBANK ("arg0 arg1 arg2(with)")
 :THETA_ROLES ((1 "_ag_th,instr(with)"))
 :LCS (cause (* thing 1)
       (go ident (* thing 2)
           (toward ident (thing 2) (at ident (thing 2) (abase+ed 9))))
       ((* with 19) instr (*head*) (thing 20)))
 :VAR_SPEC ((1 (animate +)))
)

The complete data is available here: https://raw.githubusercontent.com/ihmc/LCS/master/verbs-English.lcs

I tried the approach from the post "Parsing a lisp file with Python", using something like the code below, but the result is not in the format I am looking for:

inputdata = '''
(
 :DEF_WORD "abandon"
 :CLASS "51.2"
 :WN_SENSE (("1.5" 01269572 01188040 01269413 00345378)
            ("1.6" 01524319 01421290 01524047 00415625))
 :PROPBANK ("arg1 arg2")
 :THETA_ROLES ((1 "_th,src"))
 :LCS (go loc (* thing 2)
          (away_from loc (thing 2) (at loc (thing 2) (* thing 4)))
          (abandon+ingly 26))
 :VAR_SPEC ((4 :optional) (2 (animate +)))
)


(
 :DEF_WORD "abase"
 :CLASS "45.4.a"
 :WN_SENSE (("1.5" 01024949)
            ("1.6" 01228249))
 :PROPBANK ("arg0 arg1 arg2(with)")
 :THETA_ROLES ((1 "_ag_th,instr(with)"))
 :LCS (cause (* thing 1)
       (go ident (* thing 2)
           (toward ident (thing 2) (at ident (thing 2) (abase+ed 9))))
       ((* with 19) instr (*head*) (thing 20)))
 :VAR_SPEC ((1 (animate +)))
)'''

from pyparsing import OneOrMore, nestedExpr

data = OneOrMore(nestedExpr()).parseString(inputdata)
print(data)

I got an output like this:

[
  [ ':DEF_WORD', '"abandon"', 
    ':CLASS', '"51.2"', 
    ':WN_SENSE', [
                    ['"1.5"', '01269572', '01188040', '01269413', '00345378'], 
                    ['"1.6"', '01524319', '01421290', '01524047', '00415625']
                 ],
    ':PROPBANK', ['"arg1 arg2"'],
    ':THETA_ROLES', [['1', '"_th,src"']],
    ':LCS', ['go', 'loc', ['*', 'thing', '2'], 
          ['away_from', 'loc', ['thing', '2'], 
          ['at', 'loc', ['thing', '2'], ['*', 'thing', '4']]], ['abandon+ingly', '26']],
    ':VAR_SPEC', [['4', ':optional'], ['2', ['animate', '+']]]]
  ,     
  [':DEF_WORD', '"abase"', 
    ':CLASS', '"45.4.a"', 
    ':WN_SENSE', [
                    ['"1.5"', '01024949'],
                    ['"1.6"', '01228249']
                ], 
    ':PROPBANK', ['"arg0 arg1 arg2(with)"'], 
    ':THETA_ROLES', [['1', '"_ag_th,instr(with)"']],
    ':LCS', ['cause', ['*', 'thing', '1'], 
              ['go', 'ident', ['*', 'thing', '2'], 
              ['toward', 'ident', ['thing', '2'], 
              ['at', 'ident', ['thing', '2'],
              ['abase+ed', '9']]]],
              [['*', 'with', '19'], 'instr', ['*head*'], ['thing', '20']]], 
    ':VAR_SPEC', [['1', ['animate', '+']]]
  ]
]

I am not sure how to work with this output to get, e.g., the THETA_ROLES value or other characteristics of a verb in this lexicon. I have all my sentences in an array using pandas and NLTK, so the idea is to look for sentences containing verbs with a specific THETA_ROLES value or other characteristics present in this lexicon.

1 Answer

Each group in the data you have gotten is a flat sequence of alternating keys and values. That is, you have something of the form ["A", 1, "B", 2], but you want a dict like {"A": 1, "B": 2}.

Here is a generator that yields the items of such a flat sequence as (key, value) pairs:

def pairs(seq):
    for x, y in zip(seq[::2], seq[1::2]):
        yield (x, y)

print(dict(pairs(["A", 1, "B", 2])))

Use that function to convert each parsed group into a Python dict, from which you can then easily extract values by key.

for group in data:
    groupdict = dict(pairs(group))
    print(groupdict[":THETA_ROLES"])
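
To get from there to the JSON or pandas form the question asks for, something along these lines may help. It is only a sketch under a few assumptions: the helper names (strip_lisp_comments, clean, to_record) are made up for illustration, the parsed result is assumed to have been turned into plain lists first (pyparsing's data.asList() does that), and the input here is a hand-copied, shortened version of the output shown in the question. The comment-stripping helper assumes that every line starting with ";" is a comment, as the ";; Grid:" lines in the sample are; it would be applied to the file text before parsing.

```python
import json


def strip_lisp_comments(text):
    """Drop lines that start with ';' (assumed to be ';; Grid: ...' comments)."""
    return "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith(";")
    )


def clean(value):
    """Recursively remove the surrounding double quotes from string atoms."""
    if isinstance(value, list):
        return [clean(v) for v in value]
    return value.strip('"')


def to_record(group):
    """Turn a flat [':KEY', value, ...] list into a dict keyed by 'KEY'."""
    keys = group[::2]
    values = group[1::2]
    return {k.lstrip(":"): clean(v) for k, v in zip(keys, values)}


# Hand-copied, shortened stand-in for data.asList() from the question.
parsed = [
    [":DEF_WORD", '"abandon"', ":CLASS", '"51.2"',
     ":THETA_ROLES", [["1", '"_th,src"']]],
    [":DEF_WORD", '"abase"', ":CLASS", '"45.4.a"',
     ":THETA_ROLES", [["1", '"_ag_th,instr(with)"']]],
]

records = [to_record(group) for group in parsed]

# The records are plain dicts and lists, so they serialize to JSON directly.
print(json.dumps(records, indent=2))

# Index the verbs by their theta-role string for quick lookups.
by_roles = {}
for rec in records:
    for _, roles in rec["THETA_ROLES"]:
        by_roles.setdefault(roles, []).append(rec["DEF_WORD"])

print(by_roles["_th,src"])
```

Because records is a list of dicts, it can also be passed to pandas.DataFrame(records) to get one row per verb; nested values such as LCS would stay as Python lists inside the cells.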