I am trying to parse pseudo-English scripts and convert them into another machine-readable language. However, the scripts were written by many people over the years, and each author had their own style of writing.
Some examples:
- On Device 1 Set word 45 and 46 to hex 331
- On Device 1 set words 45 and 46 bits 3..7 to 280
- on Device 1 set word 45 to oct 332
- on device 1 set speed to 60kts Words 3-4 to hex 34
(There are many more variations in the source text.)
The issue is that the scripts are not always logical or consistent.
I have looked at regular expressions and at matching certain words. This works OK, but it gets awkward when I need to know the next word (e.g. for 'Word 24' I would match 'Word' and then try to figure out whether the next token is a number). In the case of 'Words', I need to find the words to set as well as their values.
Example 1 should produce: Set word 45 to hex 331 and Set word 46 to hex 331
or, if possible: Set word 45 to hex 331 and word 46 to hex 331
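To make the desired output concrete, here is a minimal sketch of the expansion for example 1. The pattern and the `expand` helper are my own illustrative assumptions; they only cover the "word(s) N and M to VALUE" shape and deliberately ignore bits, ranges and named fields such as speed.

```python
import re

# Hypothetical pattern: handles only "set word(s) A and B ... to [hex|oct] VALUE".
PATTERN = re.compile(
    r"set\s+words?\s+(?P<words>\d+(?:\s+and\s+\d+)*)"
    r"\s+to\s+(?P<value>(?:(?:hex|oct)\s+)?\w+)",
    re.IGNORECASE,
)

def expand(line):
    """Return one 'Set word N to VALUE' statement per word number in the line."""
    match = PATTERN.search(line)
    if match is None:
        return []  # shape not recognised; a real converter would try other patterns
    numbers = re.findall(r"\d+", match.group("words"))
    return [f"Set word {n} to {match.group('value')}" for n in numbers]

print(expand("On Device 1 Set word 45 and 46 to hex 331"))
# ['Set word 45 to hex 331', 'Set word 46 to hex 331']
```

Lines that do not fit this one shape (such as the "bits 3..7" example) return an empty list, so each additional writing style would need its own pattern.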
I tried using re.findall, but that only gives me the matched words; I then have to find the next word (i.e. the value) manually.
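For reference, `re.findall` returns the contents of capture groups when the pattern has any, so the token after the keyword can be captured in the same match. A small sketch, where the two patterns are assumptions that only fit the simple "word N to VALUE" shape:

```python
import re

line = "On device1: set Word 1  to 88 and word 2 to 2151"

# One capture group: findall returns just the number after "word"/"words".
WORD_NUMBER = re.compile(r"\bwords?\s+(\d+)", re.IGNORECASE)
numbers = WORD_NUMBER.findall(line)
print(numbers)  # ['1', '2']

# Two capture groups: findall returns (word number, value) tuples.
WORD_VALUE = re.compile(r"\bwords?\s+(\d+)\s+to\s+(\w+)", re.IGNORECASE)
pairs = WORD_VALUE.findall(line)
print(pairs)  # [('1', '88'), ('2', '2151')]
```

This avoids the manual lookahead, but it still breaks as soon as something unexpected sits between the word number and "to".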
Alternatively, I could split the string on spaces and process each token manually, for example (assuming the resulting list is sp):

    import re

    sp = ['On', 'device1:', 'set', 'Word', '1', '', 'to', '88', 'and', 'word', '2', 'to', '2151']
    # Stop one short of the end so that sp[i + 1] always exists.
    for i in range(len(sp) - 1):
        if re.search("[Ww]ord", sp[i]):
            print("Found word, next val is ", sp[i + 1])
Is there a better way to do what I want? I looked a little into tokenizing, but I am not sure it would work, as the language is not structured in the first place.
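The kind of tokenizing I have in mind is roughly the following sketch, which labels each piece of a line using one combined regex with named groups. The token classes and the keyword list are assumptions made up for illustration, not a complete grammar for the scripts.

```python
import re

# Hypothetical token classes; order matters, since earlier alternatives win.
TOKEN_SPEC = [
    ("RANGE", r"\d+\s*(?:\.\.|-)\s*\d+"),  # e.g. 3..7 or 3-4
    ("NUMBER", r"\d+"),
    ("KEYWORD", r"(?:on|device|set|words?|bits?|to|and|hex|oct)\b"),
    ("OTHER", r"[A-Za-z_]\w*"),            # anything else, e.g. "speed", "kts"
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)

def tokenize(line):
    """Return (token type, lower-cased text) pairs; unmatched characters are skipped."""
    return [(m.lastgroup, m.group().lower()) for m in TOKEN_RE.finditer(line)]

tokens = tokenize("On Device 1 set words 45 and 46 bits 3..7 to 280")
for token in tokens:
    print(token)
```

With typed tokens, "what comes after 'words'" becomes a question about the next token's type rather than about raw strings, though the inconsistent phrasing would still have to be handled in whatever consumes the tokens.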