How to parse a tree from a string in Python?

Question

I have formatted all my university notes as follows:

CourseName: {
    Part 1: {
        I.I - Intro: {
            Topic1: {
                descr1;
                descr2: {
                    2.a;
                    2.b;
                    2.c.
                };
                descr3.
            };
            Topic2: {
                descr: {
                    example.
                }.
            }.
        };
        I.II - NextChapter: {
            Topic3: {
                whatever.
            }.
        }.
    };
    Part 2: {
        II.I - FinalChapter: {
            content.
        }.
    }.
}

I'd like to structure them into a Tree data structure and I've tried doing so, both recursively and iteratively, in the past hours, doing many researches online, but none of my attempts at it is working.

I've already implemented a Node class (with self.__value and a list self.__children and all the useful methods you would expect from it) as well as a Tree class (with self.__nodes as a dictionary and other utility methods), so feel free to use methods such as add_node or add_child in any form of your liking in your answers.

What I'm struggling with is to understand how to structure the function def parseTree(s, l) - that ideally takes as inputs a string s (my notes) and a list l establishing the delimiters i.e. [":{", ";", "}."] or ["{","}"] or similar - and returns a tree object, with each node having as value the text preceding :{ and a list of children (if any) separated by ; in the text.

Any suggestion?

You should reformat them into a parseable form (JSON, yaml, etc) first. — chepner
– chepner, Commented Mar 25, 2018 at 17:49
chepner and Daniel Roseman are right—it's always easier to use an existing parser. But if you do want to do this yourself, I think what you may be missing is that you're 75% of the way to designing a custom recursive descent parser but haven't hit on the idea of implementing it recursively yet. That may help you search. But, on the other hand, your language is more that simple enough that you could write a simple grammar to feed to a parser generator or parser framework instead of writing code, so you may want to search on that instead. — abarnert
– abarnert, Commented Mar 25, 2018 at 17:57
@ekhumoro I know I had to show it but I kinda couldn't because it was a not-running mess, result of trial and error and more of a general idea on how to approach the problem rather than actual code — Flynn Rhodes
– Flynn Rhodes, Commented Mar 26, 2018 at 14:44
@abarnert Thank you so much, this is exactly what I was trying to do and looking for. A whole new world just opened up to me! I'm also willing to reformat my "grammar" to make it as simple and parsable as possible for this to work. But 75% of the way it's a long shot... I'm still quite clueless on how to make progress. — Flynn Rhodes
– Flynn Rhodes, Commented Mar 26, 2018 at 14:49

Daniel Roseman · Accepted Answer · 2018-03-25 17:50:28Z

4

This is actually almost syntactically valid YAML. A simple substitution will make it valid:

data = data.replace(';', ',').replace('.', '')
parsed = yaml.load(data)

answered Mar 25, 2018 at 17:50

Daniel Roseman

602k68 gold badges910 silver badges923 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

chepner Over a year ago

You probably want some more specific replacements, at the very least just ; and . that end a line.

Ajax1234 · Accepted Answer · 2018-03-26 15:16:35Z

2

Assuming your data is stored in a file, you can build a simple class to parse the structure into a dictionary. You can recursively traverse the data by creating a new Notes object for each key found:

file_data = filter(None, [i.strip('\n') for i in open('filename.txt')])
import re
class Notes:
   def __init__(self, token_data):
     self.token_data = token_data
     self.current_dict = {}
     self.current_vals = []
     self.parse()
   def parse(self):
     while True:
       start = next(self.token_data, None)
       if not start or "}" in start:
         break
       if start.endswith('{'):
          note = Notes(self.token_data)
          final_result = filter(lambda x:x, note.current_vals + [note.current_dict]) if note.current_vals else note.current_dict
          self.current_dict[re.findall('[\w\s\-\.]+', re.sub('^\s+', '', start))[0]] = final_result[0] if isinstance(final_result, list) and len(final_result) == 1 else final_result
          self.token_data = note.token_data
       else:
          self.current_vals.append(re.sub('^\s+', '', start))


course_notes = Notes(iter(file_data)).current_dict

Output:

{'CourseName': 
    {'Part 1': 
      {'I.I - Intro': 
         {'Topic1': ['descr1;',
                    'descr3.',
             {'descr2': ['2.a;',
                         '2.b;',
                          '2.c.']
                }
               ],
         'Topic2': {'descr': 'example.'}
                 },
                  'I.II - NextChapter': 
               {'Topic3': 'whatever.'}
             },
       'Part 2':{'II.I - FinalChapter': 'content.'}
     } 
   }

edited Mar 26, 2018 at 15:16

answered Mar 25, 2018 at 17:57

Ajax1234

71.7k9 gold badges67 silver badges110 bronze badges

2 Comments

Flynn Rhodes Over a year ago

This is cool and effective and yes, data is stored in a text file indeed. I just wish I could understand half of the lines of code you've written, haha

Ajax1234 Over a year ago

@FlynnRhodes Glad to help! I added more of an explanation in the post, however, all the code does is create a new class for every occurrence of a data key. The file data is store as an iterator, that way, it can be more easily determined when the data source is exhausted.

Collectives™ on Stack Overflow

How to parse a tree from a string in Python?

2 Answers 2

1 Comment

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related