
Brief version

I have a collection of python code (part of instructional materials). I'd like to build an index of when the various python keywords, builtins and operators are used (and especially: when they are first used). Does it make sense to use ast to get proper tokenization? Is it overkill? Is there a better tool? If not, what would the ast code look like? (I've read the docs but I've never used ast).

Clarification: This is about indexing python source code, not English text that talks about python.

Background

My materials are in the form of IPython notebooks, so even if I had an indexing tool I'd need to do some coding anyway to get the source code out. And I don't have an indexing tool; googling "python index" doesn't turn up anything with the intended sense of "index".

I thought, "it's a simple task, let's write a script for the whole thing so I can do it exactly the way I want". But of course I want to do it right.

The dumb solution: Read the file, tokenize on whitespace and word boundaries, index. But this gets confused by the contents of strings (when does for really get introduced for the first time?), and of course attached operators are not separated: text+="this" will be tokenized as ["text", '+="', "this"]
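For what it's worth, the standard library's tokenize module already avoids both problems: string contents arrive as single STRING tokens, and attached operators are split off cleanly. A minimal sketch of what it reports for the example above:

```python
import io
import tokenize

src = 'text+="this"\n'
# generate_tokens splits operators from names and marks string literals
# as STRING tokens, so neither failure of the naive whitespace split occurs
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'text', OP '+=', STRING '"this"', then NEWLINE and ENDMARKER
```

Each token also carries a start position (row, column), which is exactly what an index needs.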

Next idea: I could use ast to parse the file, then walk the tree and index what I see. This looks like it would involve ast.parse() and ast.walk(). Something like this:

import ast

for source in source_files:
    with open(source) as fp:
        code = fp.read()
    tree = ast.parse(code)
    for node in ast.walk(tree):   # ast nodes have no .walk() method; ast.walk(tree) is the API
        ... # Get node's keyword, identifier etc., and line number -- how?

        print(term, source, line)   # I do know how to make an index

So, is this a reasonable approach? Is there a better one? How should this be done?


1 Answer


Did you search on "index" alone, or for "indexing tool"? I would think that your main problem would be to differentiate a language concept from its natural language use.

I expect that your major difficulty here is not traversing the text, but in the pattern-matching to find these things. For instance, how do you recognize introducing for loops? This would be the word for "near" the word loop, with a for command "soon" after. That command would be a line beginning with for and ending with a colon.

That is just one pattern, albeit one with many variations. However, consider what it takes to differentiate that from a list comprehension, and that from a generator expression (both explicit and built-in).
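For the record, when the input is Python source rather than prose, ast makes exactly this distinction for free: a for statement, a list comprehension and a generator expression each parse to a different node type. A quick illustrative sketch (the snippet strings are made up for the demo):

```python
import ast

snippets = {
    "for loop":       "for x in xs:\n    pass\n",
    "list comp":      "[x for x in xs]\n",
    "generator expr": "(x for x in xs)\n",
}
for label, src in snippets.items():
    # collect the node type names that appear in each parse tree
    kinds = {type(n).__name__ for n in ast.walk(ast.parse(src))}
    print(label, "->", sorted(kinds & {"For", "ListComp", "GeneratorExp"}))
# for loop -> ['For']
# list comp -> ['ListComp']
# generator expr -> ['GeneratorExp']
```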

Will you have directed input? I'm thinking that a list of topics and keywords is essential, not all of which are in the language's terminal tokens -- although a full BNF grammar would likely include them.


Would you consider a mark-up indexing tool? Sometimes, it's easier to place a mark at each critical spot, doing it by hand, and then have the mark-up tool extract an index from that. Such tools have been around for at least 30 years. These are also found with a search for "indexing tools", adding "mark-up" or "tagging" to the search.


Got it. I thought you wanted to parse both, using the code as the primary key for introduction. My mistake. Too much contact with the Society for Technical Communication. :-)

Yes, AST is overkill -- internally. Externally -- it works, it gives you a tree including those critical non-terminal tokens (such as "list comprehension"), and it's easy to get given a BNF and the input text.

This would give you a sequential list of parse trees. Your coding would consist of traversing the trees to make an index of each new concept from your input list. Once you find each concept, you index the instance, remove it from the input list, and continue until you run out of sample code or input items.
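To make that concrete, here is one possible shape of that traversal with ast, recording the first file and line where each node type appears (the function and variable names are just for illustration, and node type names stand in for the concept list):

```python
import ast

def first_occurrences(source_files):
    """Map each AST node type name to the first (file, line) it appears at."""
    index = {}
    for path in source_files:
        with open(path) as fp:
            tree = ast.parse(fp.read())
        for node in ast.walk(tree):
            name = type(node).__name__            # e.g. 'For', 'ListComp', 'Lambda'
            line = getattr(node, "lineno", None)  # Module and operator nodes carry no lineno
            if line is not None and name not in index:
                index[name] = (path, line)
    return index
```

Node type names like For and ListComp already cover keywords and the non-terminal concepts; builtins would show up as Name or Call nodes and need one extra check against, say, the builtins module.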


6 Comments

Good point about "indexing", I hadn't thought of that. But I see I didn't make it completely clear that I want to index python code (not the English text, God forbid -- that's a completely different task). Recognizing keywords is trivial, as long as my parser knows not to look inside string literals, docstrings and the like.
To answer your other question, indeed I'll build an index of python keywords and builtins. Things like comprehensions are harder to recognize, but for them I can just scan the text (I know how to do that, it's not part of the question). I hope the question is clearer now. Maybe I should have just asked about how to use ast for tokenization...
Thanks! But I don't have to provide my own BNF for all of python, do I? I thought it comes built-in with ast. Could you provide a code snippet? I don't need any help with building the index, but how do I work with the (presumably) nodes returned by ast.walk()? I've looked at the documentation, and they're... complicated.
Sorry; I haven't worked with ast, just the concepts. My work has been with multi-language parsing, with blended BNFs. "How do I work with" requires a lot more definition -- what is your pseudo-code, and how far did you get with the actual coding? Design discussions shouldn't be in SO questions -- move to chat? As for being complicated, the node format is designed by the input grammar, I think. We need to identify the fields and values to target for you, and simply search the node stream for those. A generator pipeline should do the trick -- once the design is solid.
But my question is a code question, not a design question :-) I'll add framework code to the question, why not. It'll keep others from trying to help with the basic python stuff. Thanks for all your help!
