
Brief version

I have a collection of python code (part of instructional materials). I'd like to build an index of when the various python keywords, builtins and operators are used (and especially: when they are first used). Does it make sense to use ast to get proper tokenization? Is it overkill? Is there a better tool? If not, what would the ast code look like? (I've read the docs but I've never used ast).

Clarification: This is about indexing python source code, not English text that talks about python.

Background

My materials are in the form of IPython notebooks, so even if I had an indexing tool I'd need to do some coding anyway to get the source code out. And I don't have an indexing tool; googling "python index" doesn't turn up anything with the intended sense of "index".

I thought, "it's a simple task, let's write a script for the whole thing so I can do it exactly the way I want". But of course I want to do it right.

The dumb solution: Read the file, tokenize on whitespace and word boundaries, index. But this gets confused by the contents of strings (when does for really get introduced for the first time?), and of course attached operators are not separated: text+="this" will be tokenized as ["text", '+="', "this"]
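For what it's worth, the standard library's tokenize module already avoids both problems: string contents arrive as single STRING tokens, and attached operators are split off cleanly. A minimal sketch of what it reports for the example above:

```python
import io
import tokenize

src = 'text+="this"\n'
# generate_tokens splits operators from names and marks string literals
# as STRING tokens, so neither failure of the naive whitespace split occurs
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'text', OP '+=', STRING '"this"', then NEWLINE and ENDMARKER
```

Each token also carries a start position (row, column), which is exactly what an index needs.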

Next idea: I could use ast to parse the file, then walk the tree and index what I see. This looks like it would involve ast.parse() and ast.walk(). Something like this:

import ast

for source in source_files:
    with open(source) as fp:
        code = fp.read()
    tree = ast.parse(code)
    for node in ast.walk(tree):   # ast nodes have no .walk() method; ast.walk(tree) is the API
        ... # Get node's keyword, identifier etc., and line number -- how?

        print(term, source, line)   # I do know how to make an index

So, is this a reasonable approach? Is there a better one? How should this be done?


1 Answer


Did you search on "index" alone, or for "indexing tool"? I would think that your main problem would be to differentiate a language concept from its natural language use.

I expect that your major difficulty here is not traversing the text, but in the pattern-matching to find these things. For instance, how do you recognize introducing for loops? This would be the word for "near" the word loop, with a for command "soon" after. That command would be a line beginning with for and ending with a colon.

That is just one pattern, albeit one with many variations. However, consider what it takes to differentiate that from a list comprehension, and that from a generator expression (both explicit and built-in).
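For the record, when the input is Python source rather than prose, ast makes exactly this distinction for free: a for statement, a list comprehension and a generator expression each parse to a different node type. A quick illustrative sketch (the snippet strings are made up for the demo):

```python
import ast

snippets = {
    "for loop":       "for x in xs:\n    pass\n",
    "list comp":      "[x for x in xs]\n",
    "generator expr": "(x for x in xs)\n",
}
for label, src in snippets.items():
    # collect the node type names that appear in each parse tree
    kinds = {type(n).__name__ for n in ast.walk(ast.parse(src))}
    print(label, "->", sorted(kinds & {"For", "ListComp", "GeneratorExp"}))
# for loop -> ['For']
# list comp -> ['ListComp']
# generator expr -> ['GeneratorExp']
```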

Will you have directed input? I'm thinking that a list of topics and keywords is essential, not all of which are in the language's terminal tokens -- although a full BNF grammar would likely include them.


Would you consider a mark-up indexing tool? Sometimes, it's easier to place a mark at each critical spot, doing it by hand, and then have the mark-up tool extract an index from that. Such tools have been around for at least 30 years. These are also found with a search for "indexing tools", adding "mark-up" or "tagging" to the search.


Got it. I thought you wanted to parse both, using the code as the primary key for introduction. My mistake. Too much contact with the Society for Technical Communication. :-)

Yes, AST is overkill -- internally. Externally -- it works, it gives you a tree including those critical non-terminal tokens (such as "list comprehension"), and it's easy to get given a BNF and the input text.

This would give you a sequential list of parse trees. Your coding would consist of traversing the trees to make an index of each new concept from your input list. Once you find each concept, you index the instance, remove it from the input list, and continue until you run out of sample code or input items.
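To make that concrete, here is one possible shape of that traversal with ast, recording the first file and line where each node type appears (the function and variable names are just for illustration, and node type names stand in for the concept list):

```python
import ast

def first_occurrences(source_files):
    """Map each AST node type name to the first (file, line) it appears at."""
    index = {}
    for path in source_files:
        with open(path) as fp:
            tree = ast.parse(fp.read())
        for node in ast.walk(tree):
            name = type(node).__name__            # e.g. 'For', 'ListComp', 'Lambda'
            line = getattr(node, "lineno", None)  # Module and operator nodes carry no lineno
            if line is not None and name not in index:
                index[name] = (path, line)
    return index
```

Node type names like For and ListComp already cover keywords and the non-terminal concepts; builtins would show up as Name or Call nodes and need one extra check against, say, the builtins module.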


6 Comments

Good point about "indexing", I hadn't thought of that. But I see I didn't make it completely clear that I want to index python code (not the English text, God forbid -- that's a completely different task). Recognizing keywords is trivial, as long as my parser knows not to look inside string literals, docstrings and the like.
To answer your other question, indeed I'll build an index of python keywords and builtins. Things like comprehensions are harder to recognize, but for them I can just scan the text (I know how to do that, it's not part of the question). I hope the question is clearer now. Maybe I should have just asked about how to use ast for tokenization...
Thanks! But I don't have to provide my own BNF for all of python, do I? I thought it comes built-in with ast. Could you provide a code snippet? I don't need any help with building the index, but how do I work with the (presumably) nodes returned by ast.walk()? I've looked at the documentation, and they're... complicated.
Sorry; I haven't worked with ast, just the concepts. My work has been with multi-language parsing, with blended BNFs. "How do I work with" requires a lot more definition -- what is your pseudo-code, and how far did you get with the actual coding? Design discussions shouldn't be in SO questions -- move to chat? As for being complicated, the node format is designed by the input grammar, I think. We need to identify the fields and values to target for you, and simply search the node stream for those. A generator pipeline should do the trick -- once the design is solid.
But my question is a code question, not a design question :-) I'll add framework code to the question, why not. It'll keep others from trying to help with the basic python stuff. Thanks for all your help!
