3

As an educational exercise, I've set out to write a Python lexer in Python. Eventually, I'd like to implement a simple subset of Python that can run itself, so I want this lexer to be written in a reasonably simple subset of Python with as few imports as possible.

The tutorials I have found on lexing, for instance Kaleidoscope, look ahead a single character to determine which token should come next, but I am afraid this is insufficient for Python: for one thing, looking at just one character you can't differentiate between a delimiter and an operator, or between an identifier and a keyword; furthermore, handling indentation looks like a whole new beast to me, among other things.
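
To make the indentation worry concrete, here is roughly the bookkeeping I imagine is needed (a sketch on my part, loosely modeled on my understanding of how CPython's tokenizer keeps a stack of indentation widths and emits INDENT/DEDENT tokens; it doesn't handle tabs or inconsistent dedents):

def indentation_tokens(lines):
    # Stack of active indentation widths; column 0 is always on it.
    indents = [0]
    for line in lines:
        body = line.lstrip(' ')
        if body.strip() == '' or body.startswith('#'):
            continue  # blank lines and comments don't change indentation
        width = len(line) - len(body)
        if width > indents[-1]:
            indents.append(width)
            yield ('INDENT', width)
        while width < indents[-1]:
            indents.pop()
            yield ('DEDENT', width)
        yield ('LINE', body.rstrip('\n'))
    # Close any blocks still open at the end of the input.
    while len(indents) > 1:
        indents.pop()
        yield ('DEDENT', 0)

for tok in indentation_tokens(['def f(x):\n',
                               '    if x:\n',
                               '        return x\n',
                               '    return 0\n']):
    print(tok)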

I have found this link to be very helpful; however, when I tried implementing it, my code quickly started looking pretty ugly, with a lot of if statements and casework, and it didn't seem like the 'right' way to do it.

Are there any good resources out there that would help/teach me to lex this kind of code? (I'd also like to fully parse it, but first things first, right?)

I am not above using parser generators, but I want the resulting Python code to use a simple subset of Python, and also to be reasonably self-contained, so that I can at least dream of having a language that can interpret itself. (For instance, from what I understand looking at this example, if I use ply, I will need my language to interpret the ply package as well in order to interpret itself, which I imagine would make things more complicated.)

1 Comment
It's common for lexers to have a lot of condition checking. That's why you put it into a lexer, so the if statements don't show up everywhere else in the code.

4 Answers

2

Take a look at http://pyparsing.wikispaces.com/; you may find it useful for your task.
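
For example, a toy tokenizer for simple expressions might look roughly like this (a sketch only; the grammar elements are illustrative, so check the pyparsing docs for details):

from pyparsing import Word, alphas, alphanums, nums, oneOf

identifier = Word(alphas + '_', alphanums + '_')
number = Word(nums)
operator = oneOf('+ - * / = ( )')
token = identifier | number | operator

# scanString yields (tokens, start, end) for each match in the input.
for toks, start, end in token.scanString('erw = _abc + 12*(R4-623902)'):
    print(toks[0], start)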


1 Comment

Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing
1

I have used the traditional flex/lex and bison/yacc for similar projects in the past. I have also used ply (Python Lex-Yacc), and I found the skills very transferable from one to the other.

So if you have never written a parser before, I would write your first one using ply; you'll learn some useful skills for later projects.

When you get your ply parser working, you can then make one by hand as an educational exercise. Writing lexers and parsers by hand gets really messy really quickly in my experience, hence the success of parser generators!
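
To give a flavor of what that looks like, here is a minimal ply lexer sketch (it assumes ply is installed; the token set is purely illustrative, nowhere near full Python):

import ply.lex as lex

tokens = ('NUMBER', 'IDENTIFIER', 'PLUS', 'EQUALS')

t_PLUS = r'\+'
t_EQUALS = r'='
t_IDENTIFIER = r'[a-zA-Z_]\w*'
t_ignore = ' \t'    # skip spaces and tabs

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print('Illegal character %r' % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('erw = _abc + 12')
for tok in lexer:
    print(tok)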


0

Consider looking at PyPy, a Python implementation written in Python. It obviously has a Python parser, too.

1 Comment

Their lexer is written in terms of a state machine. It's not just structured as a state machine (like any sensible lexer should be); they describe the tokens as data structures resembling state machines and generate a table-driven lexer from them. I'm not sure that's a good starting point for a beginner.
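
To illustrate what 'table-driven' means here (my own toy example, not PyPy's actual code): the transition rules live in a data table and a generic loop walks it, instead of the logic being spread across if statements.

# Toy table-driven DFA recognizing integers and identifiers.
TABLE = {
    ('start', 'digit'):  'number',
    ('start', 'alpha'):  'ident',
    ('number', 'digit'): 'number',
    ('ident', 'alpha'):  'ident',
    ('ident', 'digit'):  'ident',
}
ACCEPTING = {'number': 'NUMBER', 'ident': 'IDENTIFIER'}

def char_class(ch):
    if ch.isdigit():
        return 'digit'
    if ch.isalpha() or ch == '_':
        return 'alpha'
    return 'other'

def scan_one(text, pos):
    state, start = 'start', pos
    while pos < len(text):
        nxt = TABLE.get((state, char_class(text[pos])))
        if nxt is None:
            break
        state, pos = nxt, pos + 1
    if state in ACCEPTING:
        return ACCEPTING[state], text[start:pos], pos
    raise ValueError('no token at position %d' % start)

print(scan_one('abc123', 0))   # ('IDENTIFIER', 'abc123', 6)
print(scan_one('42x', 0))      # ('NUMBER', '42', 2)
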
0

This simple regex-based lexer has served me quite well a few times:

#-------------------------------------------------------------------------------
# lexer.py
#
# A generic regex-based Lexer/tokenizer tool.
# See the if __main__ section at the bottom for an example.
#
# Eli Bendersky ([email protected])
# This code is in the public domain
# Last modified: August 2010
#-------------------------------------------------------------------------------
import re
import sys


class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position. 
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to 
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        # All the regexes are concatenated into a single one
        # with named groups. Since the group names must be valid
        # Python identifiers, but the token types used by the 
        # user are arbitrary strings, we auto-generate the group
        # names and map them to token types.
        #
        idx = 1
        regex_parts = []
        self.group_type = {}

        for regex, type in rules:
            groupname = 'GROUP%s' % idx
            regex_parts.append('(?P<%s>%s)' % (groupname, regex))
            self.group_type[groupname] = type
            idx += 1

        self.regex = re.compile('|'.join(regex_parts))
        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile(r'\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the 
            input buffer. None is returned if the end of the 
            buffer was reached. 
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf, self.pos)

                if m:
                    self.pos = m.start()
                else:
                    return None

            m = self.regex.match(self.buf, self.pos)
            if m:
                groupname = m.lastgroup
                tok_type = self.group_type[groupname]
                tok = Token(tok_type, m.group(groupname), self.pos)
                self.pos = m.end()
                return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while True:
            tok = self.token()
            if tok is None: break
            yield tok


if __name__ == '__main__':
    rules = [
        (r'\d+',            'NUMBER'),
        (r'[a-zA-Z_]\w*',   'IDENTIFIER'),  # \w* so one-character names match too
        (r'\+',             'PLUS'),
        (r'\-',             'MINUS'),
        (r'\*',             'MULTIPLY'),
        (r'\/',             'DIVIDE'),
        (r'\(',             'LP'),
        (r'\)',             'RP'),
        (r'=',              'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902)  ')

    try:
        for tok in lx.tokens():
            print(tok)
    except LexerError as err:
        print('LexerError at position %s' % err.pos)
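
Since your question also mentions telling keywords apart from identifiers: one common approach is to lex them all as identifiers and reclassify afterwards. A rough sketch on top of the Lexer above (the keyword set here is deliberately incomplete):

KEYWORDS = {'def', 'return', 'if', 'elif', 'else', 'while', 'for', 'pass'}

def classify(tok):
    # Promote identifiers that happen to be keywords.
    if tok.type == 'IDENTIFIER' and tok.val in KEYWORDS:
        tok.type = 'KEYWORD'
    return tok

lx = Lexer(rules, skip_whitespace=True)
lx.input('if x1 = 42')
for tok in (classify(t) for t in lx.tokens()):
    print(tok)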

