3

As an educational exercise, I've set out to write a Python lexer in Python. Eventually, I'd like to implement a simple subset of Python that can run itself, so I want this lexer to be written in a reasonably simple subset of Python with as few imports as possible.

The tutorials I have found on lexing, for instance Kaleidoscope, look ahead a single character to determine which token should come next, but I am afraid this is insufficient for Python: for one thing, looking at just one character you can't differentiate between a delimiter and an operator, or between an identifier and a keyword; furthermore, handling indentation looks like a whole new beast to me, among other things.
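
To make the indentation worry concrete, here is roughly the bookkeeping I imagine is needed (a sketch on my part, loosely modeled on my understanding of how CPython's tokenizer keeps a stack of indentation widths and emits INDENT/DEDENT tokens; it doesn't handle tabs or inconsistent dedents):

def indentation_tokens(lines):
    # Stack of active indentation widths; column 0 is always on it.
    indents = [0]
    for line in lines:
        body = line.lstrip(' ')
        if body.strip() == '' or body.startswith('#'):
            continue  # blank lines and comments don't change indentation
        width = len(line) - len(body)
        if width > indents[-1]:
            indents.append(width)
            yield ('INDENT', width)
        while width < indents[-1]:
            indents.pop()
            yield ('DEDENT', width)
        yield ('LINE', body.rstrip('\n'))
    # Close any blocks still open at the end of the input.
    while len(indents) > 1:
        indents.pop()
        yield ('DEDENT', 0)

for tok in indentation_tokens(['def f(x):\n',
                               '    if x:\n',
                               '        return x\n',
                               '    return 0\n']):
    print(tok)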

I have found this link to be very helpful; however, when I tried implementing it, my code quickly started looking pretty ugly, with a lot of if statements and casework, and it didn't seem like the 'right' way to do it.

Are there any good resources out there that would help/teach me to lex this kind of code? (I'd also like to fully parse it, but first things first, right?)

I am not above using parser generators, but I want the resulting Python code to use a simple subset of Python, and also to be reasonably self-contained, so that I can at least dream of having a language that can interpret itself. (For instance, from what I understand looking at this example, if I use ply, I will need my language to interpret the ply package as well in order to interpret itself, which I imagine would make things more complicated.)

1 Comment
It's common for lexers to have a lot of condition checking. That's why you put it into a lexer, so the if statements don't show up everywhere else in the code.

4 Answers

2

Take a look at http://pyparsing.wikispaces.com/; you may find it useful for your task.
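
For example, a toy tokenizer for simple expressions might look roughly like this (a sketch only; the grammar elements are illustrative, so check the pyparsing docs for details):

from pyparsing import Word, alphas, alphanums, nums, oneOf

identifier = Word(alphas + '_', alphanums + '_')
number = Word(nums)
operator = oneOf('+ - * / = ( )')
token = identifier | number | operator

# scanString yields (tokens, start, end) for each match in the input.
for toks, start, end in token.scanString('erw = _abc + 12*(R4-623902)'):
    print(toks[0], start)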


1 Comment

Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing
1

I have used the traditional flex/lex and bison/yacc for similar projects in the past. I have also used ply (Python Lex-Yacc), and I found the skills very transferable from one to the other.

So if you have never written a parser before, I would write your first one using ply; you'll learn some useful skills for later projects.

When you get your ply parser working, you can then make one by hand as an educational exercise. Writing lexers and parsers by hand gets really messy really quickly in my experience, hence the success of parser generators!
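
To give a flavor of what that looks like, here is a minimal ply lexer sketch (it assumes ply is installed; the token set is purely illustrative, nowhere near full Python):

import ply.lex as lex

tokens = ('NUMBER', 'IDENTIFIER', 'PLUS', 'EQUALS')

t_PLUS = r'\+'
t_EQUALS = r'='
t_IDENTIFIER = r'[a-zA-Z_]\w*'
t_ignore = ' \t'    # skip spaces and tabs

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print('Illegal character %r' % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('erw = _abc + 12')
for tok in lexer:
    print(tok)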


0

Consider looking at PyPy, a Python implementation written in Python. It obviously has a Python parser, too.

1 Comment

Their lexer is written in terms of a state machine. It's not just structured as a state machine (like any sensible lexer should be); they describe the tokens as data structures resembling state machines and generate a table-driven lexer from them. I'm not sure that's a good starting point for a beginner.
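
To illustrate what 'table-driven' means here (my own toy example, not PyPy's actual code): the transition rules live in a data table and a generic loop walks it, instead of the logic being spread across if statements.

# Toy table-driven DFA recognizing integers and identifiers.
TABLE = {
    ('start', 'digit'):  'number',
    ('start', 'alpha'):  'ident',
    ('number', 'digit'): 'number',
    ('ident', 'alpha'):  'ident',
    ('ident', 'digit'):  'ident',
}
ACCEPTING = {'number': 'NUMBER', 'ident': 'IDENTIFIER'}

def char_class(ch):
    if ch.isdigit():
        return 'digit'
    if ch.isalpha() or ch == '_':
        return 'alpha'
    return 'other'

def scan_one(text, pos):
    state, start = 'start', pos
    while pos < len(text):
        nxt = TABLE.get((state, char_class(text[pos])))
        if nxt is None:
            break
        state, pos = nxt, pos + 1
    if state in ACCEPTING:
        return ACCEPTING[state], text[start:pos], pos
    raise ValueError('no token at position %d' % start)

print(scan_one('abc123', 0))   # ('IDENTIFIER', 'abc123', 6)
print(scan_one('42x', 0))      # ('NUMBER', '42', 2)
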
0

This simple regex-based lexer has served me quite well a few times:

#-------------------------------------------------------------------------------
# lexer.py
#
# A generic regex-based Lexer/tokenizer tool.
# See the if __main__ section at the bottom for an example.
#
# Eli Bendersky ([email protected])
# This code is in the public domain
# Last modified: August 2010
#-------------------------------------------------------------------------------
import re
import sys


class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position. 
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to 
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        # All the regexes are concatenated into a single one
        # with named groups. Since the group names must be valid
        # Python identifiers, but the token types used by the 
        # user are arbitrary strings, we auto-generate the group
        # names and map them to token types.
        #
        idx = 1
        regex_parts = []
        self.group_type = {}

        for regex, type in rules:
            groupname = 'GROUP%s' % idx
            regex_parts.append('(?P<%s>%s)' % (groupname, regex))
            self.group_type[groupname] = type
            idx += 1

        self.regex = re.compile('|'.join(regex_parts))
        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile(r'\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the 
            input buffer. None is returned if the end of the 
            buffer was reached. 
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf, self.pos)

                if m:
                    self.pos = m.start()
                else:
                    return None

            m = self.regex.match(self.buf, self.pos)
            if m:
                groupname = m.lastgroup
                tok_type = self.group_type[groupname]
                tok = Token(tok_type, m.group(groupname), self.pos)
                self.pos = m.end()
                return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while True:
            tok = self.token()
            if tok is None: break
            yield tok


if __name__ == '__main__':
    rules = [
        (r'\d+',            'NUMBER'),
        (r'[a-zA-Z_]\w*',   'IDENTIFIER'),  # \w* so one-character names match too
        (r'\+',             'PLUS'),
        (r'\-',             'MINUS'),
        (r'\*',             'MULTIPLY'),
        (r'\/',             'DIVIDE'),
        (r'\(',             'LP'),
        (r'\)',             'RP'),
        (r'=',              'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902)  ')

    try:
        for tok in lx.tokens():
            print(tok)
    except LexerError as err:
        print('LexerError at position %s' % err.pos)
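
Since your question also mentions telling keywords apart from identifiers: one common approach is to lex them all as identifiers and reclassify afterwards. A rough sketch on top of the Lexer above (the keyword set here is deliberately incomplete):

KEYWORDS = {'def', 'return', 'if', 'elif', 'else', 'while', 'for', 'pass'}

def classify(tok):
    # Promote identifiers that happen to be keywords.
    if tok.type == 'IDENTIFIER' and tok.val in KEYWORDS:
        tok.type = 'KEYWORD'
    return tok

lx = Lexer(rules, skip_whitespace=True)
lx.input('if x1 = 42')
for tok in (classify(t) for t in lx.tokens()):
    print(tok)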

