
I've been trying to make a tokenizer for a language as an exercise. For example, I'm trying to tokenize the code below:

num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}

and I've been trying to use this regex

re.findall("[\'\w\-]+", text)

but I get outputs like this one: vecsum(vec

while I want to get this: ["vecsum", "(", "vec"]

I want it to split tokens like ";" and "(" even when there is no whitespace around them.

  • ' is already part of \S, no need to add that separately. Commented Nov 11, 2018 at 17:28
  • Make sure to use the r before strings that have literal backslashes. Commented Nov 11, 2018 at 17:35
  • Do you mean this: findall(r"\w+|[^\w\s]+", text)? Or similar: r"\S+?(?:\b|$)". Commented Nov 11, 2018 at 17:38
  • @trincot: that still puts symbols together into one token: if(foo){bar++;baz--} would result in ['if', '(', 'foo', '){', 'bar', '++;', 'baz', '--}']; note the joining of ++; and --}. Commented Nov 11, 2018 at 17:54
  • Just trying to get some info from the OP, who apparently chooses to remain silent. ;-) Commented Nov 11, 2018 at 17:58

1 Answer


Tokenizing a C-like language requires more work than just splitting on whitespace (which is what you are doing right now).

There are at least 3 types of tokens in such languages:

  • Single character tokens, such as (, ), ;, + and =
  • Multi-character tokens: identifiers, keywords, numbers.
  • Strings; everything between an opening quote and the matching closing quote (with support for escapes, and varying degrees of support for special kinds of strings that could, say, contain newlines).

I'm ignoring comments here, which are either defined as running from a starting sequence until the end of a line (# ..., // ..., etc.) or from starting sequence to ending sequence any number of lines away (/* .... */).

You could define a regex that can tokenize the first two types, and then perhaps use the output of that to handle strings (when you get a " token find the next " token without a \ token right in front, then take everything in between (whitespace and all) as the string).
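As a minimal sketch of that string-handling pass (the pattern, function name, and escape handling here are mine and simplified; a `\\"` escaped backslash before a closing quote, for instance, is not handled):

```python
import re

# simple two-group pattern: single-character symbols, or identifier/number runs
TOKEN = re.compile(r'[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+')

def tokenize(text):
    """Tokenize text, folding everything between a double quote and the
    next unescaped double quote into one string token."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.search(text, pos)
        if not m:
            break
        if m.group() == '"':
            # scan forward for the closing quote, skipping \" escapes
            end = m.end()
            while end < len(text) and (text[end] != '"' or text[end - 1] == '\\'):
                end += 1
            tokens.append(text[m.start():end + 1])  # keep both quotes
            pos = end + 1
        else:
            tokens.append(m.group())
            pos = m.end()
    return tokens
```

For example, `tokenize('x = "a b";')` keeps the quoted part, whitespace and all, as a single token.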

Such a tokenizer then needs at least two groups, for single-character and multi-character tokens. The multi-character tokens are a further group of options:

r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'

I used the Wikipedia list of operators in C and C++ as a guide on which single-character tokens to look for.

For your sample input, this produces:

['num', 'vecsum', '(', 'vec', 'A', ')', '{', 'num', 'n', ';', 'n', '=', '(', '5', '+', '2', ')', ';', 'return', 'n', ';', '}']
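As a quick runnable check (the variable names here are mine):

```python
import re

# single-character symbols, or identifier/keyword/number runs
pattern = r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'

code = """num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}"""

print(re.findall(pattern, code))
# ['num', 'vecsum', '(', 'vec', 'A', ')', '{', 'num', 'n', ';',
#  'n', '=', '(', '5', '+', '2', ')', ';', 'return', 'n', ';', '}']
```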

If you must parse multi-symbol operators as single tokens, then you also must include these as separate patterns in the regex, e.g.

(
    r'(?:<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--|\+=|-='
    r'|\*=|/=|%=|&=|\^=|\|='
    r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'
)
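A runnable sketch of that extended pattern (note two details that are easy to get wrong: `+` must be escaped inside the alternation, and the longest operators such as `<<=` must be listed before their prefixes such as `<<`, or they will never match):

```python
import re

# longest operators first so e.g. <<= is not split into << and =
multi = re.compile(
    r'(?:<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--'
    r'|\+=|-=|\*=|/=|%=|&=|\^=|\|='
    r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+)'
)

print(multi.findall('n <<= 2; ok = a == b && c != d;'))
# ['n', '<<=', '2', ';', 'ok', '=', 'a', '==', 'b', '&&', 'c', '!=', 'd', ';']
```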

but then you are half-way to a full-blown tokenizer that defines patterns for each type of literal and keyword and you may as well start breaking out this huge regex into such constituent parts. See the Python tokenize module source code for an example of such a tokenizer; it builds up a large regex from component parts to produce typed tokens.

The alternative is to stick with the super-simple 2-part tokenizer regex and use re.finditer() to make decisions about tokens in context. With a start and end position in the string, you can detect that = was directly preceded by =, and so know you have the == comparison operator rather than two assignments. I used this in a simple parser for a SQLite full-text search query language before (look for the _terms_from_query() method in the code for this answer), if you want to see an example.
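A minimal sketch of that finditer() approach, merging two touching = tokens into == using their match positions (the function name is mine):

```python
import re

# the simple two-part pattern: single symbols or word runs
simple = re.compile(r'[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+')

def tokens_with_context(text):
    """Merge two adjacent '=' matches into one '==' token, using
    start/end positions to ensure they actually touch in the source."""
    merged = []
    prev = None
    for m in simple.finditer(text):
        if (prev is not None and prev.group() == '='
                and m.group() == '=' and prev.end() == m.start()):
            merged[-1] = '=='
            prev = None  # don't merge a third '=' into this pair
            continue
        merged.append(m.group())
        prev = m
    return merged
```

Here `a == b` yields a single `==` token, while `a = b` keeps a lone `=`, because the positional check distinguishes adjacent matches from separated ones.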


3 Comments

As you said, I currently have this: re.findall(r"\w+(?:'\w+)?|[^\w\s]", text), which works, but it doesn't recognise multi-character tokens like == or <=.
I tested your code and still had the same problem: it recognised == as two different tokens.
@mahditalebi I didn’t say my approach would recognise multi-character operators. You’d have to add those as separate groups before the single-character operator pattern. Serious regex-based tokenizers use a large number of specific patterns.
