
I've been trying to make a tokenizer for a language as an exercise. For example, I'm trying to tokenize the code below:

num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}

and I've been trying to use this regex

re.findall("[\'\w\-]+", text)

but I get outputs like this one: vecsum(vec

while I want to get this: ["vecsum", "(", "vec"]

I want it to split tokens like ";" and "(" even when there is no whitespace around them.

  • ' is already part of \S, no need to add that separately. Commented Nov 11, 2018 at 17:28
  • Make sure to use the r before strings that have literal backslashes. Commented Nov 11, 2018 at 17:35
  • Do you mean this: findall(r"\w+|[^\w\s]+", text)? Or similar: r"\S+?(?:\b|$)". Commented Nov 11, 2018 at 17:38
  • @trincot: that still puts symbols together into one token: if(foo){bar++;baz--} would result in ['if', '(', 'foo', '){', 'bar', '++;', 'baz', '--}']; note the joining of ++; and --}. Commented Nov 11, 2018 at 17:54
  • Just trying to get some info from the OP, who apparently chooses to remain silent. ;-) Commented Nov 11, 2018 at 17:58

1 Answer


Tokenizing a C-like language requires more work than just splitting on whitespace (which is what you are doing right now).

There are at least 3 types of tokens in such languages:

  • Single character tokens, such as (, ), ;, + and =
  • Multi-character tokens: identifiers, keywords, numbers.
  • Strings; everything between an opening quote and the matching closing quote (with support for escapes, and varying degrees of support for special kinds of strings that could, say, contain newlines).

I'm ignoring comments here, which are either defined as running from a starting sequence until the end of a line (# ..., // ..., etc.) or from starting sequence to ending sequence any number of lines away (/* .... */).

You could define a regex that can tokenize the first two types, and then perhaps use the output of that to handle strings (when you get a " token find the next " token without a \ token right in front, then take everything in between (whitespace and all) as the string).
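As a minimal sketch of that string-handling pass (the pattern, function name, and escape handling here are mine and simplified; a `\\"` escaped backslash before a closing quote, for instance, is not handled):

```python
import re

# simple two-group pattern: single-character symbols, or identifier/number runs
TOKEN = re.compile(r'[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+')

def tokenize(text):
    """Tokenize text, folding everything between a double quote and the
    next unescaped double quote into one string token."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.search(text, pos)
        if not m:
            break
        if m.group() == '"':
            # scan forward for the closing quote, skipping \" escapes
            end = m.end()
            while end < len(text) and (text[end] != '"' or text[end - 1] == '\\'):
                end += 1
            tokens.append(text[m.start():end + 1])  # keep both quotes
            pos = end + 1
        else:
            tokens.append(m.group())
            pos = m.end()
    return tokens
```

For example, `tokenize('x = "a b";')` keeps the quoted part, whitespace and all, as a single token.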

Such a tokenizer then needs at least two groups, for single-character and multi-character tokens. The multi-character tokens are a further group of options:

r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'

I used the Wikipedia list of operators in C and C++ as a guide on which single-character tokens to look for.

For your sample input, this produces:

['num', 'vecsum', '(', 'vec', 'A', ')', '{', 'num', 'n', ';', 'n', '=', '(', '5', '+', '2', ')', ';', 'return', 'n', ';', '}']
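As a quick runnable check (the variable names here are mine):

```python
import re

# single-character symbols, or identifier/keyword/number runs
pattern = r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'

code = """num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}"""

print(re.findall(pattern, code))
# ['num', 'vecsum', '(', 'vec', 'A', ')', '{', 'num', 'n', ';',
#  'n', '=', '(', '5', '+', '2', ')', ';', 'return', 'n', ';', '}']
```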

If you must parse multi-symbol operators as single tokens, then you also must include these as separate patterns in the regex, e.g.

(
    r'(?:<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--|\+=|-='
    r'|\*=|/=|%=|&=|\^=|\|='
    r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'
)
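A runnable sketch of that extended pattern (note two details that are easy to get wrong: `+` must be escaped inside the alternation, and the longest operators such as `<<=` must be listed before their prefixes such as `<<`, or they will never match):

```python
import re

# longest operators first so e.g. <<= is not split into << and =
multi = re.compile(
    r'(?:<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--'
    r'|\+=|-=|\*=|/=|%=|&=|\^=|\|='
    r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+)'
)

print(multi.findall('n <<= 2; ok = a == b && c != d;'))
# ['n', '<<=', '2', ';', 'ok', '=', 'a', '==', 'b', '&&', 'c', '!=', 'd', ';']
```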

but then you are half-way to a full-blown tokenizer that defines patterns for each type of literal and keyword and you may as well start breaking out this huge regex into such constituent parts. See the Python tokenize module source code for an example of such a tokenizer; it builds up a large regex from component parts to produce typed tokens.

The alternative is to stick with the super-simple 2-part tokenizer regex and use re.finditer() to make decisions about tokens in context. With a start and end position in the string, you can detect that = was directly preceded by =, and so know you have the == comparison operator rather than two assignments. I used this in a simple parser for a SQLite full-text search query language before (look for the _terms_from_query() method in the code for this answer), if you want to see an example.
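A minimal sketch of that finditer() approach, merging two touching = tokens into == using their match positions (the function name is mine):

```python
import re

# the simple two-part pattern: single symbols or word runs
simple = re.compile(r'[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+')

def tokens_with_context(text):
    """Merge two adjacent '=' matches into one '==' token, using
    start/end positions to ensure they actually touch in the source."""
    merged = []
    prev = None
    for m in simple.finditer(text):
        if (prev is not None and prev.group() == '='
                and m.group() == '=' and prev.end() == m.start()):
            merged[-1] = '=='
            prev = None  # don't merge a third '=' into this pair
            continue
        merged.append(m.group())
        prev = m
    return merged
```

Here `a == b` yields a single `==` token, while `a = b` keeps a lone `=`, because the positional check distinguishes adjacent matches from separated ones.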


3 Comments

As you said, I currently have this: re.findall(r"\w+(?:'\w+)?|[^\w\s]", text), which works, but it doesn't recognise multi-character tokens like == or <=.
I tested your code and still had the same problem: it recognised == as two different tokens.
@mahditalebi I didn’t say my approach would recognise multi-character operators. You’d have to add those as separate groups before the single-character operator pattern. Serious regex-based tokenizers use a large number of specific patterns.
