Python split text into tokens using regex

Question

Hi I have a question about splitting strings into tokens.

Here is an example string:

string = "As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests."

and I'm trying to split string correctly into its tokens.

Here is my function count_words

import re


def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split(r"[\s.,!?:;'\"-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary

and the result of split here

['as', 'i', 'was', 'waiting', 'a', 'man', 'came', 'out', 'of', 'a', 'side', 'room', 'and', 'at', 'a', 'glance', 'i', 'was', 'sure', 'he', 'must', 'be', 'long', 'john', 'his', 'left', 'leg', 'was', 'cut', 'off', 'close', 'by', 'the', 'hip', 'and', 'under', 'the', 'left', 'shoulder', 'he', 'carried', 'a', 'crutch', 'which', 'he', 'managed', 'with', 'wonderful', 'dexterity', 'hopping', 'about', 'upon', 'it', 'like', 'a', 'bird', 'he', 'was', 'very', 'tall', 'and', 'strong', 'with', 'a', 'face', 'as', 'big', 'as', 'a', 'ham—plain', 'and', 'pale', 'but', 'intelligent', 'and', 'smiling', 'indeed', 'he', 'seemed', 'in', 'the', 'most', 'cheerful', 'spirits', 'whistling', 'as', 'he', 'moved', 'about', 'among', 'the', 'tables', 'with', 'a', 'merry', 'word', 'or', 'a', 'slap', 'on', 'the', 'shoulder', 'for', 'the', 'more', 'favoured', 'of', 'his', 'guests', '']

As you can see, there is an empty string '' at the last index of the split list.

Please help me understand this empty string in the list and to correctly split this example string.

The short answer to solve your problem: use del split[-1] to remove that last element.
– Sssssuppp, Commented Feb 17, 2019 at 14:54
Perhaps this page can be helpful: stackoverflow.com/questions/16099694/…
– The fourth bird, Commented Feb 17, 2019 at 14:54
What if you have a word like I'm? Should it get split? Note that rather than splitting, you may match all those strings: re.findall(r"[^\s.,!?:;'\"-]+", s). See this demo. The empty string appears because the separator matches at the end of the string.
– Wiktor Stribiżew, Commented Feb 17, 2019 at 15:58

DWD · Accepted Answer · 2019-02-17 14:32:56Z

You could use a list comprehension to iterate over the list items produced by re.split and only keep them if they are not empty strings:

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation 
    # (Hint: Use regex to split on non-alphanumeric characters) 

    split = re.split(r"[\s.,!?:;'\"-]+", lowerText)
    split = [x for x in split if x != '']  # <- list comprehension drops empty strings
    print(split)

You should also consider returning the data from the function, and printing it from the caller rather than printing it from within the function. That will provide you with flexibility in future.
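A minimal sketch of that suggestion, assuming the same split pattern as above. The sample sentence and the name `result` are only illustrative; the function returns the counts and the caller decides whether to print them:

```python
import re


def count_words(text):
    """Return a dict mapping each lowercase word in text to its count."""
    tokens = re.split(r"[\s.,!?:;'\"-]+", text.lower())
    counts = {}
    for token in tokens:
        if token:  # skip the empty strings re.split can leave at the edges
            counts[token] = counts.get(token, 0) + 1
    return counts


# The caller owns the output; here it simply prints the returned dict.
result = count_words("A man, a plan. A canal!")
print(result)
```

Because nothing is printed inside the function, the same function can be reused in tests or in code that writes the counts somewhere else.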

Mohammed Elhag · Accepted Answer · 2019-02-17 22:12:20Z


That happens because the string ends with ".", which is part of the split pattern. When the separator matches at the very end of the string, re.split still returns the text after it, and that text is empty, which is why you see ''.
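A small illustration of this, using the question's pattern on two made-up strings (the variable names are only for the example):

```python
import re

pattern = r"[\s.,!?:;'\"-]+"

# The final "." is a separator, so re.split reports the empty text after it.
with_period = re.split(pattern, "hello world.")
# With no trailing separator there is nothing empty to report.
without_period = re.split(pattern, "hello world")

print(with_period)     # ['hello', 'world', '']
print(without_period)  # ['hello', 'world']
```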

I suggest using re.findall instead, which works the opposite way: it matches the words rather than the separators, like this:

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.findall(r"[a-z\-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary

edited Feb 17, 2019 at 22:12

answered Feb 17, 2019 at 15:21, Mohammed Elhag

8 Comments

user11074789 Over a year ago

Hello, what do r and \- mean in r"[a-z\-]+"?

user11074789 Over a year ago

And what does findall do?

Mohammed Elhag Over a year ago

@kimYumi r is the raw-string prefix; [a-z\-] is a character class that matches the letters a through z as well as the hyphen -.

Mohammed Elhag Over a year ago

@kimYumi re.findall returns all non-overlapping matches of the pattern in the string, as a list of strings.

Mohammed Elhag Over a year ago

@kimYumi docs.python.org/2/library/…
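To make the two points from these comments concrete, here is a short sketch. The sample sentence and variable names are made up for illustration:

```python
import re

# In a raw string the backslash stays a literal character,
# so the regex engine receives "\-" exactly as written.
raw_len = len(r"\n")     # 2: a backslash and the letter n
normal_len = len("\n")   # 1: a single newline character

# findall returns every non-overlapping run of lowercase letters and hyphens.
words = re.findall(r"[a-z\-]+", "well-known words, like ham.")

print(raw_len, normal_len)  # 2 1
print(words)                # ['well-known', 'words', 'like', 'ham']
```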


Marco Luzzara · Accepted Answer · 2019-02-17 22:32:47Z

Python's documentation for re.split explains this behavior:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string

Even though yours is not actually a capturing group, the effect is the same. Note that the empty string can appear at the start as well as at the end (for instance, if your string started with whitespace).
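For instance, with a made-up string that both starts and ends with a separator (the name `tokens` is only illustrative), an empty string shows up on each side:

```python
import re

pattern = r"[\s.,!?:;'\"-]+"

# Leading space and trailing ". " are both separator runs at the string edges.
tokens = re.split(pattern, " hello, world. ")
print(tokens)  # ['', 'hello', 'world', '']
```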

The two solutions already proposed (more or less) by others are these:

Solution 1: `findall`

As other users pointed out, you can use findall and invert the logic of the pattern. With yours, you can easily negate your character class: [^\s\.,!?:;'\"-]+.

But it depends on your regex pattern, because inverting it is not always that easy.
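A rough sketch of the negated-class approach, on a made-up sentence. The names `split_pattern` and `token_pattern` are just for this example; the check at the end compares the result with splitting and then dropping empty strings:

```python
import re

text = " as big as a ham, plain and pale. "

split_pattern = r"[\s.,!?:;'\"-]+"    # matches runs of separators
token_pattern = r"[^\s.,!?:;'\"-]+"   # same class negated: matches runs of non-separators

tokens = re.findall(token_pattern, text.lower())
print(tokens)  # ['as', 'big', 'as', 'a', 'ham', 'plain', 'and', 'pale']

# Same tokens as splitting and filtering out the empty strings.
filtered = [t for t in re.split(split_pattern, text.lower()) if t]
print(tokens == filtered)  # True
```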

Solution 2: check on the starting and ending token

Instead of checking whether each token is != '', you can just look at the first and the last token. Because the pattern greedily consumes every run of separator characters, an empty string can only appear at the start or at the end of the list.

split = re.split(r"[\s\.,!?:;'\"-]+", lowerText)

# `split and` guards against an empty list after the first trim
if split and split[0] == '':
    split = split[1:]

if split and split[-1] == '':
    split = split[:-1]

Ulises Rosas-Puchuri · Accepted Answer · 2019-02-18 02:35:19Z

You get an empty string because the final period also matches the split pattern at the end of the string, and nothing follows it. You can, however, filter out empty strings with the filter function and so complete your function:

import re
import collections


def count_words(text):
    """Count how many times each unique word occurs in text."""

    lowerText = text.lower()

    split = re.split(r"[ .,!?:;'\"\-]+", lowerText)

    # filter out empty strings and count the words
    return collections.Counter(filter(None, split))


count_words(text=string)
# Counter({'a': 9, 'he': 6, 'the': 6, 'and': 5, 'as': 4, 'was': 4, 'with': 3, 'his': 2, 'about': 2, 'i': 2, 'of': 2, 'shoulder': 2, 'left': 2, 'dexterity': 1, 'seemed': 1, 'managed': 1, 'among': 1, 'indeed': 1, 'favoured': 1, 'moved': 1, 'it': 1, 'slap': 1, 'cheerful': 1, 'at': 1, 'in': 1, 'close': 1, 'glance': 1, 'face': 1, 'pale': 1, 'smiling': 1, 'out': 1, 'tables': 1, 'cut': 1, 'ham': 1, 'for': 1, 'long': 1, 'intelligent': 1, 'waiting': 1, 'wonderful': 1, 'which': 1, 'under': 1, 'must': 1, 'bird': 1, 'guests': 1, 'more': 1, 'hip': 1, 'be': 1, 'sure': 1, 'leg': 1, 'very': 1, 'big': 1, 'spirits': 1, 'upon': 1, 'but': 1, 'like': 1, 'most': 1, 'carried': 1, 'whistling': 1, 'merry': 1, 'tall': 1, 'word': 1, 'strong': 1, 'by': 1, 'on': 1, 'john': 1, 'off': 1, 'room': 1, 'hopping': 1, 'or': 1, 'crutch': 1, 'man': 1, 'plain': 1, 'side': 1, 'came': 1})

jayveesea · Accepted Answer · 2020-05-31 12:02:47Z


import string

def count_words(text):

    counts = dict() 

    text = text.translate(text.maketrans('', '', string.punctuation))
    text = text.lower()

    words = text.split()
    print(words)

    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

    return counts

It works.

edited May 31, 2020 at 12:02, jayveesea

answered May 31, 2020 at 10:54, Lima

1 Comment

jayveesea Over a year ago

It could help to have some explanation with your answer.
