Hi, I have a question about splitting strings into tokens.
Here is an example string:
string= "As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests."
I'm trying to split this string correctly into its tokens.
Here is my function count_words:
import re

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    # Raw string so that \s reaches the regex engine unchanged
    split = re.split(r"[\s.,!?:;'\"-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary
Here is the result of split:
['as', 'i', 'was', 'waiting', 'a', 'man', 'came', 'out', 'of', 'a', 'side', 'room', 'and', 'at', 'a', 'glance', 'i', 'was', 'sure', 'he', 'must', 'be', 'long', 'john', 'his', 'left', 'leg', 'was', 'cut', 'off', 'close', 'by', 'the', 'hip', 'and', 'under', 'the', 'left', 'shoulder', 'he', 'carried', 'a', 'crutch', 'which', 'he', 'managed', 'with', 'wonderful', 'dexterity', 'hopping', 'about', 'upon', 'it', 'like', 'a', 'bird', 'he', 'was', 'very', 'tall', 'and', 'strong', 'with', 'a', 'face', 'as', 'big', 'as', 'a', 'ham—plain', 'and', 'pale', 'but', 'intelligent', 'and', 'smiling', 'indeed', 'he', 'seemed', 'in', 'the', 'most', 'cheerful', 'spirits', 'whistling', 'as', 'he', 'moved', 'about', 'among', 'the', 'tables', 'with', 'a', 'merry', 'word', 'or', 'a', 'slap', 'on', 'the', 'shoulder', 'for', 'the', 'more', 'favoured', 'of', 'his', 'guests', '']
As you can see, there is an empty string '' at the last index of the split list.
Please help me understand why this empty string is in the list and how to split this example string correctly.
You can use del split[-1] to remove that last element.
What about "I'm"? Should it get split?
Note that rather than splitting, you may match all those strings: re.findall(r"[^\s.,!?:;'\"-]+", s). The empty string is there because the pattern matches at the very end of the string (the final "."), so re.split returns the empty text that follows that last match.
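To make the suggestions above concrete, here is a minimal sketch of both approaches on a shortened sample. The names DELIMS, split_tokens, and find_tokens are made up for illustration, and the \w+ variant is an assumption about what "correctly" means here: it also separates 'ham—plain', which the original delimiter set leaves joined, but it would break "I'm" into "i" and "m".

```python
import re
from collections import Counter

# Same delimiter set as in the question (raw string).
DELIMS = r"[\s.,!?:;'\"-]+"

def split_tokens(text):
    """Split on the delimiters, then drop any empty strings."""
    return [tok for tok in re.split(DELIMS, text.lower()) if tok]

def find_tokens(text):
    """Match runs of word characters instead of splitting."""
    return re.findall(r"\w+", text.lower())

sample = "A face as big as a ham—plain and pale."

# The trailing "." is a delimiter, so re.split reports the empty
# text after it as a final element.
raw = re.split(DELIMS, sample.lower())
print(raw[-1] == "")          # True

print(split_tokens(sample))   # no empty string, 'ham—plain' still joined
print(find_tokens(sample))    # 'ham' and 'plain' are separate tokens

# Aggregating the counts, as the last TODO asks.
counts = Counter(find_tokens(sample))
print(counts["as"])           # 2
```

Filtering with `if tok` is a bit safer than `del split[-1]`, since it also handles a string that starts with a delimiter or does not end with one.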