re.sub erroring with "Expected string or bytes-like object"

Question

I have read multiple posts regarding this error, but I still can't figure it out. When I try to loop through my function:

def fix_Plan(location):
    letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          location)     # Column and row to search    

    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    return (" ".join(meaningful_words))

col_Plan = fix_Plan(train["Plan"][0])
num_responses = train["Plan"].size
clean_Plan_responses = []

for i in range(0,num_responses):
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))

Here is the error:

Traceback (most recent call last):
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 48, in <module>
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 22, in fix_Plan
    location)  # Column and row to search
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

If you are getting an error, always post the full error including the stack trace.
– juanpa.arrivillaga, Commented May 1, 2017 at 22:48
Please print(train["Plan"][i]) and see what it is. Do it before the call to fix_Plan() in the for loop. I don't think train["Plan"][i] is what you expected it to be.
– Taku, Commented May 1, 2017 at 22:50
It is a string from an Excel document formatted like this: Video editing: Further develop video production skills using tools such as Wochit, Videolicious and iMovie. Develop a production plan specific to sports that matches effort to potential audience/impact. Expand HTML/CSS skills and identify one to two projects in Sports that could benefit from being presented in an HTML story then implement.
– imanexcelnoob, Commented May 1, 2017 at 22:55
Are you sure it's a string? Try printing type(train['Plan'][i]).
– juanpa.arrivillaga, Commented May 1, 2017 at 22:57
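
The type check suggested in the comments can be run over the whole column at once instead of one row at a time. This is a minimal sketch, assuming a plain list named plans stands in for train["Plan"]; the values are made up for illustration.

```python
# Hypothetical stand-in for train["Plan"]; an empty Excel cell typically
# arrives as a float NaN when the sheet is read with pandas.
plans = ["Video editing: develop skills", float("nan"), "Expand HTML/CSS skills", 42]

# Collect the position and type name of every value that is not a string.
non_strings = [(i, type(value).__name__)
               for i, value in enumerate(plans)
               if not isinstance(value, str)]

for i, type_name in non_strings:
    print(f"row {i}: {type_name}")
```

Any row listed here would make re.sub raise the TypeError shown in the traceback.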

Taku · Accepted Answer · 2017-05-01 23:08:27Z

178

As you stated in the comments, some of the values appear to be floats, not strings. You need to convert them to strings before passing them to re.sub. The simplest way is to change location to str(location) in the re.sub call. Doing so is harmless when the value is already a str.

letters_only = re.sub("[^a-zA-Z]",    # Search for all non-letters
                      " ",            # Replace all non-letters with spaces
                      str(location))  # Convert to str before searching
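
A small self-contained sketch of this fix, under the assumption that the offending value is a float NaN (the helper name letters_only_text is made up for illustration). It reproduces the error first, then applies the str() conversion.

```python
import re

def letters_only_text(location):
    # str() turns floats, ints and None into text so that re.sub accepts them.
    return re.sub("[^a-zA-Z]", " ", str(location))

try:
    re.sub("[^a-zA-Z]", " ", float("nan"))   # a float is not a string
except TypeError as exc:
    error_message = str(exc)

cleaned_text = letters_only_text("Plan #1: HTML/CSS")
cleaned_nan = letters_only_text(float("nan"))  # note: becomes the word "nan"
```

One thing to keep in mind with this approach: a missing value is not dropped, it turns into the literal text "nan", which may then be treated as a real word by later processing.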

answered May 1, 2017 at 23:08

Taku


1 Comment

Zaira Zafar Over a year ago

I wrote two notebooks, one in Jupyter and one in Kaggle Kernels. The Jupyter one works fine and produces correct output. The Kaggle notebook gives me an error. I followed your solution and the error went away, but now the sentiment prediction result is wrong.

mario · Accepted Answer · 2020-07-27 16:17:09Z

30

The simplest solution is to apply Python's str function to the column you are trying to loop through.

If you are using pandas, this can be implemented as:

dataframe['column_name'] = dataframe['column_name'].apply(str)
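
As a comment on this answer points out, apply(str) also converts missing values into the text "nan". The sketch below compares the two variants on a made-up column (the DataFrame contents are hypothetical, chosen only to include a missing value and a number).

```python
import pandas as pd

# Hypothetical column with a missing value and a non-string value.
df = pd.DataFrame({"Plan": ["Video editing", float("nan"), 2017]})

# Plain conversion: NaN becomes the text "nan".
as_str = df["Plan"].apply(str)

# Filling missing values first: NaN becomes an empty string instead.
blank_missing = df["Plan"].fillna("").apply(str)
```

Which variant is appropriate depends on whether "nan" would be mistaken for real text further down the pipeline.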

edited Jul 27, 2020 at 16:17

mario


answered Nov 1, 2019 at 7:30

msaif


2 Comments

lowzhao Over a year ago

I would suggest filling NaN values with '' first: dataframe['column_name'] = dataframe['column_name'].fillna('').apply(str), because in most use cases people will not want NaN to become the literal string 'nan'.

Simone Over a year ago

Worked perfectly for me. Thanks a lot! Wish I had read this 1.5h ago. The following converted the column in the DF but also replaced the content (which I do not want!):
df['col'] = repr(df['col'])
df['col'] = str(df['col'])
df['col'] = df.col.astype('str')
This threw an encoding error:
df['col'] = df.col.astype('|S')

Mostafa · Accepted Answer · 2024-05-14 19:08:07Z

4

I had the same problem. Nothing I tried solved it until I realized that there were two special characters in the string.

In my case, the text contained these two characters:

&lrm; (Left-to-Right Mark) and &zwnj; (Zero-Width Non-Joiner)

The solution for me was to delete these two characters, and the problem was solved.

import re
mystring = "&lrm;Some Time W&zwnj;e"
mystring = re.sub(r"&lrm;", "", mystring)    # remove the left-to-right mark
mystring = re.sub(r"&zwnj;", "", mystring)   # remove the zero-width non-joiner

I hope this has helped someone who has a problem like me.
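
These two characters can appear either as entity text ("&lrm;") or as the actual invisible code points (U+200E and U+200C). The following is a sketch that handles both spellings, assuming it is acceptable to decode HTML entities first; the helper name strip_invisible is made up, and html.unescape will also decode any other entities in the text, such as "&amp;".

```python
import html
import re

# Matches the left-to-right mark (U+200E) and zero-width non-joiner (U+200C).
INVISIBLE = re.compile("[\u200e\u200c]")

def strip_invisible(text):
    # Decode entities such as "&lrm;" into real characters first, so both
    # spellings are removed by the same pattern.
    return INVISIBLE.sub("", html.unescape(text))

from_entities = strip_invisible("&lrm;Some Time W&zwnj;e")
from_code_points = strip_invisible("\u200eSome Time W\u200ce")
```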

edited May 14, 2024 at 19:08

answered Apr 26, 2021 at 12:46

Mostafa


cottontail · Accepted Answer · 2024-01-28 22:18:51Z

Use `str.replace` instead

This is about 7 years too late for OP, but if you got here because re.sub raised a similar error on a pandas column, consider using the str.replace method built into pandas instead. This error most commonly appears when a pandas column contains (unexpected) NaN values, which re.sub cannot handle, whereas str.replace skips them and leaves them as NaN.

Example:

import re
import pandas as pd

train = pd.DataFrame({'Plan': ["th1s", '1s', 'N01ce', 'and', float('nan')]})

[re.sub("[^a-zA-Z]", " ", x) for x in train['Plan']]      # <--- TypeError: expected string or bytes-like object
train['Plan'].str.replace(r"[^a-zA-Z]", " ", regex=True)  # <--- OK

OP's fix_Plan function does more than replace characters; however, all of it can still be done in a vectorized way as follows (more or less by replacing the re functions with their pandas counterparts).

from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
stop_words = '|'.join(fr"\b{w}\b" for w in stops)  # pattern to catch stop words
clean_Plan_responses = (
    train['Plan']
    .str.replace("[^a-zA-Z]", " ", regex=True)     # replace all non-letters with spaces
    .str.lower()                                   # convert to lower case
    .str.replace(stop_words, "", regex=True)       # remove all stop words
    .str.split().str.join(" ")                     # remove extraneous space characters
)
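
To try the pipeline without downloading the NLTK corpus, here is a sketch that substitutes a tiny hand-written stop-word set (an assumption for illustration, not the real NLTK list) and a made-up two-row column. It also wraps each word in re.escape and uses a single pair of word boundaries around the alternation, which is a small variation on the pattern above.

```python
import re
import pandas as pd

# Hypothetical stop-word set; the answer uses stopwords.words("english").
stops = {"the", "and", "to"}

# One alternation inside a single pair of word boundaries; re.escape guards
# against stop words that contain regex metacharacters.
stop_pattern = r"\b(?:" + "|".join(re.escape(w) for w in sorted(stops)) + r")\b"

# Made-up data: one normal response and one missing value.
train = pd.DataFrame({"Plan": ["Expand HTML/CSS skills and 2 projects", float("nan")]})

cleaned = (
    train["Plan"]
    .str.replace("[^a-zA-Z]", " ", regex=True)   # non-letters become spaces
    .str.lower()                                 # lower-case everything
    .str.replace(stop_pattern, "", regex=True)   # drop stop words
    .str.split().str.join(" ")                   # collapse repeated spaces
)
```

The missing value passes through every step unchanged and is still NaN at the end, so no TypeError is raised.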

Bilal Chandio · Accepted Answer · 2019-10-27 12:46:21Z

0

I think it would be better to use the re.match() function. Here is an example that may help you.

import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
sentences = word_tokenize("I love to learn NLP \n 'a :(")
# Keep only the tokens that start with a letter, lower-cased
sentences = [word.lower() for word in sentences if re.match('^[a-zA-Z]+', word)]
sentences
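
Note that re.match only anchors at the start of the token, so a token such as "a1" passes the filter above. The sketch below contrasts it with re.fullmatch on a hand-written token list (a hypothetical stand-in for the word_tokenize output, so NLTK is not needed to run it).

```python
import re

# Hypothetical tokens standing in for word_tokenize output.
tokens = ["I", "love", "NLP", "a1", ":(", "don't"]

# re.match: the token only has to START with one or more letters.
starts_with_letter = [t.lower() for t in tokens if re.match("[a-zA-Z]+", t)]

# re.fullmatch: the WHOLE token has to consist of letters.
letters_throughout = [t.lower() for t in tokens if re.fullmatch("[a-zA-Z]+", t)]
```

Both functions still raise the same TypeError when given a non-string, so this filter does not by itself address the error in the question.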

answered Oct 27, 2019 at 12:46

Bilal Chandio


1 Comment

Ben Slade Over a year ago

Why is it better to use the re.match() function?

stay_funn · Accepted Answer · 2022-06-23 13:20:52Z

0

In my experience, this error is caused by a None value passed as the second argument to re.findall().

import re
x = re.findall(r"\[(.*?)\]", None)  # raises TypeError: None is not a string

The error can be reproduced with this code sample.

To avoid this error message, filter out the null values or add a condition that excludes them from processing.
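
One way such a condition could look is sketched below. The helper name find_bracketed and the sample values are made up; the check is written against str rather than None so that floats and ints are skipped as well.

```python
import re

def find_bracketed(value):
    # Only strings are searched; None, floats, ints, etc. yield no matches.
    if not isinstance(value, str):
        return []
    return re.findall(r"\[(.*?)\]", value)

values = ["[a] and [b]", None, 3.14, "no brackets"]
results = [find_bracketed(v) for v in values]
```

Returning an empty list for non-strings keeps the result type uniform, which tends to be convenient when the function is mapped over a whole column.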

answered Jun 23, 2022 at 13:20

stay_funn


1 Comment

EvilSmurf Over a year ago

Please make sure to generalize. Sure, None could be a problem, but so could a float or an int. As the error says, anything that isn't a string or a bytes-like object causes it. If you limit the answer to one specific case, it may not be helpful.
