I have a dataframe that I previously transposed. Before the transposition, the numerical columns were float64, as expected. After the transpose, however, the float values turned into strings. I tried to convert the dataframe back using .astype('float'), but it raised an exception because some cells contained two values, like '4.32 6.50'.

I tried using a regex, but when I get it to work it only returns something like:

False False False ... False False

my dataframe looks like this:

q1         q2     q3     q4
4.22       4.11   3.89   4.11
5          2.36   3.68   4.23
1.2 4.63   4.28   5.67   4.87

There are over 1000 rows, and multiple problematic rows are scattered throughout the dataframe. I don't know which ones they are, so removing them manually isn't an option.

I tried the following code:

final = final[~final['q1'].str.contains(r"\d+\.\d\s\d+\.\d", na=False)]

But, the problematic row is still there.

The final result looks like this:

q1 q2 q3 q4

All the rows are gone, but not all of them were problematic.

  • Why is your final result just the column names? Do you want to drop all the rows which have two values in one cell? Commented Jul 28, 2019 at 1:08
  • That's what happened when I executed the code above; I'm not really sure why. And yes, the only thing I want is to drop the rows that have two values in a cell. Commented Jul 28, 2019 at 1:11
  • Try final = final[~final['q1'].str.contains(r'\.{1}', na=False)] Commented Jul 28, 2019 at 1:18
  • What if all rows have whole numbers? @GustavoGradvohl Commented Jul 28, 2019 at 1:19

2 Answers

You were quite close with your regex; there were just a few small problems.


Method 1: cleaning up a specific column

If you know which column is giving the problem, we can use str.contains on a specific column:

m = ~df['q1'].str.contains(r'\d+\.\d+\s\d+\.\d+')
df[m]

Output

     q1    q2    q3    q4
0  4.22  4.11  3.89  4.11
1     5  2.36  3.68  4.23
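As a self-contained sketch of Method 1 (the sample frame below is reconstructed from the question, assuming every cell became a string after the transpose), including the astype(float) conversion suggested in the comments:

```python
import pandas as pd

# Sample frame mirroring the question: everything is a string after the
# transpose, and the last row has two values packed into one q1 cell.
df = pd.DataFrame({
    'q1': ['4.22', '5', '1.2 4.63'],
    'q2': ['4.11', '2.36', '4.28'],
    'q3': ['3.89', '3.68', '5.67'],
    'q4': ['4.11', '4.23', '4.87'],
})

# Keep only rows whose q1 cell does NOT contain two numbers
# separated by whitespace.
m = ~df['q1'].str.contains(r'\d+\.\d+\s\d+\.\d+')
clean = df[m]

# With the offending rows gone, converting back to float succeeds.
clean = clean.astype(float)
print(clean.dtypes)
```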

Method 2: searching all columns

If you are not sure which column is giving the problem, we can use DataFrame.apply with str.contains and then drop the rows which have any cell with multiple values:

m = ~df.apply(lambda x: x.str.contains(r'\d+\.\d+\s\d+\.\d+')).any(axis=1)
df[m]

Output

     q1    q2    q3    q4
0  4.22  4.11  3.89  4.11
1     5  2.36  3.68  4.23

Method 3: removing rows with whitespace (kinda dangerous)

First we strip whitespace from the left and right borders of each cell, then remove the rows which still have whitespace in between:

df = df.apply(lambda x: x.str.strip())

m = ~df.apply(lambda x: x.str.contains(r'\s')).any(axis=1)
df[m]

Output

     q1    q2    q3    q4
0  4.22  4.11  3.89  4.11
1     5  2.36  3.68  4.23
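A minimal sketch of why the strip step matters (sample data is an assumption, with border whitespace added to show the effect): without str.strip, a harmless value like ' 4.22' would also match the whitespace filter and be dropped.

```python
import pandas as pd

df = pd.DataFrame({
    'q1': [' 4.22', '5 ', '1.2 4.63'],
    'q2': ['4.11', '2.36', '4.28'],
})

# Strip border whitespace first so it does not trigger the filter...
df = df.apply(lambda x: x.str.strip())

# ...then drop any row that still has whitespace inside a cell.
m = ~df.apply(lambda x: x.str.contains(r'\s')).any(axis=1)
print(df[m])
```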

2 Comments

I think maybe it would be better to add astype(float) at the end?
@Erfan Thank you! Method 1 worked like a charm. Something I did notice is that all the conflicting rows will have the issue in the first column. For some reason, the other 2 methods didn't work, particularly the 3rd one. Oh well. I really appreciate your help and effort

Since you mentioned converting to numeric, we can use to_numeric: every cell that cannot be converted to a number will be cast to NaN, and then we dropna:

df = df.apply(pd.to_numeric, errors='coerce').dropna()
df
Out[388]: 
     q1    q2    q3    q4
0  4.22  4.11  3.89  4.11
1  5.00  2.36  3.68  4.23
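To make the mechanism explicit (sample data reconstructed from the question): pd.to_numeric with errors='coerce' turns any cell it cannot parse into NaN, and dropna then removes the whole row containing it.

```python
import pandas as pd

df = pd.DataFrame({
    'q1': ['4.22', '5', '1.2 4.63'],
    'q2': ['4.11', '2.36', '4.28'],
})

# '1.2 4.63' cannot be parsed as a single number, so it becomes NaN...
coerced = df.apply(pd.to_numeric, errors='coerce')

# ...and dropna removes that entire row.
clean = coerced.dropna()
print(clean)
```

Note that this silently drops every row with any unparseable cell, so it is worth checking how many rows survive; if everything turns into NaN (as the comment below reports), the cells likely contain something other than plain number strings.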

1 Comment

I have tried that one before. All the values in the dataframe turned into NaN.
