I have a dataframe that I previously transposed. Before the transposition, the numerical columns were float64, as expected. After the transpose, however, the float values turned into strings. I tried to convert the dataframe back using .astype('float'), but it raised an exception because some cells contained two values, like '4.32 6.50'.

I tried using a regex, but when I get it to work it only returns something like:

False False False ... False False

my dataframe looks like this:

q1         q2     q3     q4
4.22       4.11   3.89   4.11
5          2.36   3.68   4.23
1.2 4.63   4.28   5.67   4.87

There are over 1000 rows, and multiple problematic rows are scattered throughout the dataframe. I don't know which ones they are, so removing them manually isn't an option.

I tried the following code:

final = final[~final['q1'].str.contains(r"\d+\.\d\s\d+\.\d", na=False)]

But, the problematic row is still there.

The final result looks like this:

q1 q2 q3 q4

All the rows are gone, but not all of them were problematic.

  • Why is your final result just the column names? Do you want to drop all the rows which have two values in one cell? Commented Jul 28, 2019 at 1:08
  • That's what happened when I executed the code above; I'm not really sure why. And yes, the only thing I want is to drop the rows that have two values in a cell. Commented Jul 28, 2019 at 1:11
  • Try final = final[~final['q1'].str.contains(r'\.{1}', na=False)] Commented Jul 28, 2019 at 1:18
  • What if all rows have whole numbers? @GustavoGradvohl Commented Jul 28, 2019 at 1:19

2 Answers

You were quite close with your regex; there were just a few small problems.


Method 1: cleaning up a specific column

If you know which column is giving the problem, we can use str.contains on a specific column:

m = ~df['q1'].str.contains(r'\d+\.\d+\s\d+\.\d+')
df[m]

Output

     q1    q2    q3    q4
0  4.22  4.11  3.89  4.11
1     5  2.36  3.68  4.23
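As a self-contained sketch of Method 1 (the sample frame below is reconstructed from the question, assuming every cell became a string after the transpose), including the astype(float) conversion suggested in the comments:

```python
import pandas as pd

# Sample frame mirroring the question: everything is a string after the
# transpose, and the last row has two values packed into one q1 cell.
df = pd.DataFrame({
    'q1': ['4.22', '5', '1.2 4.63'],
    'q2': ['4.11', '2.36', '4.28'],
    'q3': ['3.89', '3.68', '5.67'],
    'q4': ['4.11', '4.23', '4.87'],
})

# Keep only rows whose q1 cell does NOT contain two numbers
# separated by whitespace.
m = ~df['q1'].str.contains(r'\d+\.\d+\s\d+\.\d+')
clean = df[m]

# With the offending rows gone, converting back to float succeeds.
clean = clean.astype(float)
print(clean.dtypes)
```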

Method 2: searching all columns

If you are not sure which column is giving the problem, we can use DataFrame.apply with str.contains and then drop the rows which have any cell with multiple values:

m = ~df.apply(lambda x: x.str.contains(r'\d+\.\d+\s\d+\.\d+')).any(axis=1)
df[m]

Output

     q1    q2    q3    q4
0  4.22  4.11  3.89  4.11
1     5  2.36  3.68  4.23

Method 3: removing rows with whitespace (kinda dangerous)

First we strip whitespace from the left and right borders of each cell, then remove the rows which still have whitespace in between:

df = df.apply(lambda x: x.str.strip())

m = ~df.apply(lambda x: x.str.contains(r'\s')).any(axis=1)
df[m]

Output

     q1    q2    q3    q4
0  4.22  4.11  3.89  4.11
1     5  2.36  3.68  4.23
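A minimal sketch of why the strip step matters (sample data is an assumption, with border whitespace added to show the effect): without str.strip, a harmless value like ' 4.22' would also match the whitespace filter and be dropped.

```python
import pandas as pd

df = pd.DataFrame({
    'q1': [' 4.22', '5 ', '1.2 4.63'],
    'q2': ['4.11', '2.36', '4.28'],
})

# Strip border whitespace first so it does not trigger the filter...
df = df.apply(lambda x: x.str.strip())

# ...then drop any row that still has whitespace inside a cell.
m = ~df.apply(lambda x: x.str.contains(r'\s')).any(axis=1)
print(df[m])
```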

2 Comments

I think maybe it would be better to add astype(float) at the end?
@Erfan Thank you! Method 1 worked like a charm. Something I did notice is that all the conflicting rows will have the issue in the first column. For some reason, the other 2 methods didn't work, particularly the 3rd one. Oh well. I really appreciate your help and effort

Since you mentioned converting to numeric, we can use to_numeric: every cell that cannot be converted to a number will be cast to NaN, and then we dropna:

df = df.apply(pd.to_numeric, errors='coerce').dropna()
df
Out[388]: 
     q1    q2    q3    q4
0  4.22  4.11  3.89  4.11
1  5.00  2.36  3.68  4.23
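To make the mechanism explicit (sample data reconstructed from the question): pd.to_numeric with errors='coerce' turns any cell it cannot parse into NaN, and dropna then removes the whole row containing it.

```python
import pandas as pd

df = pd.DataFrame({
    'q1': ['4.22', '5', '1.2 4.63'],
    'q2': ['4.11', '2.36', '4.28'],
})

# '1.2 4.63' cannot be parsed as a single number, so it becomes NaN...
coerced = df.apply(pd.to_numeric, errors='coerce')

# ...and dropna removes that entire row.
clean = coerced.dropna()
print(clean)
```

Note that this silently drops every row with any unparseable cell, so it is worth checking how many rows survive; if everything turns into NaN (as the comment below reports), the cells likely contain something other than plain number strings.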

1 Comment

I have tried that one before. All the values in the dataframe turned into NaN.
