I'm trying to extract a floating value from a string for a particular column.

Original Output

DATE        strCondition
4/3/2018    2.9
4/3/2018    3.1, text
4/3/2018    2.6 text
4/3/2018    text, 2.7 

and other variations. I've also tried regex, but my knowledge here is limited. This is what I've come up with:

clean = df['strCondition'].str.contains('\d+km')
df['strCondition'] = df['strCondition'].str.extract('(\d+)', expand = False).astype(float)

The output ends up looking like this, showing only the integer part of each number:

DATE        strCondition
4/3/2018    2.0
4/3/2018    3.0
4/3/2018    2.0
4/3/2018    2.0 

My desired output would be along the lines of:

DATE        strCondition
4/3/2018    2.9
4/3/2018    3.1
4/3/2018    2.6
4/3/2018    2.7 

I appreciate your time and inputs!

EDIT: I forgot to mention that in my original dataframe there are strCondition entries similar to

2.9(1.0) #where I would like both numbers to get returned
11/11/2018 #where this date as a string object can be discarded 

Sorry for the inconvenience!

3
  • \d+km doesn't match anything in the string. Commented Nov 11, 2019 at 19:40
  • Try df['float'] = df['strCondition'].str.findall(r'\d+(?:\.\d+)?').apply(', '.join) Commented Nov 11, 2019 at 21:36
  • To avoid matching digits in dates you may use df['float'] = df['strCondition'].str.findall(r'\b(?<!\d/)\d+(?:\.\d+)?\b(?!/\d)').apply(', '.join). Does it work like expected now? Commented Dec 6, 2019 at 11:47
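
For reference, the lookaround pattern from the last comment can be checked outside pandas with the standard `re` module. This is only a sketch: the sample list combines the strings from the question with the two cases from the EDIT, and `find_numbers` is a hypothetical helper name, not something from the original post.

```python
import re

# Pattern from the comment above: a number with an optional decimal part
# that is neither preceded by "<digit>/" nor followed by "/<digit>",
# so the pieces of a date such as 11/11/2018 are skipped.
NUMBER = re.compile(r"\b(?<!\d/)\d+(?:\.\d+)?\b(?!/\d)")


def find_numbers(text):
    """Return every number found in text as a list of strings (hypothetical helper)."""
    return NUMBER.findall(text)


# Sample strings assumed from the question and its EDIT.
samples = ["2.9", "3.1, text", "2.6 text", "text, 2.7 ", "2.9(1.0)", "11/11/2018"]
results = [find_numbers(s) for s in samples]

for sample, numbers in zip(samples, results):
    print(repr(sample), "->", ", ".join(numbers))
```

In pandas the same pattern should slot into `df['strCondition'].str.findall(...)` as in the comment. Rows with no match come back as an empty list, which `', '.join` turns into an empty string.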

3 Answers

10

Try:

df['float'] = df['strCondition'].str.extract(r'(\d+\.\d+)').astype('float')

Output:

       DATE strCondition  float
0  4/3/2018          2.9    2.9
1  4/3/2018    3.1, text    3.1
2  4/3/2018     2.6 text    2.6
3  4/3/2018    text, 2.7    2.7
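
One thing to watch with a pattern of this shape is the dot. Unescaped, `.` matches any character, so `\d+.\d+` also picks up `11/11` from a date string, while `\d+\.\d+` only accepts a literal decimal point. The sketch below checks this with the standard `re` module. The sample strings are taken from the question's EDIT, and `first_match` is a hypothetical helper.

```python
import re

# Sample strings assumed from the question and its EDIT.
samples = ["2.9(1.0)", "text, 2.7 ", "11/11/2018"]

unescaped = re.compile(r"\d+.\d+")   # "." matches any character
escaped = re.compile(r"\d+\.\d+")    # "\." matches a literal dot only


def first_match(pattern, text):
    """Return the first match as a string, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


loose = [first_match(unescaped, s) for s in samples]
strict = [first_match(escaped, s) for s in samples]

print(loose)
print(strict)
```

Since `float("11/11")` raises a `ValueError`, a column containing date strings would most likely make the `.astype('float')` step fail with the unescaped version. With the escaped dot, those rows simply have no match.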

2 Comments

So, you want the first float and not the ones inside parentheses?
Ideally it would be cool to have it spit out both float values, even the one inside the parentheses. If we can only return the float outside of the parentheses, that is a good solution as well.
0

A simple replace would be

Find (?m)^([\d/]+[ \t]+).*?(\d+\.\d+).*

Replace \1\2

https://regex101.com/r/pVC4jc/1

2 Comments

That does work for some of the strings, but what about the dates as a string in my original data?
@JQJQ - once it is settled what the complete sample text to match looks like, and what it should look like after substitution, a complete solution can be made.
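
As a rough illustration of how this find/replace behaves, here it is applied with Python's `re.sub`. The input is the question's table as plain text, plus one extra row with a date in `strCondition`, which is an assumption based on the EDIT. A row with no decimal number does not match at all, so it is left unchanged rather than discarded.

```python
import re

# The question's table as plain text; the last row is an assumed example
# of a date stored in strCondition, as described in the EDIT.
lines = [
    "DATE        strCondition",
    "4/3/2018    2.9",
    "4/3/2018    3.1, text",
    "4/3/2018    2.6 text",
    "4/3/2018    text, 2.7 ",
    "4/3/2018    11/11/2018",
]
text = "\n".join(lines)

# Group 1: the leading date plus its trailing whitespace.
# Group 2: the first decimal number on the rest of the line.
find = re.compile(r"(?m)^([\d/]+[ \t]+).*?(\d+\.\d+).*")

cleaned = find.sub(r"\1\2", text)
print(cleaned)
```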
0

In case you also want to convert negative numbers with a minus sign at the beginning, this should do it:

df['strCondition'] = df['strCondition'].str.extract(r'(-?\d+\.\d+)', expand=False).astype(float)
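
A minimal sketch of what the optional minus sign does, again using the standard `re` module. The negative and hyphenated samples are made up for illustration (the question shows no negative values), and `first_float` is a hypothetical helper.

```python
import re

# Optional leading minus, digits, a literal dot, digits.
SIGNED_FLOAT = re.compile(r"-?\d+\.\d+")


def first_float(text):
    """Return the first signed decimal in text as a float, or None."""
    m = SIGNED_FLOAT.search(text)
    return float(m.group(0)) if m else None


# Hypothetical samples: a negative value, a positive value, a date string,
# and a hyphen that is a separator rather than a minus sign.
samples = ["-2.9 text", "text, 2.7 ", "11/11/2018", "1-2.5"]
values = [first_float(s) for s in samples]
print(values)
```

Note the last sample: a hyphen directly in front of a decimal is read as a minus sign even when it is a separator, so this only fits data where that cannot happen.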
