I'm using the pandas df.str.replace() function and would like to remove multiple characters from the string.

I'm trying to clean up some transaction data in a CSV file using pandas. I have a column that is storing the amount of the transaction as an Object data type. Before I can change it to a float datatype, I need to remove the $ character and any , characters from numbers greater than 999.99. I've been able to do this one at a time; however, I'd like to know if I can pass in multiple values to clean it up.

2 8/20/2019 Utah Valley Univ UTAH VALLEY UNIV UVU PMT 1 908191 4,825.50

df['Amount'] = df['Amount'].str.replace(r',','').astype(float)

I'd like to remove the '$' and the ',' character at the same time if possible.

  • 3
    df['Amount'] = df['Amount'].str.replace(r'\$|\,', '').astype(float)? Commented Aug 25, 2019 at 16:24
  • 1
    Dear @drewipson , you can choose an answer as given below or comment if anything needed further. Commented Aug 25, 2019 at 17:53

2 Answers

Taking the liberty of borrowing the DataFrame from @Ian.

Another way to do it is with the replace method: pass it a dict together with regex=True to replace multiple values across the column.

>>> df
    amount
0  $25,000
1  $13,000
2  $65,000
3  $19,000
4  $15,000

This simply replaces the $ sign and the comma with the empty string ''.

>>> df['amount'].replace({r'\$': '', ',': ''}, regex=True)
0    25000
1    13000
2    65000
3    19000
4    15000
Name: amount, dtype: object

To convert the values to float, add astype:

>>> df['amount'].replace({r'\$': '', ',': ''}, regex=True).astype(float)
0    25000.0
1    13000.0
2    65000.0
3    19000.0
4    15000.0
Name: amount, dtype: float64
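
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the dict approach. The sample values are copied from the frame above; the helper name clean_amount is hypothetical and not part of pandas.

```python
import pandas as pd


def clean_amount(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from a string column and convert it to float.

    clean_amount is a hypothetical helper name used only for this sketch.
    """
    # Each dict key is a regex pattern. r'\$' escapes the dollar sign,
    # which would otherwise mean "end of string" in a regex.
    return series.replace({r'\$': '', ',': ''}, regex=True).astype(float)


df = pd.DataFrame({'amount': ['$25,000', '$13,000', '$65,000', '$19,000', '$15,000']})
df['amount'] = clean_amount(df['amount'])
print(df['amount'])
```

Wrapping the call in a small function is only a convenience: it keeps the pattern in one place if several columns need the same cleanup.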

Going to steal @political scientist's comment and make it an answer with a little explanation.

Using some fake data:

import pandas as pd
import numpy as np

np.random.seed(1)

df = pd.DataFrame(np.random.randint(5, 100, size=(5,)), columns=['amount']).astype(str)

df.amount = '$' + df.amount + ',' + '000'

print(df)

    amount
0  $42,000
1  $17,000
2  $77,000
3  $14,000
4  $80,000

We have $ and , in our amount column. Using

df.amount.str.replace(r'\$|\,', '', regex=True).astype(float)

We get

0    42000.0
1    17000.0
2    77000.0
3    14000.0
4    80000.0
Name: amount, dtype: float64

Why? Passing regex=True tells the .str.replace() method to treat the pattern as a regular expression. This was the default before pandas 2.0; from 2.0 on the default is regex=False, so it has to be passed explicitly.

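  • The r at the front of the string makes it a "raw" string, so Python passes the backslashes through to the regex engine unchanged
  • \$ says to look for a literal dollar sign (unescaped, $ means "end of string")
  • | is the symbol for or
  • \, says to look for a comma (the backslash is optional here, since a comma has no special meaning in a regex)

Putting the | between the \$ and the \, (without a space!) means to look for either one; every match is replaced with the second argument of the method (aka repl), here the empty string.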
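
As a side note, the same pattern can also be written as a character class, which avoids both the alternation and the escapes. Below is a minimal sketch with made-up sample values; regex=True is passed explicitly so that it behaves the same on pandas versions where the default differs.

```python
import pandas as pd

# Made-up sample values for illustration only.
amounts = pd.Series(['$42,000', '$1,234.50', '17'])

# Alternation: match a literal dollar sign OR a comma.
with_alternation = amounts.str.replace(r'\$|,', '', regex=True).astype(float)

# Character class: inside [...], '$' and ',' are literal, so no escapes are needed.
with_char_class = amounts.str.replace(r'[$,]', '', regex=True).astype(float)

print(with_char_class)
```

Both forms remove the same characters; the character class is just a little easier to extend if more symbols need stripping later.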

Here is a cheat sheet I found that explains other regex characters and how to use them: Regex tutorial — A quick cheatsheet by examples
