I have a data frame df with a column called "Num_of_employees", which has values like 50-100, 200-500, etc. I see a problem with a few values in my data. Wherever the employee number should be 1-10, the data has it as 10-Jan. Also, wherever the value should be 11-50, the data has it as Nov-50. How can I fix this using pandas?
1 Answer
A clean syntax for this kind of "find and replace" uses a dict that maps each bad value to its correction:
df.Num_of_employees = df.Num_of_employees.replace({"10-Jan": "1-10",
"Nov-50": "11-50"})
8 Comments
Joe T. Boka
If you have a large data set, it might be impossible to use replace like this manually.
ComplexData
@JoeR Right! Is there a way to do this that I can apply to a large data set?
piRSquared
I ran this over 100,000,000 rows and it finished in a couple of seconds. IMO, this is your solution.
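To check how the dict-based replace scales on your own machine, something like the following sketch may help. The column contents, the row count (one million here, smaller than the run described above) and the random seed are all assumptions made for illustration. Timings vary with hardware and pandas version, so no figure is claimed.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical synthetic column: one million labels drawn from four values.
labels = np.array(["10-Jan", "Nov-50", "50-100", "200-500"])
rng = np.random.default_rng(0)
df = pd.DataFrame({"Num_of_employees": rng.choice(labels, size=1_000_000)})

# Same mapping as in the answer: bad label -> intended range.
fixes = {"10-Jan": "1-10", "Nov-50": "11-50"}

start = time.perf_counter()
fixed = df["Num_of_employees"].replace(fixes)
elapsed = time.perf_counter() - start

# Report the elapsed time; the value depends on the machine it runs on.
print(f"replaced {len(fixed):,} rows in {elapsed:.3f} s")
```

The cost is driven by the number of rows, not by how the data was entered, so nothing has to be done "manually" per row.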
piRSquared
@user6461192 Yes. There cannot be very many distinct values. You can find them all with df.Num_of_employees.unique() or df.Num_of_employees.value_counts(), then create a dictionary with all offending entries and the corresponding corrections.
piRSquared
You might not be assigning the result back to the column.
df.Num_of_employees.replace({'10-Jan': '1-10', 'Nov-50': '11-50'}) will display the results, but you have to capture them with df.Num_of_employees = df.Num_of_employees.replace({'10-Jan': '1-10', 'Nov-50': '11-50'}). You can check before you write your file with print(df.to_csv()).
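Putting the steps from these comments together, a minimal sketch might look like this. The sample rows are made up for illustration and stand in for the real column. The name `fixes` and the `isin` check at the end are my own additions, not something from the answer.

```python
import pandas as pd

# Hypothetical sample data standing in for the real column.
df = pd.DataFrame(
    {"Num_of_employees": ["10-Jan", "50-100", "Nov-50", "10-Jan", "200-500"]}
)

# Step 1: list the distinct labels and how often each occurs.
counts = df["Num_of_employees"].value_counts()
print(counts)

# Step 2: write a correction for every offending label found in step 1.
fixes = {"10-Jan": "1-10", "Nov-50": "11-50"}

# Step 3: replace() returns a new Series, so assign it back to the column.
df["Num_of_employees"] = df["Num_of_employees"].replace(fixes)

# Step 4: confirm no offending labels remain before writing the file.
remaining = df["Num_of_employees"].isin(list(fixes)).sum()
print(remaining)
print(df.to_csv(index=False))
```

If step 4 prints a non-zero count, the most likely cause is a label that differs slightly from the dictionary key (extra whitespace, different case), which step 1 should reveal.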