
I have 4 columns in my pandas DataFrame, say A, B, C and D, each mapped to a field in the UI. Each field has its own purpose, but users are entering the field A information in any of A, B, C or D. I am trying to clean the data and bring that information into column A for analysis. If there is a value in column A, I don't care about the values in B, C or D. If there is no value in column A, I have to look for the user's entry in the other columns and bring it into column A. Valid values for column A always start with one of the values from our list. So if column A is empty, we check column B: if it holds a value from our list, copy it to A; if B is null or holds some other value, leave it and run the same check on column C, and then on column D. How can I do this in Python?

Please let me know if anything is unclear.

Example,

mylist = ['senior','junior','midlevel']

inputdf

A       B       C          D
senior  male    senior     UK
        senior  candidate  USA
        female  junior
junior  male    junior     AU
        male    candidate  midlevel
        female  candidate  AU


Outputdf,

A         B       C          D
senior    male    senior     UK
senior    senior  candidate  USA
junior    female  junior
junior    male    junior     AU
midlevel  male    candidate  midlevel
          female  candidate  AU

1 Answer


You can use apply with axis=1 to run a function on each row of the df and assign the returned value to column 'A'.

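def func(row):
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement.
    for index_val, series_val in row.items():
        # Return the first value (checked in column order A, B, C, D) found in mylist.
        if series_val in mylist:
            return series_val
    # No listed value anywhere in the row: column A stays empty.
    return None

# axis=1 passes each row to func as a Series.
df['A'] = df.apply(func, axis=1)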

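For each row, this code first checks whether the value in A is present in mylist. If it is, that value is returned; otherwise it moves on to check B, then C, then D. If none of the four columns holds a value from mylist, the function returns None and A is left empty. Note that this is an exact match against mylist, so an existing value in A that is not in the list is overwritten.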
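To try this out end to end, here is a minimal sketch that rebuilds the example frame from the question and applies the same row-wise check. It assumes the empty cells are missing values (None), and the function name first_listed_value is just an illustrative name, not something from the question.

```python
import pandas as pd

mylist = ['senior', 'junior', 'midlevel']

# Sample data from the question; empty cells are assumed to be None.
df = pd.DataFrame({
    'A': ['senior', None, None, 'junior', None, None],
    'B': ['male', 'senior', 'female', 'male', 'male', 'female'],
    'C': ['senior', 'candidate', 'junior', 'junior', 'candidate', 'candidate'],
    'D': ['UK', 'USA', None, 'AU', 'midlevel', 'AU'],
})

def first_listed_value(row):
    # Walk the row in column order and return the first exact match.
    for _, value in row.items():
        if value in mylist:
            return value
    # Nothing from mylist in this row.
    return None

df['A'] = df.apply(first_listed_value, axis=1)
print(df)
```

The last row has no value from mylist in any column, so its A stays empty, which matches the expected output in the question.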


5 Comments

Thanks. However, in some cases the mylist values are present multiple times. For example, there is no value in column A, but junior is in both column B and column C. In that case, this will write a duplicate into column A. How can we stop it from checking further columns once it finds the value the first time?
Once the value is returned, it does not make any further comparisons. As soon as the return statement is executed, the loop ends and the function exits. If you are still facing issues, you could add some more examples (before and after running the code).
Thanks, got it. But I am getting a different error: AttributeError: ("'Series' object has no attribute 'columns'". I think the apply function passes only one column at a time to the function.
Hi, thanks, I sorted out the issue. I used for index_val, series_val in s.iteritems(): if series_val in mylist: return series_val, because apply passes each row to the function as a Series, and iterating over it yields (label, value) tuples. Please update this in your answer and I will then accept it.
I have made the changes.
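As a quick check of the early return, here is a small sketch with a single hypothetical row where junior appears in both B and C. The row is made up for illustration, and items() is used in place of iteritems(), which was removed in pandas 2.0.

```python
import pandas as pd

mylist = ['senior', 'junior', 'midlevel']

def func(row):
    for _, value in row.items():
        if value in mylist:
            # The function exits here, so column C is never inspected.
            return value

# Hypothetical row: A is empty, junior appears in both B and C.
row = pd.Series({'A': None, 'B': 'junior', 'C': 'junior', 'D': 'AU'})
result = func(row)
print(result)
```

The function returns one scalar per row, so A receives a single junior regardless of how many columns contain it.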
