I'm working on a DataFrame in pandas (Python 3) and need to create a new column. I have two similar columns containing strings of different lengths. The new column should return whichever of column 1 or column 2 holds a 13-character value. In Excel I would write it as: c2=if(len(b2)=13,b2,a2) and then copy the formula down.

The code I need interpreted is:

df = pd.read_csv("example15.csv")

#create a new column with if-then statement
df['13_digit_#'] = (df.column1 len = 13 or df.column2 len = 13)

How would I rewrite the last line? Thanks much!

  • All the columns of your dataframe should return the same len(col) value. That is, it's not possible to have a dataframe with columns of different lengths. Do you mean some of the columns have missing observations and others do not? e.g. df[col1] = [a,b,c,d, N/A], df[col2] = [a,b,c,d, e]? Commented Oct 3, 2016 at 13:51
  • measure_theory - I meant that the results in each of those columns are either blank, have one or two digits, or have 13. Seeking to have the new column "clean up the data" by only giving the result with 13 characters in length. Commented Oct 3, 2016 at 14:06

2 Answers

I think you can use numpy.where with str.len or apply(len):

df['13_digit_#'] = np.where((df.column1.str.len() == 13) | 
                            (df.column2.str.len() == 13), 'a', 'b')

Or, if you need column1 where its length is 13 and column2 otherwise:

df['13_digit_#'] = np.where(df.column1.str.len() == 13, df.column1, df.column2)

Sample:

df = pd.DataFrame({'column1':['0123456789abc','a','b'],
                   'column2':['abcabcabcabca','c','d']})

print (df)
         column1        column2
0  0123456789abc  abcabcabcabca
1              a              c
2              b              d

df['13_digit_#'] = np.where(df.column1.str.len() == 13, df.column1, df.column2)
#df['13_digit_#'] = np.where(df.column1.apply(len) == 13, df.column1, df.column2)
print (df)
         column1        column2     13_digit_#
0  0123456789abc  abcabcabcabca  0123456789abc
1              a              c              c
2              b              d              d
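
The asker's comment mentions values that are blank, one or two digits, or 13 characters long. As a rough sketch under that description (the sample data below is made up, and leaving the result missing when neither column has 13 characters is an assumption, not something the question specifies), both columns can be checked in turn:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: blanks (NaN), short strings, and 13-character strings
df = pd.DataFrame({
    'column1': ['0123456789abc', '12', np.nan, '7'],
    'column2': ['5', 'abcabcabcabca', '9876543210xyz', np.nan],
})

# str.len() gives a missing value for NaN entries, so those comparisons are False
is13_col1 = df['column1'].str.len() == 13
is13_col2 = df['column2'].str.len() == 13

# Take column1 where it has 13 characters, else column2 where it has 13,
# else leave the result missing (an assumption about the desired fallback)
df['13_digit_#'] = df['column1'].where(is13_col1, df['column2'].where(is13_col2))
print(df)
```

Note that the .str accessor only works on string data. If read_csv parsed these columns as numbers, passing dtype=str when reading is one way to keep them as text.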

2 Comments

Used the second version, and that checks out. Thanks again jezrael! It's a huge dataset and I got warnings: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer, col_indexer] = value instead." That's OK; the results work and are exporting nicely.
Glad I could help!
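
On the warning quoted in the comment: it typically appears when the DataFrame being modified was itself produced by slicing or filtering another DataFrame. How the commenter's df was built is not shown, so the following is only a sketch under that assumption, with a made-up `keep` column standing in for whatever filter was used:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the large dataset; 'keep' is an invented filter column
full = pd.DataFrame({
    'column1': ['0123456789abc', 'a', 'b'],
    'column2': ['x', 'abcabcabcabca', 'd'],
    'keep': [True, True, False],
})

# An explicit copy makes the filtered rows an independent DataFrame,
# so adding a column to it does not raise the warning
subset = full[full['keep']].copy()
subset['13_digit_#'] = np.where(subset['column1'].str.len() == 13,
                                subset['column1'], subset['column2'])
print(subset)
```

The original `full` frame is left unchanged; only `subset` gains the new column.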

Assuming the blank, or missing, elements of each column are NaN, the following drops the column that doesn't have the full number of observations and saves the remaining column as a new column in your dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[1,2,3], 'b':[1,2,np.nan], 'c':[1, np.nan, np.nan]})

df['newcol'] = df[['a','b']].dropna(axis = 1, how = 'any')

In the last line, axis = 1 tells dropna to operate on columns (a and b), and how = 'any' tells it to drop any column that contains a missing value; the remaining column is saved as 'newcol'.
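
To make the effect of those two arguments concrete, here is a small sketch on a two-column frame (the data is illustrative only) that inspects which columns survive the call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, np.nan]})

# axis=1 drops columns; how='any' drops a column if it has at least one NaN
kept = df[['a', 'b']].dropna(axis=1, how='any')
print(kept.columns.tolist())
```

Note that this removes whole columns rather than choosing a value row by row.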

1 Comment

Oh no, I don't want to drop any data. Either column may have the 13-digit string; I just want the new column to look at both old columns and use the value that has the 13-digit string.
