2

I am working on a huge dataframe and trying to create a new column, based on a condition in another column. Right now, I have a big while-loop and this calculation takes too much time, is there an easier way to do it?

With lambda for example?:

def promo(dataframe, a):  
    i=0
    while i < len(dataframe)-1:
        i=i+1
        if dataframe.iloc[i-1,5] >= a:
            dataframe.iloc[i-1,6] = 1
        else:
            dataframe.iloc[i-1,6] = 0

    return dataframe
3
  • 1
    it's easier to understand if you provide some explanation of trying to create a new column, based on a condition in another column Commented Sep 11, 2018 at 6:56
  • If you're using loops with dataframe rows, you're usually doing something wrong. Please show example input and output Commented Sep 11, 2018 at 6:58
  • 1
    @cricket_007 particularly indexing for-loops. If you have to loop over rows, then you should at least use itertuples. Commented Sep 11, 2018 at 7:02

1 Answer 1

2

Don't use loops in pandas, they are slow compared to a vectorized solution - convert boolean mask to integers by astype True, False are converted to 1, 0:

dataframe = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':list('aaabbb'),
                   'F':[5,3,6,9,2,4],
                   'G':[5,3,6,9,2,4]
})

a = 5
dataframe['new'] = (dataframe.iloc[:,5] >= a).astype(int)
print (dataframe)
   A  B  C  D  E  F  G  new
0  a  4  7  1  a  5  5    1
1  b  5  8  3  a  3  3    0
2  c  4  9  5  a  6  6    1
3  d  5  4  7  b  9  9    1
4  e  5  2  1  b  2  2    0
5  f  4  3  0  b  4  4    0

If you want to overwrite the 7th column:

a = 5
dataframe.iloc[:,6] = (dataframe.iloc[:,5] >= a).astype(int)
print (dataframe)
   A  B  C  D  E  F  G
0  a  4  7  1  a  5  1
1  b  5  8  3  a  3  0
2  c  4  9  5  a  6  1
3  d  5  4  7  b  9  1
4  e  5  2  1  b  2  0
5  f  4  3  0  b  4  0
Sign up to request clarification or add additional context in comments.

11 Comments

@M.T - Thank you, +3
I do not understand why you are using iloc[]. In this case I would use: df.loc[:, 5] or df[5]. Can you explain me if there is any difference in performance?
@AntonioAndrés - It depends if want seelct 6th column like in my solution, then need dataframe.iloc[:,5]. But if default columns names by RangeIndex then is possible use df[5] or df.loc[:, 5]
Ok, I understand you, iloc[] is for position also in columns not only in rows. Thanks!
@PV8 - Then use dataframe['new'] = np.where(dataframe.iloc[:,5] >= a, 1, 2)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.