I am working on a dataset stored in the following dataframe.

#print(old_df)
   col1  col2  col3
0     1    10   1.5
1     1    11   2.5
2     1    12   5,6
3     2    10   7.8
4     2    24   2.1
5     3    10   3.2
6     4    10  22.1
7     4    11   1.3
8     4    89   0.5
9     4    91   3.3

I am trying to generate another data frame that uses selected col1 values as the index and selected col2 values as the columns, with each cell holding the corresponding col3 value.

Eg:

selected_col1 = [1,2]
selected_col2 = [10,11,24]

The new data frame should look like:

#print(selected_df)
    10   11   24
1  1.5  2.5  NaN
2  7.8  NaN  2.1

I have tried the following method:

import pandas as pd

selected_col1 = [1, 2]
selected_col2 = [10, 11, 24]
# Empty frame: selected col1 values as index, selected col2 values as columns
selected_df = pd.DataFrame(index=selected_col1, columns=selected_col2)
for col1_value in selected_col1:
    for col2_value in selected_col2:
        # Each query scans the whole of old_df once per (col1, col2) pair
        qry = 'col1 == {} & col2 == {}'.format(col1_value, col2_value)
        col3_value = old_df.query(qry).col3.values
        if len(col3_value) > 0:
            selected_df.at[col1_value, col2_value] = col3_value[0]

But because my dataframe has around 20 million rows, this brute-force method takes a long time. Is there a better way?

1 Answer

First filter the rows by membership with Series.isin on both columns, combining the two masks with & (bitwise AND), and then use DataFrame.pivot:

df = df[df['col1'].isin(selected_col1) & df['col2'].isin(selected_col2)]

df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2   10   11   24
col1               
1     1.5  2.5  NaN
2     7.8  NaN  2.1
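
For reference, here is a self-contained sketch of the same approach on the sample data from the question. It assumes the `5,6` entry in col3 is meant to be `5.6`. The final `reindex` is an optional extra step, not part of the answer above: it forces the result to have exactly the selected rows and columns, in the given order, even when a selected value has no match after filtering.

```python
import pandas as pd

# Sample data from the question (assumption: "5,6" is read as 5.6)
old_df = pd.DataFrame({
    "col1": [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "col2": [10, 11, 12, 10, 24, 10, 10, 11, 89, 91],
    "col3": [1.5, 2.5, 5.6, 7.8, 2.1, 3.2, 22.1, 1.3, 0.5, 3.3],
})

selected_col1 = [1, 2]
selected_col2 = [10, 11, 24]

# Keep only rows whose col1 AND col2 are both in the selected lists
mask = old_df["col1"].isin(selected_col1) & old_df["col2"].isin(selected_col2)

selected_df = (
    old_df[mask]
    .pivot(index="col1", columns="col2", values="col3")
    # Optional: guarantee the exact requested shape and order
    .reindex(index=selected_col1, columns=selected_col2)
)
print(selected_df)
```

The filtering is done in one vectorized pass per column, so the nested loop with one query per pair is no longer needed.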

If duplicated col1/col2 pairs are possible after filtering, use DataFrame.pivot_table, which aggregates them:

df = df.pivot_table(index='col1',columns='col2',values='col3', aggfunc='mean')
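
A small sketch of why this matters, using made-up data (`dup_df` is hypothetical, not from the question) in which the pair (1, 10) appears twice: `pivot` refuses to reshape it, while `pivot_table` aggregates the duplicates with the chosen function.

```python
import pandas as pd

# Hypothetical data: the (col1, col2) pair (1, 10) occurs twice
dup_df = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": [10, 10, 24],
    "col3": [1.0, 3.0, 2.0],
})

# pivot cannot decide which of the duplicated values to keep and raises
pivot_failed = False
try:
    dup_df.pivot(index="col1", columns="col2", values="col3")
except ValueError as exc:
    pivot_failed = True
    print("pivot failed:", exc)

# pivot_table aggregates duplicates; here the mean of 1.0 and 3.0
result = dup_df.pivot_table(index="col1", columns="col2",
                            values="col3", aggfunc="mean")
print(result)
```

Whether `mean` is the right aggregation depends on the data; `first`, `max` or `sum` may be a better fit.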

EDIT:

If you use | for bitwise OR instead, you get a different output, because a row is kept when either column matches:

df = df[df['col1'].isin(selected_col1) | df['col2'].isin(selected_col2)]

df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2    10   11   12   24
col1                     
1      1.5  2.5  5,6  NaN
2      7.8  NaN  NaN  2.1
3      3.2  NaN  NaN  NaN
4     22.1  1.3  NaN  NaN

8 Comments

I am getting the following error: ValueError: Unstacked DataFrame is too big, causing int32 overflow. By the way, I am using "|" instead of "&" when building the new dataframe.
@SatheeshK - Do you need | for OR rather than & for AND?
@SatheeshK - Unfortunately, the error means the data is very large. What are the lengths of the selected_col1 and selected_col2 lists?
I am using | for OR only. len(selected_col1) = 1894, len(selected_col2) = 8546
@SatheeshK - If I understand correctly, the filter removes only a few rows, so the DataFrame is still very large, and that is the reason for the strange error. One more thing: you can try upgrading to the latest pandas version, which may help.
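
One way to see the problem before pivoting is to estimate how many cells the result would have. The sketch below rests on an assumption about pandas internals: that the overflow check compares the number of distinct index values times the number of distinct column values against the int32 maximum. The helper `pivot_cell_count` is a hypothetical name, not a pandas function.

```python
import numpy as np
import pandas as pd

INT32_MAX = np.iinfo(np.int32).max  # 2147483647


def pivot_cell_count(df, index_col, columns_col):
    """Rows x columns of the frame that a pivot on these columns would build."""
    return df[index_col].nunique() * df[columns_col].nunique()


# Tiny stand-in for the filtered frame
small = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": [10, 11, 24],
    "col3": [1.5, 2.5, 2.1],
})

cells = pivot_cell_count(small, "col1", "col2")
too_big = cells > INT32_MAX
print(cells, too_big)
```

With & the result can have at most 1894 * 8546 = 16,186,124 cells, which is well below the int32 limit. With | every col1 value that occurs with a selected col2 value (and vice versa) survives the filter, so the number of distinct values on each axis, and therefore the cell count, can be far larger.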