Group by rows with null values in pandas data frame

Question

DetectedText	Confidence	image_name	ref	all_text
INY	73.215164	14.jpeg	NaN	NaN
9:36	91.633514	14.jpeg	NaN	NaN
MICKEYD19	89.422897	14.jpeg	NaN	NaN
Ln	59.588081	14.jpeg	NaN	NaN
ADULT	98.488983	14.jpeg	9b01dc1e	NaN
Ln	59.588081	15.jpeg	NaN	NaN
ADULT	98.488983	15.jpeg	NaN	NaN

This is what my dataframe looks like. I want to group the rows into one per IMAGE_NAME, merge the contents of DETECTEDTEXT into ALL_TEXT, and keep the REF where it is non-null together with the CONFIDENCE from that row. If an image (e.g. 15.jpeg) has only null values in the REF column, I still want to merge DETECTEDTEXT into ALL_TEXT, but set CONFIDENCE to null.

Expected result:

Confidence	image_name	ref	all_text
98.488983	14.jpeg	9b01dc1e	INY 9:36 MICKEYD19 Ln ADULT
NaN	15.jpeg	NaN	Ln ADULT

I tried using groupby for each of my requirements individually, but the error I get is `TypeError: sequence item 0: expected string, int found`.

Will the 'REF' column contain only string values and NaN?
– sharathnatraj, Commented Dec 10, 2020 at 0:47

sharathnatraj · Accepted Answer · 2020-12-10 02:13:40Z

Please try:

Option#1:

import numpy as np
# Sort NaN REFs first so the row with a non-null REF is the last one in each group
df1 = df.sort_values(['IMAGE_NAME', 'REF'], na_position='first')
df1 = df1.groupby('IMAGE_NAME').agg({'DETECTEDTEXT': ' '.join, 'REF': 'last', 'CONFIDENCE': 'last'}).reset_index()[['IMAGE_NAME', 'REF', 'CONFIDENCE', 'DETECTEDTEXT']]
# Images without any REF: blank out CONFIDENCE
df1.loc[df1['REF'].isnull(), 'CONFIDENCE'] = np.nan
df1.rename(columns={'DETECTEDTEXT': 'ALL_TEXT'}, inplace=True)

Option#2:

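# Fill NaN with '0' so that 'max' picks the real REF when one exists
df1 = df.fillna('0')
df1 = df1.groupby('IMAGE_NAME').agg({'DETECTEDTEXT': ' '.join, 'REF': 'max'}).reset_index()
# Look up the CONFIDENCE of the row carrying that REF; '0' matches nothing, so it becomes NaN
df1 = df1.merge(df, on=['IMAGE_NAME', 'REF'], how='left')[['IMAGE_NAME', 'REF', 'CONFIDENCE', 'DETECTEDTEXT_x']]
df1 = df1.rename(columns={'DETECTEDTEXT_x': 'ALL_TEXT'})
df1['REF'] = df1.REF.replace('0', np.nan)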

Both print:

  IMAGE_NAME       REF  CONFIDENCE                     ALL_TEXT
0    14.jpeg  9b01dc1e   98.488983  INY 9:36 MICKEYD19 Ln ADULT
1    15.jpeg       NaN         NaN                     Ln ADULT

Input df:

  DETECTEDTEXT  CONFIDENCE IMAGE_NAME       REF  ALL_TEXT
0          INY   73.215164    14.jpeg       NaN       NaN
1         9:36   91.633514    14.jpeg       NaN       NaN
2    MICKEYD19   89.422897    14.jpeg       NaN       NaN
3           Ln   59.588081    14.jpeg       NaN       NaN
4        ADULT   98.488983    14.jpeg  9b01dc1e       NaN
5           Ln   59.588081    15.jpeg       NaN       NaN
6        ADULT   98.488983    15.jpeg       NaN       NaN
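
For anyone who wants to reproduce this, the input frame above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the printout, and it assumes REF holds strings and NaN only.

```python
import numpy as np
import pandas as pd

# Rebuild the sample input shown above (values copied from the printout)
df = pd.DataFrame({
    'DETECTEDTEXT': ['INY', '9:36', 'MICKEYD19', 'Ln', 'ADULT', 'Ln', 'ADULT'],
    'CONFIDENCE': [73.215164, 91.633514, 89.422897, 59.588081,
                   98.488983, 59.588081, 98.488983],
    'IMAGE_NAME': ['14.jpeg'] * 5 + ['15.jpeg'] * 2,
    'REF': [np.nan, np.nan, np.nan, np.nan, '9b01dc1e', np.nan, np.nan],
    'ALL_TEXT': np.nan,  # empty column, broadcast to every row
})
```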

Option#1: This is the more elegant of the two and came to me after I wrote Option#2. It simply sorts by IMAGE_NAME and REF and then uses groupby.

Option#2: First, all NaNs are replaced with '0' for ease of calculation. The groupby with 'REF': 'max' then returns 9b01dc1e for 14.jpeg and '0' for 15.jpeg. pd.merge then picks the CONFIDENCE score corresponding to those REF values: for 14.jpeg it returns the match for 9b01dc1e from the original df, and for 15.jpeg it returns NaN since nothing in the original df has REF '0'. So we get the required output.

Note: The code may need some changes if an image can have multiple non-null REF values. In that case some extra pre-processing would be needed first. Other than that, this should work.
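
If an image can carry several non-null REF values, one possible variant is sketched below. It is not part of the two options above: the helper name collapse_images is made up for illustration, and it assumes that the first non-null REF per image is the one to keep. It also casts DETECTEDTEXT to str, since ' '.join raises the TypeError from the question when the column contains non-string values such as ints.

```python
import numpy as np
import pandas as pd

def collapse_images(df):
    """One row per IMAGE_NAME (hypothetical helper, see assumptions above)."""
    # Cast to str so that ' '.join does not fail on ints or floats
    text = (df['DETECTEDTEXT'].astype(str)
              .groupby(df['IMAGE_NAME']).agg(' '.join)
              .rename('ALL_TEXT'))
    # Assumption: keep the first row with a non-null REF for each image
    picked = (df.dropna(subset=['REF'])
                .drop_duplicates('IMAGE_NAME')
                .set_index('IMAGE_NAME')[['REF', 'CONFIDENCE']])
    # Left join on the index: images without any REF get NaN in REF and CONFIDENCE
    out = text.to_frame().join(picked).reset_index()
    return out[['IMAGE_NAME', 'REF', 'CONFIDENCE', 'ALL_TEXT']]
```

Calling collapse_images(df) on the input df above should give the same two rows as both options. Whether "first non-null REF" is the right tie-break depends on your data.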
