DetectedText  Confidence  image_name  ref       all_text
INY           73.215164   14.jpeg     NaN       NaN
9:36          91.633514   14.jpeg     NaN       NaN
MICKEYD19     89.422897   14.jpeg     NaN       NaN
Ln            59.588081   14.jpeg     NaN       NaN
ADULT         98.488983   14.jpeg     9b01dc1e  NaN
Ln            59.588081   15.jpeg     NaN       NaN
ADULT         98.488983   15.jpeg     NaN       NaN

This is what my dataframe looks like. I want to collapse the rows into one row per IMAGE_NAME, joining the contents of DETECTEDTEXT into ALL_TEXT. Where REF has a non-null value, I want to keep that REF and the CONFIDENCE from the same row. If an image (e.g. 15.jpeg) has only null values in the REF column, I still want to merge DETECTEDTEXT into ALL_TEXT, but set CONFIDENCE to null.

Expected result:

Confidence  image_name  ref       all_text
98.488983   14.jpeg     9b01dc1e  INY 9:36 MICKEYD19 Ln ADULT
NaN         15.jpeg     NaN       Ln ADULT

I tried using groupby for each of my requirements individually, but I get the error `TypeError: sequence item 0: expected string, int found`.

  • Will the 'REF' column contain only string values and NaN? Commented Dec 10, 2020 at 0:47

1 Answer


Please try:

Option#1:

import numpy as np
# Put NaN REFs first so the row with a non-null REF is the last one in each group
df1 = df.sort_values(['IMAGE_NAME','REF'], na_position='first')
df1 = df1.groupby('IMAGE_NAME').agg({'DETECTEDTEXT' : ' '.join , 'REF': 'last','CONFIDENCE':'last'}).reset_index()[['IMAGE_NAME','REF','CONFIDENCE','DETECTEDTEXT']]
# Images without any REF get a null CONFIDENCE
df1.loc[df1['REF'].isnull(), 'CONFIDENCE'] = np.nan
df1.rename(columns={'DETECTEDTEXT':'ALL_TEXT'},inplace=True)

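Option#2:

import numpy as np
# Replace NaN with the string '0' so that 'max' works on REF
df1 = df.fillna('0')
df1 = df1.groupby('IMAGE_NAME').agg({'DETECTEDTEXT' : ' '.join , 'REF': 'max'}).reset_index()
# Look up the CONFIDENCE that belongs to each (IMAGE_NAME, REF) pair in the original df
df1 = df1.merge(df,on=['IMAGE_NAME','REF'], how='left')[['IMAGE_NAME','REF','CONFIDENCE','DETECTEDTEXT_x']]
df1 = df1.rename(columns={'DETECTEDTEXT_x' : 'ALL_TEXT'})
# Turn the '0' placeholder back into NaN
df1['REF'] = df1.REF.replace('0',np.nan)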

Both print:

  IMAGE_NAME       REF  CONFIDENCE                     ALL_TEXT
0    14.jpeg  9b01dc1e   98.488983  INY 9:36 MICKEYD19 Ln ADULT
1    15.jpeg       NaN         NaN                     Ln ADULT

Input df:

  DETECTEDTEXT  CONFIDENCE IMAGE_NAME       REF  ALL_TEXT
0          INY   73.215164    14.jpeg       NaN       NaN
1         9:36   91.633514    14.jpeg       NaN       NaN
2    MICKEYD19   89.422897    14.jpeg       NaN       NaN
3           Ln   59.588081    14.jpeg       NaN       NaN
4        ADULT   98.488983    14.jpeg  9b01dc1e       NaN
5           Ln   59.588081    15.jpeg       NaN       NaN
6        ADULT   98.488983    15.jpeg       NaN       NaN

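Option#1: This is the more elegant of the two and came to me after I wrote Option#2. Sort by IMAGE_NAME and REF so that the row with a non-null REF ends up last within each image, then take 'last' for REF and CONFIDENCE in the groupby.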
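For anyone who wants to run the sort-then-groupby idea end to end, here is a minimal sketch. The input frame is rebuilt by hand from the question. The `astype(str)` cast is an assumption about the cause of the `TypeError` in the question (a non-string value reaching `' '.join`), and `na_position='first'` is spelled out so that the row carrying a REF is the last one in its group.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question.
df = pd.DataFrame({
    "DETECTEDTEXT": ["INY", "9:36", "MICKEYD19", "Ln", "ADULT", "Ln", "ADULT"],
    "CONFIDENCE": [73.215164, 91.633514, 89.422897, 59.588081, 98.488983,
                   59.588081, 98.488983],
    "IMAGE_NAME": ["14.jpeg"] * 5 + ["15.jpeg"] * 2,
    "REF": [np.nan, np.nan, np.nan, np.nan, "9b01dc1e", np.nan, np.nan],
})

# Assumption: casting to str guards against non-string values in DETECTEDTEXT,
# which would make ' '.join raise a TypeError.
df["DETECTEDTEXT"] = df["DETECTEDTEXT"].astype(str)

# NaN REFs go first, so the row that carries a REF is the last one per image.
ordered = df.sort_values(["IMAGE_NAME", "REF"], na_position="first")

out = (
    ordered.groupby("IMAGE_NAME")
    .agg(
        ALL_TEXT=("DETECTEDTEXT", " ".join),
        REF=("REF", "last"),
        CONFIDENCE=("CONFIDENCE", "last"),
    )
    .reset_index()
)

# Images with no REF at all get a null CONFIDENCE.
out.loc[out["REF"].isna(), "CONFIDENCE"] = np.nan
print(out)
```

Named aggregation is used here only to rename DETECTEDTEXT to ALL_TEXT in the same step; the dict form of `agg` followed by `rename` should behave the same way.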

Option#2: First replace all NaNs with the string '0' for ease of calculation. The groupby with 'REF': 'max' then returns 9b01dc1e for 14.jpeg and '0' for 15.jpeg. Next, pd.merge picks up the CONFIDENCE score corresponding to those REF values: for 14.jpeg it finds the matching 9b01dc1e row in the original df, and for 15.jpeg it returns NaN because '0' has no match in the original df. So we get the required output.
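The merge step is the least obvious part, so here is a small sketch that makes it visible. It assumes '0' never occurs as a real REF value, fills only the REF column, and adds `indicator=True` purely for illustration so you can see which images found a matching row in the original frame.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question.
df = pd.DataFrame({
    "DETECTEDTEXT": ["INY", "9:36", "MICKEYD19", "Ln", "ADULT", "Ln", "ADULT"],
    "CONFIDENCE": [73.215164, 91.633514, 89.422897, 59.588081, 98.488983,
                   59.588081, 98.488983],
    "IMAGE_NAME": ["14.jpeg"] * 5 + ["15.jpeg"] * 2,
    "REF": [np.nan, np.nan, np.nan, np.nan, "9b01dc1e", np.nan, np.nan],
})

# Step 1: replace missing REFs with the placeholder '0'
# (assumed never to be a real REF value).
filled = df.fillna({"REF": "0"})

# Step 2: one row per image; 'max' prefers the real REF over the placeholder
# because '9b01dc1e' compares greater than '0' as a string.
per_image = (
    filled.groupby("IMAGE_NAME")
    .agg(ALL_TEXT=("DETECTEDTEXT", " ".join), REF=("REF", "max"))
    .reset_index()
)

# Step 3: look up CONFIDENCE for each (IMAGE_NAME, REF) pair in the original df.
# The placeholder '0' has no partner there, so 15.jpeg ends up with NaN.
lookup = df[["IMAGE_NAME", "REF", "CONFIDENCE"]]
merged = per_image.merge(lookup, on=["IMAGE_NAME", "REF"], how="left",
                         indicator=True)

# Step 4: turn the placeholder back into NaN.
merged["REF"] = merged["REF"].replace("0", np.nan)
print(merged)
```

The `_merge` column should read `both` for 14.jpeg and `left_only` for 15.jpeg; drop it once you are satisfied the lookup behaves as expected.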

Note: The code may need some changes if the same image can have multiple non-null REF values. In that case some extra pre-processing is needed to decide which REF to keep. Other than that, this should work.
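If an image can carry several non-null REF values, one possible pre-processing rule is to keep the REF whose row has the highest CONFIDENCE. That rule is my assumption, not something stated in the question, and the data below (including the REF `aaaa0000`) is hypothetical. A rough sketch:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 14.jpeg now carries two non-null REF values.
df = pd.DataFrame({
    "DETECTEDTEXT": ["INY", "9:36", "ADULT", "Ln", "ADULT"],
    "CONFIDENCE": [73.2, 91.6, 98.5, 59.6, 98.5],
    "IMAGE_NAME": ["14.jpeg", "14.jpeg", "14.jpeg", "15.jpeg", "15.jpeg"],
    "REF": ["aaaa0000", np.nan, "9b01dc1e", np.nan, np.nan],
})

# All detected text per image, in original row order.
text = (
    df.groupby("IMAGE_NAME")["DETECTEDTEXT"]
    .agg(" ".join)
    .rename("ALL_TEXT")
    .reset_index()
)

# Among rows that have a REF, keep the one with the highest CONFIDENCE per image
# (assumed tie-breaking rule; idxmax returns the first maximum on ties).
with_ref = df.dropna(subset=["REF"])
best_rows = with_ref.groupby("IMAGE_NAME")["CONFIDENCE"].idxmax()
best = with_ref.loc[best_rows, ["IMAGE_NAME", "REF", "CONFIDENCE"]]

# Images without any REF get NaN for REF and CONFIDENCE through the left merge.
out = text.merge(best, on="IMAGE_NAME", how="left")
print(out)
```

Because the REF and CONFIDENCE are picked explicitly rather than through sort order, this variant needs neither the sort nor the '0' placeholder.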
