
I have an SQL query that I want to implement on a pandas dataframe. The SQL filters out a t_id if act_dt is null in more than 90 percent of its rows of type HIST. I have a dataframe with all the columns, but I need to implement this logic in pandas.

Below is the SQL query:

SELECT *
FROM
(
    SELECT
        t_id,
        s_id,
        t_name,
        type,
        act_dt
    FROM
        tblstg
) t
WHERE t.t_id NOT IN
(
    SELECT t_id FROM
    (
        SELECT t_id,
               CASE WHEN (CAST(SUM(CASE WHEN act_dt IS NULL THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*) * 100) > 90
                    THEN 1 ELSE 0 END AS removal_flg
        FROM tblstg
        WHERE type = 'HIST'
        GROUP BY t_id
    ) st
    WHERE st.removal_flg = 1
)
Dataframe:

I have a pandas dataframe like this:

t_id   s_id      t_name    type   act_dt
T1     china     android   HIST   jan
T1     mumbai    android   HIST   feb
T1     dubai     apple     EXT
T2     japan     nokia     HIST
T3     japan     apple     HIST   jan
T3     koria               HIST
T3     japan1    apple     HIST
T3     japan2    apple     HIST
T3     japan3    apple     HIST
T3     japan4    apple     HIST
T3     japan5    apple     HIST
T3     japan6    apple     HIST
T3     japan7    apple     HIST
T3     japan8    apple     HIST
T3     dubai     nokia     EXT

Final result:

t_id   s_id      t_name    type   act_dt
T1     china     android   HIST   jan
T1     mumbai    android   HIST   feb
T1     dubai     apple     EXT    

i.e. drop T2 and T3, because for those t_id values act_dt is null in 90% or more of the type-HIST records, regardless of s_id.

I have written the code below to identify the t_id values where 90% of the act_dt data is null, so I can drop them from the main dataframe raw_data, but it raises an error from the third line onward. How can I get the list of t_id values that meet the criteria?

raw_data['filter_dt'] = raw_data['act_dt'].isnull().astype(int)

filterrecords = raw_data[raw_data.type == 'HIST'].groupby("t_id").filter_dt.sum()

countoftids = raw_data[raw_data.type == 'HIST'].groupby("t_id").count()

finalflg = filterrecords / countoftids

finaltids = raw_data['t_id'][finalflg > 0.90]
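For reference, the division in the snippet above fails because `.count()` on a grouped DataFrame returns one non-null count per column (including the datetime column, hence the `dtype('<M8[ns]')` error) rather than a plain row count. A sketch of a working variant, using hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data mirroring the dataframe in the question
raw_data = pd.DataFrame({
    't_id':   ['T1', 'T1', 'T1', 'T2', 'T3', 'T3'],
    's_id':   ['china', 'mumbai', 'dubai', 'japan', 'japan', 'koria'],
    't_name': ['android', 'android', 'apple', 'nokia', 'apple', None],
    'type':   ['HIST', 'HIST', 'EXT', 'HIST', 'HIST', 'HIST'],
    'act_dt': ['jan', 'feb', None, None, 'jan', None],
})

hist = raw_data[raw_data['type'] == 'HIST']

# Null count per t_id divided by total HIST row count per t_id.
# .size() is used instead of .count(): .count() excludes nulls and,
# on a grouped DataFrame, returns one count per column.
null_frac = (hist['act_dt'].isnull().groupby(hist['t_id']).sum()
             / hist.groupby('t_id').size())

# t_ids where more than 90% of HIST act_dt values are null
drop_ids = null_frac[null_frac > 0.90].index.tolist()

result = raw_data[~raw_data['t_id'].isin(drop_ids)]
```

In this small sample only T2 (100% null for HIST) crosses the threshold, so only its rows are dropped.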
       
  • Why don't you make an attempt first? Commented Jun 21, 2017 at 12:32
  • I added the code I tried, but I am struggling to get it working, so I asked the question here. Commented Jun 21, 2017 at 12:42
  • Do you have sample input data and expected output? Commented Jun 21, 2017 at 13:08
  • Added, if that helps. Commented Jun 21, 2017 at 13:30

1 Answer


Let's try this:

df.groupby('t_id').filter(lambda x: (x[x['type'] == "HIST"].act_dt.count() / x[x['type'] == "HIST"].act_dt.values.shape[0]) > .9)

Output:

  t_id    s_id   t_name  type act_dt
0   T1   china  android  HIST    jan
1   T1  mumbai  android  HIST    feb
2   T1   dubai    apple   EXT    NaN

To see the list of unique t_id values retained:

print(df.groupby('t_id').filter(lambda x: (x[x['type'] == "HIST"].act_dt.count() / x[x['type'] == "HIST"].act_dt.values.shape[0]) > .9)['t_id'].unique())

Output:

['T1']

Comments

Thank you Scott! I tried this but got the error: "ufunc true_divide cannot use operands with types dtype('<M8[ns]') and dtype('int64')"
Try that new statement: df.groupby('t_id').filter(lambda x: (x[x['type'] == "HIST"].act_dt.count() / x[x['type'] == "HIST"].act_dt.values.shape[0]) > .9)
Thanks Scott! I got the output. But is there any way to get the unique t_id values that were skipped or dropped, so I can test it?
If you add ['t_id'].unique() to the end of that statement, it will give you a list of unique ids.
Hi Scott, it is working fine, but it is somehow including some other records as well. For example, there is one t_id that has 3 instances of type HIST, of which 2 are null (there are EXT instances as well, but we are only interested in HIST). Any idea why?
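Editorial note on the edge case raised in the last comment: the one-liner keeps a group only when more than 90% of its HIST act_dt values are non-null, which is stricter than the SQL's "drop when more than 90% are null" (a group that is, say, two-thirds null is kept by the SQL but dropped here), and a t_id with no HIST rows at all divides by zero. A sketch that mirrors the SQL condition directly, on hypothetical data:

```python
import pandas as pd

# Hypothetical sample: T4 is half-null for HIST, T5 has no HIST rows,
# T6 is fully null for HIST
df = pd.DataFrame({
    't_id':   ['T4', 'T4', 'T5', 'T6'],
    'type':   ['HIST', 'HIST', 'EXT', 'HIST'],
    'act_dt': ['jan', None, None, None],
})

def keep(g):
    hist = g.loc[g['type'] == 'HIST', 'act_dt']
    # t_ids with no HIST rows never appear in the SQL's NOT IN subquery,
    # so they are kept
    if hist.empty:
        return True
    # drop only when strictly more than 90% of HIST act_dt values are null
    return hist.isnull().mean() <= 0.90

out = df.groupby('t_id').filter(keep)
```

Here T4 (50% null) and T5 (no HIST rows) survive, while T6 (100% null) is dropped, matching the SQL's removal_flg logic.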
