How to identify whether a particular string/pattern exists in a column using PySpark

Question

Below is my sample dataframe for household things.

Here W represents Wooden, G represents Glass, and P represents Plastic, and different items are classified under those categories. I want to identify which items fall in the W, G, and P categories. As an initial step, I tried classifying it for Chair.

M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup',''),
                                ('W-Chair',''),
                                ('W-Shelf;G-Cup;P-Chair',''),
                                ('G-Cup;P-ShowerCap;W-Board','')],
                               ['Household_chores_arrangements','Chair'])

M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|   W-Chair-Shelf;G-Vase;P-Cup|     |
|                      W-Chair|     |
|        W-Shelf;G-Cup;P-Chair|     |
|    G-Cup;P-ShowerCap;W-Board|     |
+-----------------------------+-----+

I tried to do it for one condition where I can mark it as W, but I am not getting the expected results; maybe my condition is wrong.

df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)

Is there a better way to do this in PySpark?

Expected output

+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|   W-Chair-Shelf;G-Vase;P-Cup|    W|
|                      W-Chair|    W|
|        W-Shelf;G-Cup;P-Chair|    P|
|    G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+

Thanks @mck - for the solution.

Update: In addition, I was trying to explore the regexp_extract option further, so I altered the sample set.

M = sqlContext.createDataFrame([('Wooden|Chair',''),
                                ('Wooden|Cup;Glass|Chair',''),
                                ('Wooden|Cup;Glass|Showercap;Plastic|Chair','')],
                               ['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
    select 
        Household_chores_arrangements, 
        nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair 
    from M
""")
display(df)

Result:

+----------------------------------------+------+
|           Household_chores_arrangements| Chair|
+----------------------------------------+------+
|                            Wooden|Chair|Wooden|
|                  Wooden|Cup;Glass|Chair|Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+

I changed the delimiter to | instead of - and made the corresponding change in the query as well. I was expecting the results below, but got a wrong result.

+----------------------------------------+-------+
|           Household_chores_arrangements|  Chair|
+----------------------------------------+-------+
|                            Wooden|Chair| Wooden|
|                  Wooden|Cup;Glass|Chair|  Glass|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+

If only the delimiter is changed, do any other values need to change?

Update 2

I have found the solution for the update above.

For the pipe delimiter, the | in the pattern has to be escaped using four backslashes (\\\\|).

mck · Accepted Answer · 2021-04-21 08:07:29Z

You can use regexp_extract to extract the categories, and if no match is found, replace empty string with null using nullif.

df = spark.sql("""
    select 
        Household_chores_arrangements, 
        nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair 
    from M
""")

df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup   |W    |
|W-Chair                      |W    |
|W-Shelf;G-Cup;P-Chair        |P    |
|G-Cup;P-ShowerCap;W-Board    |null |
+-----------------------------+-----+
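
To see what the capture group is doing without a Spark session, here is a minimal sketch using Python's re module. It assumes that, for a pattern this simple, Python's regex behaviour matches the Java regex engine that Spark uses. The helper name extract_category is made up for illustration; it mimics regexp_extract(..., 1) followed by nullif(..., '').

```python
import re

def extract_category(value, pattern=r"([A-Z])-Chair"):
    """Hypothetical helper: return capture group 1 of the first match, or None.

    Stands in for nullif(regexp_extract(col, pattern, 1), '') in the SQL above.
    """
    match = re.search(pattern, value)
    return match.group(1) if match else None

# Same four rows as the sample dataframe in the question.
rows = [
    "W-Chair-Shelf;G-Vase;P-Cup",
    "W-Chair",
    "W-Shelf;G-Cup;P-Chair",
    "G-Cup;P-ShowerCap;W-Board",
]

results = [extract_category(row) for row in rows]
print(results)  # ['W', 'W', 'P', None]
```

The single uppercase letter immediately before -Chair is what gets captured, which is why the third row yields P rather than W.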

5 Comments

USB Over a year ago

OK! But if it were Wooden instead of W in Household_chores_arrangements, would the pattern need to change accordingly?

mck Over a year ago

Then you need to change the regex pattern to, e.g. '(Wooden|Glass|Plastic)-Chair'
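
As a rough illustration of that suggestion, the sketch below runs the word-based pattern through Python's re module, again assuming it behaves like Spark's Java regex for this pattern. The three sample strings are made up for the example and are not from the question's data.

```python
import re

# Pattern suggested in the comment above: full category names, hyphen delimiter.
pattern = r"(Wooden|Glass|Plastic)-Chair"

# Hypothetical sample values using full words instead of single letters.
samples = [
    "Wooden-Chair",
    "Wooden-Cup;Glass-Chair",
    "Wooden-Cup;Plastic-Vase",
]

categories = []
for value in samples:
    match = re.search(pattern, value)
    # Group 1 is whichever alternative sits directly before "-Chair".
    categories.append(match.group(1) if match else None)

print(categories)  # ['Wooden', 'Glass', None]
```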

USB Over a year ago

So in regexp_extract, the 1 indicates nothing but the group number to be picked as the result, and -Chair is what we are going to match, where - is the delimiter in the sample data. I hope I have that right. This worked well with this example; while investigating regexp_extract further, I tried replacing - with | in the sample data as well as in the query, but it didn't give me the expected result. Let me see if I can post an update in the same question.

USB Over a year ago

I got the solution: for the | delimiter we have to escape it with four backslashes: (Wooden|Glass|Plastic)(\\\\|Chair)

mck Over a year ago

Yes, | is a special regex character that needs to be escaped.
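
The difference between the unescaped and escaped pipe can be reproduced with plain Python regexes, as a sketch. It assumes Python's re agrees with Spark's Java regex on these patterns, and that Spark's SQL parser is in its default mode, where backslashes inside a SQL string literal are themselves unescaped once. The helper first_group and the replace() call that stands in for the SQL unescaping step are simplifications for illustration, not Spark APIs.

```python
import re

rows = [
    "Wooden|Chair",
    "Wooden|Cup;Glass|Chair",
    "Wooden|Cup;Glass|Showercap;Plastic|Chair",
]

def first_group(pattern, value):
    """Hypothetical helper: capture group 1 of the first match, or None."""
    match = re.search(pattern, value)
    return match.group(1) if match else None

# Unescaped: (|Chair) means "empty string OR Chair", so the first category
# found anywhere in the string matches, followed by an empty second group.
unescaped = r"(Wooden|Glass|Plastic)(|Chair)"
# Escaped: \| is a literal pipe, so the category must be followed by "|Chair".
escaped = r"(Wooden|Glass|Plastic)(\|Chair)"

unescaped_result = [first_group(unescaped, row) for row in rows]
escaped_result = [first_group(escaped, row) for row in rows]
print(unescaped_result)  # ['Wooden', 'Wooden', 'Wooden']
print(escaped_result)    # ['Wooden', 'Glass', 'Plastic']

# Why four backslashes in the query: each layer strips one level of escaping.
typed_in_python = "\\\\|"  # four backslashes as typed in a non-raw Python string
seen_by_sql = typed_in_python  # Python already reduced it to two backslashes + pipe
# Simplified stand-in for the SQL string-literal unescaping (assumed default mode).
seen_by_regex = seen_by_sql.replace("\\\\", "\\")
print(len(seen_by_sql), len(seen_by_regex))  # 3 2
```

A character class such as (Wooden|Glass|Plastic)[|]Chair should avoid the backslashes entirely, since | has no special meaning inside square brackets.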
