
Below is my sample dataframe of household items.

Here W represents Wooden, G represents Glass, and P represents Plastic, and each item is listed under one of those categories. I want to identify which category (W, G, or P) an item falls in. As a first step, I tried classifying Chair.

M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup', ''),
                                ('W-Chair', ''),
                                ('W-Shelf;G-Cup;P-Chair', ''),
                                ('G-Cup;P-ShowerCap;W-Board', '')],
                               ['Household_chores_arrangements', 'Chair'])

M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|   W-Chair-Shelf;G-Vase;P-Cup|     |
|                      W-Chair|     |
|        W-Shelf;G-Cup;P-Chair|     |
|    G-Cup;P-ShowerCap;W-Board|     |
+-----------------------------+-----+

I tried it for a single condition, where the row should be marked as W, but I am not getting the expected results. Maybe my condition is wrong.

df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)

Is there a better way to do this in PySpark?

Expected output

+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|   W-Chair-Shelf;G-Vase;P-Cup|    W|
|                      W-Chair|    W|
|        W-Shelf;G-Cup;P-Chair|    P|
|    G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+

Thanks @mck for the solution.

Update: In addition, I was exploring the regexp_extract option further, so I altered the sample set.

M = sqlContext.createDataFrame([('Wooden|Chair', ''),
                                ('Wooden|Cup;Glass|Chair', ''),
                                ('Wooden|Cup;Glass|Showercap;Plastic|Chair', '')],
                               ['Household_chores_arrangements', 'Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
    select 
        Household_chores_arrangements, 
        nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair 
    from M
""")
display(df)

Result:

+----------------------------------------+------+
|           Household_chores_arrangements| Chair|
+----------------------------------------+------+
|                            Wooden|Chair|Wooden|
|                  Wooden|Cup;Glass|Chair|Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+

I changed the delimiter from - to | and updated the query to match. I was expecting the result below, but got the wrong result shown above.

+----------------------------------------+-------+
|           Household_chores_arrangements|  Chair|
+----------------------------------------+-------+
|                            Wooden|Chair| Wooden|
|                  Wooden|Cup;Glass|Chair|  Glass|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+

If only the delimiter is changed, do we need to change anything else?

Update 2

I found the solution to the update above.

The pipe delimiter has to be escaped in the query with four backslashes: (Wooden|Glass|Plastic)(\\\\|Chair)

1 Answer

You can use regexp_extract to extract the category. It returns an empty string when no match is found, so wrap it in nullif to turn that empty string into null.

df = spark.sql("""
    select 
        Household_chores_arrangements, 
        nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair 
    from M
""")

df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup   |W    |
|W-Chair                      |W    |
|W-Shelf;G-Cup;P-Chair        |P    |
|G-Cup;P-ShowerCap;W-Board    |null |
+-----------------------------+-----+
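
To see what the pattern is doing without a Spark session, here is a rough sketch using Python's `re` module on the same four sample strings. The helper name `chair_category` is made up for illustration, and Python's regex engine stands in for the Java one that Spark uses; for a pattern this simple the two should behave the same way.

```python
import re

# Same sample values as the dataframe in the question.
rows = [
    "W-Chair-Shelf;G-Vase;P-Cup",
    "W-Chair",
    "W-Shelf;G-Cup;P-Chair",
    "G-Cup;P-ShowerCap;W-Board",
]


def chair_category(text):
    """Hypothetical helper: return the capital letter right before '-Chair'.

    Returning None on no match plays the role of
    nullif(regexp_extract(...), '') in the SQL version.
    """
    match = re.search(r"([A-Z])-Chair", text)
    return match.group(1) if match else None


labels = [chair_category(r) for r in rows]
print(labels)  # ['W', 'W', 'P', None]
```

The `1` passed to regexp_extract corresponds to `match.group(1)` here: it selects the first parenthesised group, which is the category letter, rather than the whole matched text.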

5 Comments

OK! But if it were Wooden instead of W in Household_chores_arrangements, do we need to change the index accordingly?
Then you need to change the regex pattern to, e.g., '(Wooden|Glass|Plastic)-Chair'
So in regexp_extract, the 1 is nothing but the number of the group whose value is returned as the result, -Chair is the text we are matching, and - is the delimiter in the sample data. I hope I have that right. This worked well with this example. While investigating regexp_extract further, I replaced - with | in both the sample data and the query, but it didn't give me the expected result. Let me see if I can post an update in the same question.
I found the solution: for the | delimiter we have to escape it with four backslashes: (Wooden|Glass|Plastic)(\\\\|Chair)
Yes, | is a special regex character that needs to be escaped.
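
A small sketch of why the unescaped pattern labelled every row Wooden, again using Python's `re` as a stand-in for Spark's regex engine (the helper `first_group` is a made-up name). In `(|Chair)` the bare `|` means "empty string OR Chair", so the second group can match nothing at all and the search stops at the first category word in the string. The character class `[|]` is shown as an alternative that needs no backslashes; it is not from the thread above.

```python
import re

rows = [
    "Wooden|Chair",
    "Wooden|Cup;Glass|Chair",
    "Wooden|Cup;Glass|Showercap;Plastic|Chair",
]


def first_group(pattern, text):
    """Hypothetical helper: group 1 of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


unescaped = r"(Wooden|Glass|Plastic)(|Chair)"     # '|' acts as alternation
escaped = r"(Wooden|Glass|Plastic)(\|Chair)"      # '\|' is a literal pipe
char_class = r"(Wooden|Glass|Plastic)[|]Chair"    # '[|]' is also a literal pipe

unescaped_result = [first_group(unescaped, r) for r in rows]
escaped_result = [first_group(escaped, r) for r in rows]
char_class_result = [first_group(char_class, r) for r in rows]

print(unescaped_result)   # ['Wooden', 'Wooden', 'Wooden']
print(escaped_result)     # ['Wooden', 'Glass', 'Plastic']
print(char_class_result)  # ['Wooden', 'Glass', 'Plastic']

# Inside a normal (non-raw) Python string, four backslashes become two.
sql_fragment = "'(Wooden|Glass|Plastic)(\\\\|Chair)'"
print(sql_fragment)       # '(Wooden|Glass|Plastic)(\\|Chair)'
```

As far as I understand it, the four backslashes come from two layers of unescaping: Python turns `\\\\` into `\\` in the query text, and Spark's SQL string literal (with default parser settings) then turns `\\` into the single `\` that the regex engine needs. That second step is an assumption about the default configuration, so it is worth checking on your own cluster.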
