Consider the following input data:
| incremental_id | session_start_id | session_end_id | items_bought |
|----------------|------------------|----------------|--------------|
| 1              | a                | b              | 1            |
| 2              | z                | t              | 7            |
| 3              | b                | c              | 0            |
| 4              | c                | d              | 3            |
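For reproducibility, the table above could be built like this (a sketch; the `spark.createDataFrame` call assumes an active `SparkSession` named `spark`):

```python
columns = ["incremental_id", "session_start_id", "session_end_id", "items_bought"]
data = [
    (1, "a", "b", 1),
    (2, "z", "t", 7),
    (3, "b", "c", 0),
    (4, "c", "d", 3),
]
# df = spark.createDataFrame(data, columns)  # requires a running SparkSession
```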
Where:
- Each row represents a user session
- Each session records a start/end session id
- We know that rows 1, 3 and 4 belong to the same user, because each session's session_start_id matches the previous session's session_end_id (a → b → c → d). Row 2 instead belongs to a second user
I want to aggregate the above data so that I get:
- The first customer bought 4 items (1 + 0 + 3)
- The second customer bought 7 items
How could this be done in PySpark (or, alternatively, in pure SQL)? I would like to avoid UDFs in PySpark, but it's ok if that's the only way.
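To make the grouping rule concrete, this is the logic I would like to translate, expressed in plain Python (not Spark) under the assumption that each session_end_id is reused as a session_start_id by at most one later session:

```python
# Follow session_end_id -> session_start_id links to chain sessions per user.
rows = [
    (1, "a", "b", 1),
    (2, "z", "t", 7),
    (3, "b", "c", 0),
    (4, "c", "d", 3),
]

# A session starts a new user's chain when its start id is not
# the end id of any other session.
end_ids = {end for _, _, end, _ in rows}
by_start = {start: (end, items) for _, start, end, items in rows}

totals = []
for _, start, end, items in rows:
    if start in end_ids:
        continue  # not a chain root, counted via its predecessor
    total = items
    nxt = end
    while nxt in by_start:  # walk the chain forward
        nxt_end, nxt_items = by_start[nxt]
        total += nxt_items
        nxt = nxt_end
    totals.append(total)

print(sorted(totals))  # -> [4, 7]
```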
Thanks for your help!
Edit:
I have updated the example DataFrame; the incremental_id alone cannot be used to order the rows, since consecutive sessions of the same user are not necessarily adjacent. For example, I assume an additional row `5,d,e,1` would also belong to the first user.
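To illustrate the point of the edit (assuming the extra row `5,d,e,1` from above), the sessions chained from `a` carry incremental ids 1, 3, 4 and 5, with id 2 belonging to another user in between, so a simple sort on incremental_id cannot group them:

```python
rows = [
    (1, "a", "b", 1),
    (2, "z", "t", 7),
    (3, "b", "c", 0),
    (4, "c", "d", 3),
    (5, "d", "e", 1),
]
# Collect the incremental ids of sessions chained from "a"
# by following end -> start links.
by_start = {start: (rid, end) for rid, start, end, _ in rows}
chain_ids, nxt = [1], "b"
while nxt in by_start:
    rid, nxt = by_start[nxt]
    chain_ids.append(rid)
print(chain_ids)  # -> [1, 3, 4, 5]: not contiguous, id 2 is another user
```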