I have a pandas DataFrame containing the following data. The data is sorted by session_id, then datetime (ascending):
df = df.sort_values(['session_id','datetime'],ascending=True)
| session_id | source | datetime |
|---|---|---|
| 1 | 2021-01-23 11:26:34.166000 | |
| 1 | 2021-01-23 11:26:35.202000 | |
| 2 | NULL/NAN | 2021-01-23 11:05:10.001000 |
| 2 | 2021-01-23 11:05:17.289000 | |
| 3 | NULL/NAN | 2021-01-23 13:12:32.914000 |
| 3 | NULL/NAN | 2021-01-23 13:12:40.883000 |
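To make this reproducible, here is a minimal sketch that rebuilds a frame shaped like the one above. The source values "a", "b" and "c" are placeholders I made up for illustration (only null vs. non-null matters here), and the column is called source as in the table.

```python
import numpy as np
import pandas as pd

# Placeholder source values ("a", "b", "c"); only null vs. non-null matters.
df = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3, 3],
    "source": ["a", "b", np.nan, "c", np.nan, np.nan],
    "datetime": pd.to_datetime([
        "2021-01-23 11:26:34.166000",
        "2021-01-23 11:26:35.202000",
        "2021-01-23 11:05:10.001000",
        "2021-01-23 11:05:17.289000",
        "2021-01-23 13:12:32.914000",
        "2021-01-23 13:12:40.883000",
    ]),
})

# Sort by session first, then chronologically within each session.
df = df.sort_values(["session_id", "datetime"], ascending=True)
```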
My desired result is one row per session_id: the row with the first non-null value in the source column, or, if all values are null, the first row of that session (the case for session_id = 3).
| session_id | source | datetime |
|---|---|---|
| 1 | 2021-01-23 11:26:34.166000 | |
| 2 | 2021-01-23 11:05:17.289000 | |
| 3 | NULL/NAN | 2021-01-23 13:12:32.914000 |
The functions first_valid_index and first each get me part of the way to the result I want.
first_valid_index:
- returns the index of the first row with a non-null value. When every value in a group is null it returns None, so there is no row to select and I lose one session_id (3) from my original table.
| session_id | source | datetime |
|---|---|---|
| 1 | 2021-01-23 11:26:34.166000 | |
| 2 | 2021-01-23 11:05:17.289000 | |
x = df.groupby(by="session_id")['om_source'].transform(pd.Series.first_valid_index)
newdf = df[df.index == x]
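A small sketch of what I believe is happening, using made-up data (placeholder values "a", "b", "c" and a column named source rather than om_source): calling first_valid_index on each group separately shows that the all-null session yields None instead of an index label.

```python
import numpy as np
import pandas as pd

# Hypothetical data: session 3 has no non-null source at all.
df = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3, 3],
    "source": ["a", "b", np.nan, "c", np.nan, np.nan],
})

# Ask each group for the label of its first non-null source value.
first_idx = {
    session: group["source"].first_valid_index()
    for session, group in df.groupby("session_id")
}

print(first_idx)  # session 3 maps to None, so no row can be selected for it
```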
first:
- returns the first non-null value, but for each column separately, so values from different rows can end up combined into one row. That is not what I am looking for.
| session_id | source | datetime |
|---|---|---|
| 1 | 2021-01-23 11:26:34.166000 | |
| 2 | 2021-01-23 11:05:10.001000 | |
| 3 | NULL/NAN | 2021-01-23 13:12:32.914000 |
newdf = df.groupby(by="session_id").first()
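To illustrate the per-column behaviour, here is a minimal sketch for session 2 only. The value "c" is a placeholder of mine; the timestamps are the ones from the table.

```python
import numpy as np
import pandas as pd

# Session 2 only: the first row has a null source, the second row does not.
df = pd.DataFrame({
    "session_id": [2, 2],
    "source": [np.nan, "c"],  # "c" is a placeholder value
    "datetime": pd.to_datetime([
        "2021-01-23 11:05:10.001000",
        "2021-01-23 11:05:17.289000",
    ]),
})

# first() skips nulls column by column, so the result mixes two rows:
# source comes from the second row, datetime from the first.
out = df.groupby("session_id").first()
print(out)
```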
I tried to do something like this, but this unfortunately did not work.
df.groupby(by="session_id")['om_source']
.transform(first if ( pd.Series.first_valid_index is None ) else pd.Series.first_valid_index)
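As far as I can tell, the condition in that attempt is evaluated only once, before transform is called: pd.Series.first_valid_index is a function object, so comparing it with None is always False, and the bare name first is never defined. The fallback would have to be decided inside a function that sees each group. A rough sketch of that idea follows, not necessarily the idiomatic way; pick_label is a hypothetical helper name, the data uses placeholder values, and it assumes the index labels are unique.

```python
import numpy as np
import pandas as pd

# Hypothetical data with placeholder source values and a unique default index.
df = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3, 3],
    "source": ["a", "b", np.nan, "c", np.nan, np.nan],
})

def pick_label(series):
    """Label of the first non-null value, or of the first row if all are null."""
    idx = series.first_valid_index()
    return series.index[0] if idx is None else idx

# Decide per group, then select whole rows by label (assumes unique labels).
labels = [pick_label(group["source"]) for _, group in df.groupby("session_id")]
newdf = df.loc[labels]
print(newdf)
```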
Do you have any suggestions? (I am new to pandas and still trying to understand the 'logic' behind it.)
Thanks in advance for your time.