I have a series of timestamps (plus other data) that come from two separate data streams ticking at different rates. An example is below. (NB: the frequency of the real data has some jitter, so it is not a simple fixed stride as in this example.)

src,idx,ts
B,1,20
A,1,100
A,2,200
A,3,300
B,2,320
A,4,400
A,5,500
A,6,600
B,3,620

For each A tick, I need to calculate the offset from the preceding B tick, so the result would become:

src,idx,ts
A,1,80
A,2,180
A,3,280
A,4,80
A,5,180
A,6,280

How can I do this in pandas without iteration?

I thought of some sort of rolling window with a dynamic, criteria-based window, or some hybrid of merge_asof and groupby, but I can't think of a way to do it.

3 Answers

5

You could group the rows by a key that changes at each B row, then subtract the first ts of each group (the B row's ts) from every ts in that group. After that, filter out the B rows to reproduce your desired final df:

import pandas as pd

df = pd.DataFrame(
    {"src": ["B", "A", "A", "A", "B", "A", "A", "A", "B"], 
     "idx": [1, 1, 2, 3, 2, 4, 5, 6, 3], 
     "ts": [20, 100, 200, 300, 320, 400, 500, 600, 620]}
)

df["ts"] -= df.groupby(df.src.eq("B").cumsum())["ts"].transform("first")

df.query("src != 'B'")

More detail:

df.src.eq("B").cumsum() gives a Series which increases by one each time a "B" is encountered. This is exactly the key we need to split the DataFrame into sections between successive "B"s. Each group runs from one B (inclusive) to the following B (exclusive). Within each group, we subtract the ts value of the first row (the B row) from all ts values, so the offset resets to zero at each B.
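
To make the grouping step concrete, here is a minimal sketch that exposes the intermediate columns. The names group_key, b_ts and steps are only illustrative, and it assumes the frame starts with a B row (otherwise the leading A rows would be measured against the first A row instead):

```python
import pandas as pd

df = pd.DataFrame(
    {"src": ["B", "A", "A", "A", "B", "A", "A", "A", "B"],
     "idx": [1, 1, 2, 3, 2, 4, 5, 6, 3],
     "ts": [20, 100, 200, 300, 320, 400, 500, 600, 620]}
)

# Group key: increases by one at every "B" row.
group_key = df["src"].eq("B").cumsum()

# ts of the first row in each group, broadcast back to every row.
# Assumption: the first row of the frame is a B row.
b_ts = df.groupby(group_key)["ts"].transform("first")

# Side-by-side view of the intermediate values and the final offset.
steps = df.assign(group=group_key, b_ts=b_ts, offset=df["ts"] - b_ts)
print(steps)
```

The group column reads 1, 1, 1, 1, 2, 2, 2, 2, 3, and the offset column is zero on every B row.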


2 Comments

Apply should be avoided here.
good point you make. edited.
3

Here is another implementation. I have not benchmarked it. It relies on a forward-fill.

import pandas as pd

df = pd.DataFrame({
    'src': ['B', 'A', 'A', 'A', 'B', 'A', 'A', 'A', 'B'],
    'idx': [1, 1, 2, 3, 2, 4, 5, 6, 3],
    'ts': [20, 100, 200, 300, 320, 400, 500, 600, 620],
})

bts = df.loc[df['src'] == 'B', 'ts'].reindex(df.index, method='ffill')
df['delta'] = df['ts'] - bts
print(df)
  src  idx   ts  delta
0   B    1   20      0
1   A    1  100     80
2   A    2  200    180
3   A    3  300    280
4   B    2  320      0
5   A    4  400     80
6   A    5  500    180
7   A    6  600    280
8   B    3  620      0

If you really only want the A rows, then:

import pandas as pd

df = pd.DataFrame({
    'src': ['B', 'A', 'A', 'A', 'B', 'A', 'A', 'A', 'B'],
    'idx': [1, 1, 2, 3, 2, 4, 5, 6, 3],
    'ts': [20, 100, 200, 300, 320, 400, 500, 600, 620],
})

is_a = df['src'] == 'A'
bts = df.loc[~is_a, 'ts'].reindex(df.index, method='ffill')
df['delta'] = df['ts'] - bts
print(df.loc[is_a, ['idx', 'delta']])
   idx  delta
1    1     80
2    2    180
3    3    280
5    4     80
6    5    180
7    6    280
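
For comparison with the forward-fill above, the merge_asof idea raised in the question can be sketched roughly as follows. This is not part of the original answer. The names a, b, merged and b_ts are illustrative, and it assumes both streams are already sorted by ts, which merge_asof requires:

```python
import pandas as pd

df = pd.DataFrame({
    'src': ['B', 'A', 'A', 'A', 'B', 'A', 'A', 'A', 'B'],
    'idx': [1, 1, 2, 3, 2, 4, 5, 6, 3],
    'ts': [20, 100, 200, 300, 320, 400, 500, 600, 620],
})

# Split the two streams; keep only the B timestamps under a new name.
a = df[df['src'] == 'A']
b = df.loc[df['src'] == 'B', ['ts']].rename(columns={'ts': 'b_ts'})

# For each A row, pick the last B timestamp that is <= the A timestamp.
merged = pd.merge_asof(a, b, left_on='ts', right_on='b_ts', direction='backward')
merged['delta'] = merged['ts'] - merged['b_ts']
print(merged[['idx', 'delta']])
```

Unlike the forward-fill, this matches on the timestamp values rather than on row order, and the result gets a fresh 0..n-1 index.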

2

Another possible solution:

m = df['src'].eq('B')
df.assign(ts = df['ts'].sub(df['ts'].where(m).ffill()))[~m]

It first creates a Boolean mask m to identify rows where src is B. Then, using Series.where, it keeps timestamps only where src is "B" and replaces other entries with NaN; next, Series.ffill forward-fills these timestamps so that each A row gets the timestamp of the preceding B. Finally, the code subtracts this forward-filled B timestamp from each original timestamp via Series.sub and returns only the rows where src is not B.

Output:

  src  idx     ts
1   A    1   80.0
2   A    2  180.0
3   A    3  280.0
5   A    4   80.0
6   A    5  180.0
7   A    6  280.0
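
The float values in the output above come from the NaN that Series.where inserts for the A rows, which upcasts the column to float64. If integer offsets are preferred, a minimal sketch (assuming every A row has a preceding B row, so no NaN is left after the forward-fill; b_ts and out are illustrative names) is:

```python
import pandas as pd

df = pd.DataFrame({
    'src': ['B', 'A', 'A', 'A', 'B', 'A', 'A', 'A', 'B'],
    'idx': [1, 1, 2, 3, 2, 4, 5, 6, 3],
    'ts': [20, 100, 200, 300, 320, 400, 500, 600, 620],
})

m = df['src'].eq('B')

# B timestamps, NaN elsewhere, then carried forward onto the A rows.
b_ts = df['ts'].where(m).ffill()

# Subtract, keep the A rows, and cast back to integers.
# Assumption: no NaN remains, i.e. the data starts with a B row.
out = df.assign(ts=df['ts'].sub(b_ts))[~m].astype({'ts': 'int64'})
print(out)
```

The cast would raise if any A row preceded the first B, since its offset would still be NaN.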
