4

I have some time series data, say:

# [ [time] [ data ] ]
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4]['f','g','h']]

and I would like an output with some filler value, let's say None for now:

a_new = [[0,1,2,3,4],['a','b','c','d','e']]
b_new = [[0,1,2,3,4],['f',None,None,'g','h']]

Is there a built-in function in Python/NumPy to do this (or something like it)? Basically, I would like all of my time vectors to be the same size so I can calculate statistics (np.mean) and deal with the missing data accordingly.

1
  • 3
    Are you wedded to using numpy? It sounds a lot like you're asking for the indexing behaviour of pandas DataFrames. Commented Sep 6, 2016 at 23:18

3 Answers

4

How about this? (I'm assuming your definition of b was a typo, and I'm also assuming you know in advance how many entries you want.)

>>> b = [[0,3,4], ['f','g','h']]
>>> b_new = [list(range(5)), [None] * 5]
>>> for index, value in zip(*b): b_new[1][index] = value
>>> b_new
[[0, 1, 2, 3, 4], ['f', None, None, 'g', 'h']]

4 Comments

This would work if the step size between time points were consistent, but that will (almost) never be the case with the data I'm working with.
Can you describe what you mean by "the step size between time points"? Maybe give an example of data for which this answer doesn't work?
Oh, maybe you mean that you won't have consecutive integer values for indexes on the left, so you really want to do something like take a list of these, find the union of all the time values, and then fill in None in each individual series where a time value is absent.
Yep, exactly. So the sample data would be: a = [[0,5,20,30],['a','b','c','d']] b = [[0,2.5,5,10],['e','f','g','h']]
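
The union-of-time-values idea from the comments above could be sketched in plain Python roughly as follows. The helper name `align_series` and its `fill` parameter are made up for illustration and are not part of the original answer; the sketch also assumes time values can be compared for exact equality.

```python
def align_series(*series, fill=None):
    """Align [times, values] pairs on the sorted union of all time values."""
    # Union of every time value seen in any series, in ascending order.
    all_times = sorted(set().union(*(times for times, _ in series)))
    aligned = []
    for times, values in series:
        lookup = dict(zip(times, values))  # time -> value for this series
        # Use the filler wherever this series has no sample at time t.
        aligned.append([list(all_times), [lookup.get(t, fill) for t in all_times]])
    return aligned


# Sample data from the comment above.
a = [[0, 5, 20, 30], ['a', 'b', 'c', 'd']]
b = [[0, 2.5, 5, 10], ['e', 'f', 'g', 'h']]
a_new, b_new = align_series(a, b)
print(a_new)
print(b_new)
```

Every returned series shares the same time vector, so the value lists have equal length and line up position by position.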
1

smarx has a fine answer, but pandas was made exactly for things like this.

import pandas as pd

# your data
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4],['f','g','h']]

# make an empty DataFrame (can do this faster but I'm going slow so you see how it works)
df_a = pd.DataFrame()
df_a['time'] = a[0]
df_a['A'] = a[1]
df_a.set_index('time', inplace=True)

# same for b (a faster way this time)
df_b = pd.DataFrame({'B': b[1]}, index=b[0])

# now merge the two DataFrames on their indexes; how='outer' keeps every time value,
# so the NaNs end up in the right place
df = pd.merge(df_a, df_b, left_index=True, right_index=True, how='outer')

In [28]: df
Out[28]: 
     A    B
0    a    f
1    b  NaN
2    c  NaN
3    d    g
4    e    h

Now the fun is just beginning. Within a DataFrame you can

  • compute all of your summary statistics (e.g. df.mean())

  • make plots (e.g. df.plot())

  • slice/dice your data basically however you want (e.g. df.groupby())

  • fill in or drop missing data using a specified method (e.g. df.fillna())

  • take quarterly or monthly averages (e.g. df.resample()) and a lot more.
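
As a rough illustration of the list above, here is a minimal sketch that aligns two series with irregular, non-matching time values and then handles the gaps. The numeric readings are invented for the example (the question's data are strings, which cannot be averaged), and `pd.concat` is used here as one alternative to the `pd.merge` call shown earlier.

```python
import pandas as pd

# Hypothetical numeric readings on irregular, non-matching time values.
s_a = pd.Series([1.0, 2.0, 3.0, 4.0], index=[0, 5, 20, 30], name='A')
s_b = pd.Series([10.0, 20.0, 30.0, 40.0], index=[0, 2.5, 5, 10], name='B')

# Concatenating along columns aligns both series on the union of their indexes.
df = pd.concat([s_a, s_b], axis=1).sort_index()

col_means = df.mean()    # NaN entries are skipped by default (skipna=True)
filled = df.ffill()      # carry the last known value forward
complete = df.dropna()   # keep only time values present in both series

print(df)
print(col_means)
```

Which of these is appropriate depends on what a missing sample means in your data: forward-filling assumes the last reading still holds, while dropping rows discards any time value not shared by every series.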

If you're just getting started (sorry for the infomercial if you aren't), I recommend reading 10 minutes to pandas for a quick overview.

1 Comment

Ah, I knew there would be a better way to do this already. Thanks!
0

Here's a vectorized NumPythonic approach -

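import numpy as np

def align_arrays(A):
    time, data = A

    # dense integer time axis from 0 up to the largest observed time
    time_new = np.arange(np.max(time) + 1)

    # start with None everywhere, then fill the slots whose time was observed
    data_new = np.full(time_new.size, None, dtype=object)
    data_new[np.isin(time_new, time)] = data  # np.isin supersedes the older np.in1d

    return time_new, data_new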

Sample runs -

In [113]: a = [[0,1,2,3,4],['a','b','c','d','e']]

In [114]: align_arrays(a)
Out[114]: (array([0, 1, 2, 3, 4]), array(['a', 'b', 'c', 'd', 'e'], dtype=object))

In [115]: b = [[0,3,4],['f','g','h']]

In [116]: align_arrays(b)
Out[116]: (array([0, 1, 2, 3, 4]),array(['f', None, None, 'g', 'h'],dtype=object))

1 Comment

Thanks for the answer. The problem is that although the sample data is linearly incremented, the actual data is not.
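
For time values that are not evenly spaced, one possible NumPy variation is to build the time axis from the union of all observed times rather than from `np.arange`. This is only a sketch: `align_on_union` and its `fill` parameter are hypothetical names, not part of the answer above, and it assumes time values match exactly when converted to floats.

```python
import numpy as np

def align_on_union(*series, fill=None):
    """Align (time, data) pairs on the sorted union of all time values."""
    # Sorted, de-duplicated union of every time vector.
    time_new = np.unique(
        np.concatenate([np.asarray(t, dtype=float) for t, _ in series])
    )
    aligned = []
    for time, data in series:
        data_new = np.full(time_new.size, fill, dtype=object)
        # Every original time is present in time_new, so searchsorted
        # returns the exact slot for each of them.
        slots = np.searchsorted(time_new, np.asarray(time, dtype=float))
        data_new[slots] = data
        aligned.append(data_new)
    return time_new, aligned


a = [[0, 5, 20, 30], ['a', 'b', 'c', 'd']]
b = [[0, 2.5, 5, 10], ['e', 'f', 'g', 'h']]
time_new, (a_new, b_new) = align_on_union(a, b)
print(time_new)
print(a_new)
print(b_new)
```

If the time values come from floating-point arithmetic, exact matching may be too strict, and rounding the times before alignment might be needed.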
