Align Python arrays with missing data

Question

I have some time series data, say:

# [ [time] [ data ] ]
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4]['f','g','h']]

and I would like an output with some filler value, let's say None for now:

a_new = [[0,1,2,3,4],['a','b','c','d','e']]
b_new = [[0,1,2,3,4],['f',None,None,'g','h']]

Is there a built-in function in Python/NumPy to do this (or something like this)? Basically, I would like all of my time vectors to be the same size so I can calculate statistics (np.mean) and deal with the missing data accordingly.

Are you wedded to using numpy? It sounds a lot like you're asking for the indexing behaviour of pandas DataFrames.
– DSM, Commented Sep 6, 2016 at 23:18

user94559 · Accepted Answer · 2016-09-06 23:20:35Z

4

How about this? (I'm assuming your definition of b was a typo, and I'm also assuming you know in advance how many entries you want.)

>>> b = [[0,3,4], ['f','g','h']]
>>> b_new = [list(range(5)), [None] * 5]
>>> for index, value in zip(*b): b_new[1][index] = value
>>> b_new
[[0, 1, 2, 3, 4], ['f', None, None, 'g', 'h']]

answered Sep 6, 2016 at 23:20

user94559


4 Comments

nven Over a year ago

This would work if the step size between time points was consistent, but this will (almost) never be the case with the data I'm working with.

user94559 Over a year ago

Can you describe what you mean by "the step size between time points"? Maybe give an example of data for which this answer doesn't work?

user94559 Over a year ago

Oh, maybe you mean that you won't have consecutive integer values for indexes on the left, so you really want to do something like take a list of these, find the union of all the time values, and then fill in None in each individual series where a time value is absent.

nven Over a year ago

Yep, exactly. So the sample data would be: a = [[0,5,20,30],['a','b','c','d']] and b = [[0,2.5,5,10],['e','f','g','h']]
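The union-of-times idea described in these comments could be sketched in plain Python roughly as follows. This is only an illustration, not part of the original answer: align_series is a hypothetical helper name, and it assumes each series is a [times, values] pair with no repeated time values.

```python
def align_series(*series, fill=None):
    """Align [times, values] pairs onto the sorted union of all time values.

    Hypothetical helper for illustration; assumes no repeated times per series.
    """
    # Collect every time value that appears in any series, sorted ascending.
    all_times = sorted(set().union(*(times for times, _ in series)))

    aligned = []
    for times, values in series:
        # Map each known time to its value, then fill the gaps.
        lookup = dict(zip(times, values))
        aligned.append([list(all_times), [lookup.get(t, fill) for t in all_times]])
    return aligned


a = [[0, 5, 20, 30], ['a', 'b', 'c', 'd']]
b = [[0, 2.5, 5, 10], ['e', 'f', 'g', 'h']]

a_new, b_new = align_series(a, b)
print(a_new)
print(b_new)
```

Both results share the same time vector, so the value lists have equal length and can be compared position by position.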

benten · Accepted Answer · 2016-09-07 02:41:49Z

1

smarx has a fine answer, but pandas was made exactly for things like this.

import pandas as pd

# your data
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4],['f','g','h']]

# make an empty DataFrame (can do this faster but I'm going slow so you see how it works)
df_a = pd.DataFrame()
df_a['time'] = a[0]
df_a['A'] = a[1]
df_a.set_index('time',inplace=True)

# same for b (a faster way this time)
df_b = pd.DataFrame({'B':b[1]}, index=b[0])

# now merge the two DataFrames together (the NaNs are in the right place)
df = pd.merge(df_a, df_b, left_index=True, right_index=True, how='outer')

In [28]: df
Out[28]: 
     A    B
0    a    f
1    b  NaN
2    c  NaN
3    d    g
4    e    h

Now the fun is just beginning. Within a DataFrame you can

compute all of your summary statistics (e.g. df.mean()),
make plots (e.g. df.plot()),
slice/dice your data basically however you want (e.g. df.groupby()),
fill in or drop missing data using a specified method (e.g. df.fillna()),
take quarterly or monthly averages (e.g. df.resample()), and a lot more.
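As a rough sketch of how this carries over to the irregularly spaced sample data from the comments on the first answer, the two series can be aligned on the union of their time values and the gaps then filled or dropped. The names s_a, s_b, filled and complete, and the 'missing' placeholder, are chosen only for this illustration.

```python
import pandas as pd

# Irregularly spaced sample data (from the comments on the first answer)
a = [[0, 5, 20, 30], ['a', 'b', 'c', 'd']]
b = [[0, 2.5, 5, 10], ['e', 'f', 'g', 'h']]

# One Series per data set, indexed by time (float so both indexes match)
s_a = pd.Series(a[1], index=pd.Index(a[0], dtype=float), name='A')
s_b = pd.Series(b[1], index=pd.Index(b[0], dtype=float), name='B')

# Concatenating along columns aligns on the union of both indexes;
# sort_index() makes the time ordering explicit.
df = pd.concat([s_a, s_b], axis=1).sort_index()

# Two ways of dealing with the gaps
filled = df.fillna('missing')   # replace missing entries with a placeholder
complete = df.dropna()          # keep only times present in both series

print(df)
```

Which option fits depends on the statistic being computed; dropping rows keeps only the times both series share, while filling keeps every time point.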

If you're just getting started (sorry for the infomercial if you aren't), I recommend reading "10 minutes to pandas" for a quick overview.

edited Sep 7, 2016 at 2:41

answered Sep 7, 2016 at 0:51

benten


1 Comment

nven Over a year ago

Ah, I knew there would be a better way to do this already. Thanks!

Divakar · Accepted Answer · 2016-09-07 07:24:20Z

0

Here's a vectorized NumPythonic approach -

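import numpy as np

def align_arrays(A):
    time, data = A

    # Dense integer time grid from 0 up to the largest time value
    time_new = np.arange(np.max(time)+1)

    # Start with all None, then fill the slots whose time is present
    # (np.isin is the current replacement for np.in1d)
    data_new = np.full(time_new.size, None, dtype=object)
    data_new[np.isin(time_new,time)] = data

    return time_new, data_new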

Sample runs -

In [113]: a = [[0,1,2,3,4],['a','b','c','d','e']]

In [114]: align_arrays(a)
Out[114]: (array([0, 1, 2, 3, 4]), array(['a', 'b', 'c', 'd', 'e'], dtype=object))

In [115]: b = [[0,3,4],['f','g','h']]

In [116]: align_arrays(b)
Out[116]: (array([0, 1, 2, 3, 4]), array(['f', None, None, 'g', 'h'], dtype=object))

answered Sep 7, 2016 at 7:24

Divakar


1 Comment

nven Over a year ago

Thanks for the answer. The problem is that although the sample data is linearly incremented, the actual data is not.
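For time values that are not evenly spaced, one possible variant is to build the grid from the union of all time vectors instead of np.arange. The following is a minimal sketch, not the answer's original method: align_to_union is a hypothetical helper name, and it assumes no time value is repeated within a single series.

```python
import numpy as np


def align_to_union(*series):
    """Align (time, data) pairs onto the sorted union of all time values.

    Hypothetical helper for illustration; assumes unique times per series.
    """
    times = [np.asarray(t, dtype=float) for t, _ in series]

    # np.unique returns the sorted, de-duplicated union of all time values
    all_times = np.unique(np.concatenate(times))

    aligned = []
    for t, (_, data) in zip(times, series):
        out = np.full(all_times.size, None, dtype=object)
        # Every value in t is present in all_times, so searchsorted
        # returns the exact position of each time on the common grid.
        out[np.searchsorted(all_times, t)] = data
        aligned.append(out)
    return all_times, aligned


a = [[0, 5, 20, 30], ['a', 'b', 'c', 'd']]
b = [[0, 2.5, 5, 10], ['e', 'f', 'g', 'h']]

all_times, (a_new, b_new) = align_to_union(a, b)
print(all_times)
print(a_new)
print(b_new)
```

Because the grid comes from the data itself, no step size has to be known in advance.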
