I have two DataFrames, data1 and data2, with 3-level multiindices. The first two levels are floats, and correspond to spatial coordinates (say longitude and latitude). The third level, time, is based on pandas Period objects.
I want to select the rows of data1 whose index intersects with data2's, but allowing some tolerance in the spatial coordinates (i.e. they don't have to match up exactly). I have come up with a solution, which seems to work but is terribly slow. I do it in two steps:
- Compare each unique pair of spatial coordinates in data1's and data2's indices to find which spatial coordinates of data1 and data2 are equivalent to within the given tolerance. This is the most time-consuming step because of the double loop in which the coordinates are compared (see the sketch after this list).
- Create a "dictionary
DataFrame" of equivalence between overlapping coordinates based on data1's index. By leveragingset_indexandreset_indexI can switch the spatial part of the index fromdata1todata2coordinates, calculate the intersection withdata2and switch back todata1coordinates. This is also quite slow because of the loop with the.locmethod, see code below.
My description may be a bit confusing, but the "Output" shown below should make clear what I am trying to do.
I did some rudimentary profiling: step 1 takes ~80% of the total time. I also compared against a case in which I could simply use the MultiIndex.intersection() method, and my solution is roughly 1000 times slower than that.
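For reference, the exact-match baseline I timed against is just a plain index intersection, roughly like this (a sketch, assuming the spatial coordinates of the two frames already match exactly so no tolerance is needed):

# Exact-match case: no tolerance, so MultiIndex.intersection() is enough
idx = data1.index.intersection(data2.index)
data1_overlap_exact = data1.loc[idx]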
Is there a faster way to do this, or a way to improve the performance of my solution?
Here's the code:
import pandas as pd
import numpy as np

# Create toy datasets
def create_toy_datasets():
    periods1 = [pd.Period(year=x, freq='Y') for x in [1989, 1990, 1989, 1990, 1989, 1990, 1991]]
    index = pd.MultiIndex.from_arrays(
        [[1., 1., 1., 1., 2., 2., 2.], [1., 1., 2., 2., 3., 3., 3.], periods1],
        names=['lon', 'lat', 'time'])
    data1 = pd.DataFrame(np.arange(len(index)), index=index, columns=['a'])
    data1['b'] = data1['a'] * 2
    periods2 = [pd.Period(year=x, freq='Y') for x in [1990, 1990, 1991, 1992, 1984]]
    data2 = pd.DataFrame(
        [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
        index=pd.MultiIndex.from_arrays(
            [[1.1, 2.1, 2.1, 2.1, 3.1], [2.1, 3.1, 3.1, 3.1, -3.1], periods2],
            names=['lon', 'lat', 'time']),
        columns=['a', 'b'])
    return data1, data2

def return_overlapping_df(data1, data2, atol, rtol):
    # Work on the unique (lon, lat) pairs only
    data1_unstacked = data1.unstack('time').sort_index()
    data2_unstacked = data2.unstack('time').sort_index()
    index1_unstacked = data1_unstacked.index
    index2_unstacked = data2_unstacked.index
    # Step 1: find overlapping (lon, lat) pairs by pairwise comparison
    overlapping1 = []
    overlapping2 = []
    for i1 in index1_unstacked:
        for i2 in index2_unstacked:
            if np.allclose(i1, i2, atol=atol, rtol=rtol):
                overlapping1.append(i1)
                overlapping2.append(i2)
                break
    # Step 2: "dictionary DataFrame" mapping data1 coordinates to data2 coordinates
    cdict = data1.reset_index(['time']).loc[overlapping1, ['time']]
    for i1, i2 in zip(overlapping1, overlapping2):
        cdict.loc[i1, ['lon_other', 'lat_other']] = i2
    # Switch the spatial part of the index to data2 coordinates, intersect, then switch back
    cdict = cdict.reset_index(['lon', 'lat']).set_index(['lon_other', 'lat_other', 'time'])
    index_overlap = cdict.index.intersection(data2.index)
    cdict = cdict.loc[index_overlap]
    index1 = cdict.reset_index('time').set_index(['lon', 'lat', 'time']).index
    data1_overlap = data1.loc[index1]
    return data1_overlap

if __name__ == '__main__':
    # Create toy datasets with slightly offset coordinates
    data1, data2 = create_toy_datasets()
    # Calculate 'overlapping' frame
    atol = 0.2
    rtol = 0.
    df_overlap = return_overlapping_df(data1, data2, atol=atol, rtol=rtol)
    print("data1:")
    print(data1)
    print("data2:")
    print(data2)
    print(f"data1's rows that overlap with data2's to a tolerance of atol={atol}, rtol={rtol}:")
    print(df_overlap)
Output
data1:
               a   b
lon lat time
1.0 1.0 1989   0   0
        1990   1   2
    2.0 1989   2   4
        1990   3   6
2.0 3.0 1989   4   8
        1990   5  10
        1991   6  12
data2:
                a   b
lon lat  time
1.1  2.1 1990   1   2
2.1  3.1 1990   3   4
         1991   5   6
         1992   7   8
3.1 -3.1 1984   9  10
data1's rows that overlap with data2's to a tolerance of atol=0.2, rtol=0:
               a   b
lon lat time
1.0 2.0 1990   3   6
2.0 3.0 1990   5  10
        1991   6  12