I have two DataFrames, data1 and data2, with 3-level MultiIndexes. The first two levels are floats corresponding to spatial coordinates (say longitude and latitude). The third level, time, is based on pandas Period objects.

I want to select the rows of data1 whose index intersects with data2's, but allowing some tolerance in the spatial coordinates (i.e. they don't have to match up exactly). I have come up with a solution, which seems to work but is terribly slow. I do it in two steps:

  1. Compare each unique pair of spatial coordinates on data1's and data2's indices to find which spatial coordinates of data1 and data2 are equivalent to within the given tolerance. This is the most time-consuming step because of the double loop in which the coordinates are compared.
  2. Create a "dictionary DataFrame" of equivalence between overlapping coordinates based on data1's index. By leveraging set_index and reset_index I can switch the spatial part of the index from data1's to data2's coordinates, compute the intersection with data2's index, and switch back to data1's coordinates. This is also quite slow because of the loop with the .loc method; see the code below.

My description may be a bit confusing but the "Output" shown below should make it clear what I am trying to do.

I did some rudimentary profiling: step 1 takes ~80% of the total time. I also compared against a case where I could simply use the MultiIndex.intersection() method, and my approach is roughly 1000 times slower than that.
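For reference, the exact-match comparison mentioned above boils down to something like this (only valid when the coordinates line up exactly):

# exact-match baseline: intersect the two MultiIndexes directly
exact_overlap = data1.loc[data1.index.intersection(data2.index)]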

I think there must be a faster way to do this. Is there, or can the performance of my solution be improved?

Here's the code:

import pandas as pd
import numpy as np
import sys

# Create toy datasets
def create_toy_datasets():
    periods1 = [pd.Period(year=x, freq='Y') for x in [1989,1990,1989,1990,1989,1990,1991]]
    index = pd.MultiIndex.from_arrays([[1.,1.,1.,1.,2.,2.,2.],[1.,1.,2.,2.,3.,3.,3.],periods1], names=['lon', 'lat', 'time'])
    data1 = pd.DataFrame(np.arange(len(index)), index=index, columns=['a'])
    data1['b'] = data1['a']*2
    periods2 = [pd.Period(year=x, freq='Y') for x in [1990,1990,1991,1992,1984]]
    data2 = pd.DataFrame([[1,2],[3,4],[5,6],[7,8],[9,10]], index=pd.MultiIndex.from_arrays([[1.1,2.1,2.1,2.1,3.1],[2.1,3.1,3.1,3.1,-3.1],periods2], names=['lon', 'lat', 'time']), columns=['a', 'b'])
    return data1, data2


def return_overlapping_df(data1, data2, atol, rtol):
    data1_unstacked = data1.unstack('time').sort_index()
    data2_unstacked = data2.unstack('time').sort_index()
    index1_unstacked = data1_unstacked.index
    index2_unstacked = data2_unstacked.index

    # Step 1: Find overlapping coordinates
    overlapping1 = []
    overlapping2 = []
    for i1 in index1_unstacked:
        for i2 in index2_unstacked:
            if np.allclose(i1,i2, atol=atol, rtol=rtol):
                overlapping1.append(i1)
                overlapping2.append(i2)
                break
        
    # Step2: Dictionary of coordinates
    cdict = data1.reset_index(['time']).loc[overlapping1,['time']]
    for i1, i2 in zip(overlapping1, overlapping2):
        cdict.loc[i1,['lon_other', 'lat_other']] = i2

    cdict = cdict.reset_index(['lon', 'lat']).set_index(['lon_other', 'lat_other', 'time'])
    index_overlap = cdict.index.intersection(data2.index)
    cdict = cdict.loc[index_overlap]
    index1 = cdict.reset_index('time').set_index(['lon', 'lat', 'time']).index
    data1_overlap = data1.loc[index1]

    return data1_overlap


if __name__ == '__main__':
    # Create toy datasets with slightly offset coordinates
    data1, data2 = create_toy_datasets()
    # Calculate 'overlapping' frame
    atol=0.2
    rtol=0.
    df_overlap = return_overlapping_df(data1, data2, atol=atol, rtol=rtol)
    print("data1:")
    print(data1)
    print("data2:")
    print(data2)
    print(f"data1's rows that overlap with data2's to a tolerance of atol={atol}, rtol={rtol}:")
    print(df_overlap)

Output

data1:
              a   b
lon lat time       
1.0 1.0 1989  0   0
        1990  1   2
    2.0 1989  2   4
        1990  3   6
2.0 3.0 1989  4   8
        1990  5  10
        1991  6  12
data2:
               a   b
lon lat  time       
1.1  2.1 1990  1   2
2.1  3.1 1990  3   4
         1991  5   6
         1992  7   8
3.1 -3.1 1984  9  10
data1's rows that overlap with data2's to a tolerance of atol=0.2, rtol=0:
              a   b
lon lat time       
1.0 2.0 1990  3   6
2.0 3.0 1990  5  10
        1991  6  12
2 Comments

  • I think at the core this is the same problem: stackoverflow.com/questions/79313103/… It's a join with multiple inequality conditions. In your case, a join on 2 columns with tolerances can be expressed as a join with 4 inequality conditions (sketched below). At the time there was no alternative to Numba with JIT compilation, as described in that thread, since there was no vectorized implementation in Polars (or Pandas). Pandas only offers merge_asof(), but it allows for only one inequality... Commented Oct 18 at 18:47
  • To be clear, if it's working and needs improvement, consider posting on codereview.stackexchange.com Commented Oct 19 at 0:53
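To make the first comment concrete, here is a minimal plain-pandas sketch of how the tolerance match unfolds into one equality condition plus four inequality conditions. It merges on the time level and then filters the pairs, so it materializes the full equality-join product; it is only an illustration of the condition structure, not a performance fix (rtol is 0 in the example, so only atol is handled here):

# |lon1 - lon2| <= atol and |lat1 - lat2| <= atol with time1 == time2
# is equivalent to an equality join on time plus four inequality filters
left = data1.reset_index()
right = data2.reset_index()
merged = left.merge(right, on='time', suffixes=('', '_other'))
overlap = merged[
    (merged['lon'] >= merged['lon_other'] - atol)
    & (merged['lon'] <= merged['lon_other'] + atol)
    & (merged['lat'] >= merged['lat_other'] - atol)
    & (merged['lat'] <= merged['lat_other'] + atol)
]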

1 Answer

As @usdn pointed out, this is a form of inequality join. One option is pyjanitor's conditional_join, which handles inequality joins efficiently (disclosure: I am a contributor to the pyjanitor library):

# pip install pyjanitor
import pandas as pd
import janitor

# reset the index;
# janitor.conditional_join works only on columns
# and on specific data types - numbers, dates

d1=data1.reset_index()
d1['timee']=d1.time.dt.year
d2=data2.reset_index()
d2['timee']=d2.time.dt.year

# Build conditions to compute the joins
cond1=[('lon','lon','>='),('lat','lat','>='),('timee','timee','==')]
cond2=[('lon','lon','<='),('lat','lat','<='),('timee','timee','==')]

# compute the individual joins
# (you may use numba for extra performance, especially if the equality
# join column has many duplicates, i.e. a small number of unique values
# relative to the DataFrame length; range joins also benefit more,
# but here each join uses a single inequality type, so the gain is modest)
A=d1.conditional_join(d2, *cond1, use_numba=False)
B=d1.conditional_join(d2, *cond2, use_numba=False)

# compute masks to determine which rows meet the tolerance definition:
mask=(A['left']
      .loc[:, ['lon','lat']]
      .sub(A['right'].loc[:, ['lon','lat']])
      .abs()
      .le(atol)
      .all(axis=1)
     )
A=A[mask]

mask=(B['left']
      .loc[:, ['lon','lat']]
      .sub(B['right'].loc[:, ['lon','lat']])
      .abs()
      .le(atol)
      .all(axis=1)
     )
B=B[mask]

# combine the dataframes to get the final form:

final=pd.concat([A,B],ignore_index=True,sort=False,copy=False)
final

  left                         right
   lon  lat  time  a   b timee   lon  lat  time  a  b timee
0  1.0  2.0  1990  3   6  1990   1.1  2.1  1990  1  2  1990
1  2.0  3.0  1990  5  10  1990   2.1  3.1  1990  3  4  1990
2  2.0  3.0  1991  6  12  1991   2.1  3.1  1991  5  6  1991


In [106]: final['left']
Out[106]:
   lon  lat  time  a   b  timee
0  1.0  2.0  1990  3   6   1990
1  2.0  3.0  1990  5  10   1990
2  2.0  3.0  1991  6  12   1991

In [107]: final['right']
Out[107]:
   lon  lat  time  a  b  timee
0  1.1  2.1  1990  1  2   1990
1  2.1  3.1  1990  3  4   1990
2  2.1  3.1  1991  5  6   1991

Under the hood, this approach uses pandas' merge functionality to get the join indices before pruning. With the numba approach, a binary search method is applied instead, which can give more performance depending on the data distribution. I'm keen to see how it works on your actual data and to help where possible.
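A rough sketch of the binary-search idea (this is not pyjanitor's actual implementation, just an illustration with np.searchsorted on the d1/d2 frames from above):

import numpy as np

# sort one side on lon, then for each row of the other side restrict the
# candidates to the [lon - atol, lon + atol] band before checking the
# remaining conditions
d2_sorted = d2.sort_values('lon').reset_index(drop=True)
lon2 = d2_sorted['lon'].to_numpy()

pairs = []  # (row position in d1, row position in d2_sorted) for each match
for i, (lon1, lat1, t1) in enumerate(
        d1[['lon', 'lat', 'timee']].itertuples(index=False)):
    lo = np.searchsorted(lon2, lon1 - atol, side='left')
    hi = np.searchsorted(lon2, lon1 + atol, side='right')
    band = d2_sorted.iloc[lo:hi]
    hits = band[(band['lat'].sub(lat1).abs() <= atol) & (band['timee'] == t1)]
    pairs.extend((i, j) for j in hits.index)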


2 Comments

Hi, thanks for your reply, this looks good. As @usdn said, though, there would be 4 conditions, right? The two you wrote plus [('lon','lon','>='),('lat','lat','<='),('timee','timee','==')] and [('lon','lon','<='),('lat','lat','>='),('timee','timee','==')]. I will try this in my code and let you know how it went.
So, I tested this with the actual small test datasets that I use, timing with IPython's %timeit magic. My original code took 1.6 s; your approach took ~500 ms without numba and ~200 ms with numba, so around 8x faster. I also realized while testing that the conditions need to be non-overlapping, otherwise there will be duplicate rows in the final DataFrame when the datasets contain rows with exactly equal coordinates (not the case in the example above).
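A minimal sketch of what the last comment describes, reusing the d1/d2 frames and the tolerance mask from the answer above: run all four condition sets, keep only pairs within atol, then concatenate and drop duplicate rows (or restructure the conditions so that the same pair cannot satisfy more than one set):

conds = [
    [('lon', 'lon', '>='), ('lat', 'lat', '>='), ('timee', 'timee', '==')],
    [('lon', 'lon', '<='), ('lat', 'lat', '<='), ('timee', 'timee', '==')],
    [('lon', 'lon', '>='), ('lat', 'lat', '<='), ('timee', 'timee', '==')],
    [('lon', 'lon', '<='), ('lat', 'lat', '>='), ('timee', 'timee', '==')],
]

def within_tol(joined, atol):
    # keep only row pairs whose lon/lat differences are within the tolerance
    diff = joined['left'][['lon', 'lat']].sub(joined['right'][['lon', 'lat']]).abs()
    return joined[diff.le(atol).all(axis=1)]

parts = [within_tol(d1.conditional_join(d2, *cond, use_numba=False), atol)
         for cond in conds]
final = pd.concat(parts, ignore_index=True, sort=False).drop_duplicates()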
