
I have a huge dataset of about 1 million+ features, many of which overlap. I would like to remove every overlap, but keep the top-most polygon in each instance of overlap.

This is for a piece of automated software I am building to avoid using ArcGIS, so I'm trying open-source Python, which will be packaged as an .exe file and sent to a customer.

I tried using GeoPandas, but its overlay method does not do the same thing as running a self-intersect tool and removing duplicates in ArcGIS.

Does anyone have any suggestions?
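For reference, a minimal, self-contained sketch of the kind of overlay call I ran, with two toy squares standing in for my shapefile (the geometries and `id` column are illustrative only, not my real data):

```python
import geopandas as gpd
from shapely.geometry import box

# Two overlapping squares as a toy stand-in for my real shapefile
df = gpd.GeoDataFrame({"id": [1, 2]},
                      geometry=[box(0, 0, 2, 2), box(1, 1, 3, 3)])

# Self-intersection: every intersecting pair of features produces a row,
# so each polygon also intersects itself and comes back whole
pieces = gpd.overlay(df, df, how="intersection", keep_geom_type=True)
print(len(pieces))  # 4 rows: 2 self-pairs plus the overlap counted twice
```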

  • Add whatever code you have or your question will probably get closed. Do you want to remove only the intersecting parts of the lower polygon, or the entire polygon? Commented Jul 27 at 7:31
  • @Bera I don't have ANY relevant code to share, because I literally took my shapefile and ran the overlay method with different options from the docs. As for your question - I do want to remove only the overlapping parts of the lower polygon. Commented Jul 27 at 7:34
  • Your Question founders on the definition of "lower polygon" (or the lack thereof) Commented Jul 27 at 14:13
  • @Vince Should I have explained my whole feature layer data schema? It doesn't matter, because the "lower polygons" can change if I sort the layer. What matters is the mechanism by which polygons are compared, so that only the top (and therefore relevant) part of each intersection is kept. I have searched for a solution for days and tried many different combinations of gpd methods, and only now has Bera offered a viable working solution to a problem that many have been trying to solve outside the ESRI suite. Commented Jul 28 at 4:42

1 Answer

Union the polygon boundaries into one multiline, polygonize it, and intersect the result with the original df. This creates duplicate geometries wherever polygons overlap. Then drop duplicates on centroid coordinates and area, keeping the last (= top-most) one.
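The first two steps, sketched on two toy overlapping squares (not the asker's data): unioning the boundaries nodes them at every crossing, and polygonize then rebuilds every face of that arrangement.

```python
from shapely.geometry import box
from shapely.ops import unary_union, polygonize

# Two overlapping squares that share a 1x1 overlap
squares = [box(0, 0, 2, 2), box(1, 1, 3, 3)]

# unary_union nodes the boundary lines at every crossing point...
lines = unary_union([s.boundary for s in squares])
# ...and polygonize rebuilds every face of the resulting arrangement
faces = list(polygonize(lines))
print(len(faces))  # 3 faces: two L-shaped remainders plus the shared overlap
```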


import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.ops import unary_union, polygonize

#Read a shapefile into a dataframe and plot
file = r"C:\Users\bera\Desktop\gistest\Overlapping_squares_1k.shp"
df = gpd.read_file(file)
df["id"] = range(df.shape[0]) #Create an id column
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10,10))

df.plot(ax=ax[0][0], alpha=0.3, color="red")
ax[0][0].set_title("Polygons with overlaps")

#Union the polygon boundaries into one multiline
multiline = unary_union(df.boundary)
gpd.GeoSeries(data=multiline, crs=df.crs).plot(ax=ax[0][1])
ax[0][1].set_title("The unioned boundaries")

#Polygonize the unioned boundaries    
poly_df = gpd.GeoDataFrame(geometry=list(polygonize(multiline)), crs=df.crs)
poly_df.plot(ax=ax[1][0], color="red", alpha=0.3)
ax[1][0].set_title("The polygonized boundaries")

#Intersect the polygonized multiline boundary with the original df
inter = gpd.overlay(df1=df, df2=poly_df, how="intersection", 
                    keep_geom_type=True, make_valid=True)

inter.plot(ax=ax[1][1], alpha=0.3, color="green")
ax[1][1].set_title("Original df and polygonized df inters.")

#Create columns of rounded centroid coordinates and area
# to use for detecting duplicates
inter["cent_x"] = round(inter.centroid.x)
inter["cent_y"] = round(inter.centroid.y)
inter["area_m2"] = inter.area.round()

#Sort ascending so larger ids (polygons on top) come last
inter = inter.sort_values(by="id")
#Drop duplicates on centroid coords and area, keep last/topmost
inter = inter.drop_duplicates(subset=["cent_x", "cent_y", "area_m2"], keep="last")

inter = inter.dissolve(by="id") #Dissolve split polygons back to original shapes
inter.to_file(r"C:\Users\bera\Desktop\gistest\confusion\no_overlaps.gpkg")
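A quick sanity check on the result (a sketch with toy squares; substitute the geometries read back from the saved layer): if no overlaps remain, the summed feature areas equal the area of their unary union.

```python
from shapely.geometry import box
from shapely.ops import unary_union

# Toy stand-in for the output layer - replace with the saved geometries
polys = [box(0, 0, 2, 2), box(1, 1, 3, 3)]

total = sum(p.area for p in polys)  # overlaps counted once per polygon
flat = unary_union(polys).area      # overlaps counted exactly once
print(total - flat)  # 0.0 would mean no overlaps remain; here it is 1.0
```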


  • Hey, so I've tried this, and it did not work. I fed it a layer with 35k features in which I know there are overlaps, and it output a .gpkg file with 6 features, none of which are even an overlap. Commented Jul 27 at 11:55
  • I want to clarify that I went ahead and ran the script again on the same dataset, since I had a hunch something was wrong from the get-go, and it does in fact seem to have worked now. I will run this again on a non-filtered dataset to see how it handles ~1m features. Will update, but for a small dataset and a short QA it does work. Thank you. Commented Jul 27 at 12:48
  • So I checked it on my dataset of ~1.1m features, and the output was great. I sampled it and all my checks came out good, so this is definitely a good solution! Thank you very much! Commented Jul 31 at 7:19
