Recently for a project, I was able to play around with some of the cool geopy packages out there and was thinking instead of just shelving the code after the project, that I'd share out some geo-snippets. Well...this post is the start of that.
So for the first share out, I want to run through performing spatial queries in python and the benefits of using a spatial index and I wanted to do that via two examples.
virtualenv env . env/bin/activate pip install shapely fiona rtree
If you're having issues installing any of these packages here's links to their respective install docs.
- shapely: https://github.com/Toblerity/Shapely#installing-shapely
- fiona: https://github.com/Toblerity/Fiona#installation
- rtree: http://toblerity.org/rtree/install.html
Alright now that everything is setup, let's run through the examples to answer the question of
"how many addresses are within each garbage zone".
The first example, uses a "brute force" method for finding our answer and the second we'll be a little smarter about it and take advantage of a spatial index. But before we start, let me quickly introduce our two volunteer datasets...
AddressPoint Shapefile source:
* Geometry Type: Point * Number of features: 34,084 * Spatial Reference System: [UTM 10]
* Geometry Type: Polygon * Number of features: 8 * Spatial Reference System: [UTM 10]
1: Brute Force Spatial Query
Honestly this example is only here to show how important it is to use a spatial index when performing spatial queries. Simply put, this method will iterate over every garbage zone and every address point to see which address points are contained in each garbage zone polygon.
def query_no_index(layer1, layer2, group_field): result = defaultdict(list) with fiona.open(layer2) as datasrc2: with fiona.open(layer1) as datasrc1: for ft1 in datasrc1: if ft1['geometry'] is None: continue shp1 = shape(ft1['geometry']) for ft2 in datasrc2: shp2 = shape(ft2['geometry']) if shp1.contains(shp2): result[ft1['properties'][group_field]].append(int(ft2['id'])) return result
2: Index Based Spatial Query
Similar to method 1, but we'll create and use a RTree index. This index allows us to run an initial "simplified and fast" check to determine, in our example, which addresses are near a particular garbage zone. Since the result of the initial check is based on bounding boxes of the address and garbage zone geometries, we'll still need to do the
contains check on the actual geometries but now the number of checks that we potentially need run is vastly less.
# Create in-memory or file-based r-tree index def generate_index(records, index_path=None): prop = rtree.index.Property() if index_path is not None: prop.storage = rtree.index.RT_Disk prop.overwrite = index_path sp_index = rtree.index.Index(index_path, properties=prop) for n, ft in enumerate(records): if ft['geometry'] is not None: sp_index.insert(n, shape(ft['geometry']).bounds) return sp_index def query_with_index(layer1, layer2, group_field): result = defaultdict(list) with fiona.open(layer2) as datasrc2: fname, _ = os.path.splitext(layer2) # Load index from file if os.path.exists(fname + '.dat') and os.path.exists(fname + '.idx'): layer2_index = rtree.index.Index(fname) # Need to create a new index. else: layer2_index = generate_index(datasrc2, fname) with fiona.open(layer1) as datasrc1: for ft1 in datasrc1: if ft1['geometry'] is None: continue shp1 = shape(ft1['geometry']) # --> Do an initial bounds query with the spatial index <-- # Returns back a generator of addresses ids whose bounds intersects # the garbage zone bounds for fid in layer2_index.intersection(shp1.bounds): # Now check if the address point is contained by # the garbage zone shp2 = shape(datasrc2[fid]['geometry']) if shp1.contains(shp2): result[ft1['properties'][group_field]].append(fid) return result
s1 = 'GIS_GarbageZone.shp' s2 = 'AddressPoint.shp' def print_results(r): for k in sorted(r.iterkeys()): print '%s: %s' % (k, len(r[k])) # No index used print('# 1. No index') print_results(query_no_index(s1, s2, 'NAME')) # Index not generated print('# 2a. Index: Generate') print_results(query_with_index(s1, s2, 'NAME')) # Index loaded from file print('# 2b. Index: From file') print_results(query_with_index(s1, s2, 'NAME'))
# 1. No index ZONE 1: 5968 ZONE 2: 5345 ZONE 3: 7537 ZONE 4: 8615 ZONE 5: 6616 # 2a. Index: Generate ZONE 1: 5968 ZONE 2: 5345 ZONE 3: 7537 ZONE 4: 8615 ZONE 5: 6616 # 2b. Index: From file ZONE 1: 5968 ZONE 2: 5345 ZONE 3: 7537 ZONE 4: 8615 ZONE 5: 6616
Well the outputs are consistent which is a good sign. As far as performance goes...
- Method 1: 40.26 seconds
- Method 2a: 30.13 seconds (~25% faster than Method 1)
- Method 2b: 24.38 seconds (~39% faster than Method 1 and 19% faster than Method 2a)
Comparing number of address checks using no index (brute force) v.s. with an index
|Garbage Zone||No Index||With Index||↓ in Checks|
For clarity the example above had been simplified, the code below is the generic version for doing spatial queries and includes options for contains, crosses, disjoint, equals, intersects, touches and within.