We are currently experiencing a geospatial revolution that is changing how we navigate from A to B and how we search for nearby locations such as a specific sight or a restaurant. Geospatial search technology provides exactly this kind of information. This article shows how commercial applications can make use of geospatial search, e.g. for real estate search (qualifying properties by their distance to the nearest kindergartens, schools, doctors, etc.), calculating building density in cities, and so on.
Let’s think for a moment: who offers geospatial information? Google search sometimes finds geospatial results, but searching for them specifically is only partially possible, even in Google Maps. And one may not want to become too dependent on a commercial provider like Google.
Like Wikipedia, the trailblazer for knowledge, the OpenStreetMap service has been around for a long time and by now offers extensive data of excellent quality. Everyone can register at OpenStreetMap and improve the maps using a simple editor. Many volunteers have already contributed data recorded with GPS trackers, so quite comprehensive information is already available. A simple visit to www.openstreetmap.org gives everyone an overview of the available information, for example by checking the map of their place of birth or residence. The maps are often so detailed that they even show the outlines of houses. Why this is important will be discussed later on.

OpenStreetMap view of Nuremberg, zoomed in on the "Bundesagentur für Arbeit", which clearly shows the outlines of the individual buildings. Many details in the data are not even rendered on the maps.
Please note that the data quality of OpenStreetMap is continuously improving: geospatial knowledge is not lost but added to and refined all the time. As the data is also used (under license) in commercial applications, there is a strong business interest in its ongoing maintenance. And the more people use OpenStreetMap, the greater the willingness to contribute, to report mistakes, or to add further information. Even more exciting than the maps themselves is the opportunity to download the source data from OpenStreetMap and use it for other purposes. In the following sections, we show how to prepare and use this data in different contexts.
Indexing Geospatial Data
The OpenStreetMap website offers the complete map data for download. However, only two kinds of downloads are available there, rectangular extracts or the whole earth, and both are problematic: the data for the whole earth is extremely large, while the data for a rectangle is often not completely consistent, because objects at the edges of the rectangle are cut off or duplicated. Luckily, there are also consistent extracts for whole countries such as Germany, for separate federal states, or even for administrative regions such as the German "Regierungsbezirke". These are, however, only created periodically and so may not reflect the current status of the maps. The download formats are XML (compressed) or Protocol Buffers; the latter is an extremely compact format created by Google, originally for message exchange, and parsers exist for many widely used languages. The XML format is, however, better suited for experimenting, as it is easier for humans to read and process.
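For a quick first experiment with the XML format, the StAX parser from the Java standard library is already sufficient. A minimal sketch that counts the nodes and ways in a downloaded extract (the file name germany.osm is an assumption):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class OsmCounter {
        public static void main(String[] args) throws Exception {
            // Streaming parser: the extract is far too large to load as a DOM tree.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("germany.osm"));
            long nodes = 0, ways = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("node".equals(name)) nodes++;
                    else if ("way".equals(name)) ways++;
                }
            }
            System.out.println(nodes + " nodes, " + ways + " ways");
            reader.close();
        }
    }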
Selection of a Solution for Indexing and Search

Once the source data of OpenStreetMap is locally available, it has to be stored in a way that allows it to be searched by distance ("geospatial search"). Some database systems, such as PostgreSQL, Oracle, or Microsoft SQL Server, already offer suitable functionality for indexing and search. As we are mainly interested in searching (and faceting), our solution uses software specifically optimized for this use case: Apache Solr. Since version 4.0, Solr contains extensive options for searching geospatial data. Solr can return the indexed OpenStreetMap data as XML source text; visualization as maps would require more infrastructure and is (so far) not part of our solution.
Creation of a Schema

Similar to relational databases, which rely on a definition of entities and relations, Solr requires a so-called "schema" for data indexing. Contrary to relational systems, however, the data in Solr is "flat", which, simply put, means that there is only one table. Solr can work with different data types; in our case, "string" is the most relevant one apart from the actual coordinates, as it allows the many attributes (see below) to be mapped. The support for geospatial coordinates is quite exciting; there are two different approaches:
LatLonType: This data type can store one (or many) points within the Solr index. A point is stored as two float values, so it can be retrieved very fast.
SpatialRecursivePrefixTreeFieldType: This data type is much more flexible and uses so-called geohashes to store the data. This requires more storage space but in return offers more search functionality, and it adds the option to store geometrical objects such as polygons or lines in addition to points.
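For illustration, a minimal declaration of the first variant, based on the example schema shipped with Solr 4 (the field name pos and the tdouble sub-field type are assumptions taken from that example schema):

    <field name="pos" type="latlon" indexed="true" stored="true" />

    <fieldType name="latlon" class="solr.LatLonType" subFieldSuffix="_coordinate" />
    <!-- LatLonType stores its two components in dynamic sub-fields: -->
    <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false" />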
<field name="id" type="string" indexed="true" stored="true" required="true" /> <field name="geo" type="location_rpt" indexed="true" stored="true" /> <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" spatialContextFactory="com.spatial4j.core.context.jts.\ JtsSpatialContextFactory" geo="true" distErrPct="0.01" maxDistErr="0.0009" units="degrees" />We work with a maximum error of 1% in distances and a maximum absolute error of about 100 meters which is absolutely sufficient for our purposes. Improving these values leads to a significantly larger index and slower search.
Converting the Data
Once the schema is complete, the existing OpenStreetMap data can be imported into Solr. The usual approach would be to write (or configure) a DataImportHandler for the import of XML data. Unfortunately, this is not feasible in our case: the OpenStreetMap XML contains various data types, some of which reference one another, so the order of the import matters and the references have to be resolved. In addition, we do not want to import all the data into the Solr index, but to filter it in advance (see below). A fragment of an OpenStreetMap file looks like this:

    <node id="26212605" lat="49.588555" lon="11.0014352" version="7"
          timestamp="2013-01-08T17:55:18Z" changeset="14577549"
          uid="479256" user="geodreieck4711"/>
    <way id="2293021" version="12" timestamp="2013-01-08T17:55:13Z"
         changeset="14577549" uid="479256" user="geodreieck4711">
      <nd ref="26212605"/>
      <nd ref="9919314"/>
      <nd ref="2101861553"/>
      <nd ref="10443807"/>
      <tag k="bicycle" v="designated"/>
      <tag k="cycleway" v="segregated"/>
      <tag k="foot" v="designated"/>
      <tag k="highway" v="cycleway"/>
      <tag k="segregated" v="yes"/>
      <tag k="surface" v="asphalt"/>
    </way>

The fragment shows that nodes (i.e. points) are referenced from so-called "ways". Solr cannot resolve this kind of reference itself, so our software has to do it. As the OpenStreetMap data for Germany contains a few million nodes, de-referencing is not a trivial problem; we chose an approach based on a key-value database (LevelDB).

Polygons and lines are represented in Solr in the Well-Known Text (WKT) format, so the respective objects have to be converted from the OpenStreetMap format to WKT. We learned in our project that this is not quite as simple as it sounds: WKT uses the orientation of a polygon to decide which side is inside and which is outside, whereas orientation does not matter at all in OpenStreetMap, and many polygons there are oriented "incorrectly". All polygons therefore have to be converted to the correct (counter-clockwise) orientation, which can be achieved by calculating the signed area (see the sketch below).

The implementation of so-called relations is even more difficult than that of the ways mentioned above. Relations can connect ways with each other and thereby form non-contiguous polygons. Luckily, WKT and Solr also support multi-polygons. Polygons can additionally contain "holes". This can be mapped in WKT, but is irrelevant for our use case (e.g. a wood may contain a hole where there is a lake, but the hole does not affect any distance calculation from outside).
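A minimal sketch of this orientation fix, assuming a polygon ring is available as a list of lon/lat points and that the ring is closed (first point equals last point); the Point class and all names are hypothetical:

    import java.util.Collections;
    import java.util.List;

    public class RingOrientation {

        /** Simple lon/lat pair; a stand-in for whatever point type the importer uses. */
        public static class Point {
            public final double lon, lat;
            public Point(double lon, double lat) { this.lon = lon; this.lat = lat; }
        }

        /**
         * Signed shoelace area of a closed ring (first point == last point).
         * Positive for counter-clockwise orientation, negative for clockwise.
         */
        static double signedArea(List<Point> ring) {
            double sum = 0.0;
            for (int i = 0; i < ring.size() - 1; i++) {
                Point a = ring.get(i), b = ring.get(i + 1);
                sum += a.lon * b.lat - b.lon * a.lat;
            }
            return sum / 2.0;
        }

        /** Reverses the ring in place if it is clockwise, as WKT expects counter-clockwise. */
        static void forceCounterClockwise(List<Point> ring) {
            if (signedArea(ring) < 0) {
                Collections.reverse(ring);
            }
        }
    }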
Filtering
OpenStreetMap contains an unbelievable amount of interesting data for all kinds of questions. However, some objects (e.g. the location of fire hydrants or the color of park benches) are irrelevant when qualifying an address or searching the vicinity. To keep the memory requirements for indexing minimal, we filter them out. The same applies to certain contours that are not immediately relevant, e.g. buildings, agricultural roads, etc. The solution can, however, be configured to include these attributes as well, should the need arise. The actual filter is implemented in two steps: the objects to be considered are defined via a positive list; some of them may later turn out to be irrelevant and are then removed via a negative list. As certainly not all attributes are relevant either, the irrelevant ones are removed dynamically. The result of these steps is data that Solr can index directly.
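As a minimal sketch, the two-step filter might look like this; the concrete tag keys and values in the positive and negative lists are purely illustrative assumptions (in our solution they are configurable):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class OsmTagFilter {

        // Positive list: tag keys that make an object interesting at all (example values).
        private final Set<String> positive =
                new HashSet<>(Arrays.asList("amenity", "shop", "landuse", "highway"));

        // Negative list: tag values that turned out to be irrelevant after all (example values).
        private final Set<String> negative =
                new HashSet<>(Arrays.asList("fire_hydrant", "bench"));

        /** Returns true if an object with the given OSM tags should be indexed. */
        public boolean keep(Map<String, String> tags) {
            for (Map.Entry<String, String> tag : tags.entrySet()) {
                if (positive.contains(tag.getKey()) && !negative.contains(tag.getValue())) {
                    return true;
                }
            }
            return false;
        }
    }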
Indexing with Apache Solr

The next step is indexing the relevant data. This runs as a batch job that can be parallelized, as CPU is the limiting factor for geospatial data (especially contours). On a standard PC with 8 GB RAM and a 2.5 GHz quad-core processor, all data on Germany can be indexed in approximately a day. We added some further optimizations such as caching, the commit interval, etc., but describing them would exceed the scope of this article. Once this step is completed, we have a Solr-indexed version of the OpenStreetMap data available for searching.
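A minimal sketch of the indexing step with SolrJ (HttpSolrServer is the Solr 4.x client; the URL, core name, and field values are assumptions; note that WKT uses lon/lat order):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class OsmIndexer {
        public static void main(String[] args) throws Exception {
            SolrServer server = new HttpSolrServer("http://localhost:8983/solr/osm");

            // A point: the RPT field type accepts "lat,lon" as well as WKT.
            SolrInputDocument node = new SolrInputDocument();
            node.addField("id", "node/26212605");
            node.addField("geo", "49.588555,11.0014352");
            server.add(node);

            // A polygon in WKT: lon/lat order, counter-clockwise outer ring.
            SolrInputDocument area = new SolrInputDocument();
            area.addField("id", "way/2293021");
            area.addField("geo",
                "POLYGON((11.00 49.58, 11.01 49.58, 11.01 49.59, 11.00 49.59, 11.00 49.58))");
            server.add(area);

            // In the real batch job, committing per document would be far too slow.
            server.commit();
            server.shutdown();
        }
    }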
Examples of Search in Geospatial Data
Proximity search is a fairly basic and general type of search that Solr implements without any problems. For example, let's take the geospatial coordinates of our Nuremberg office (49.447206,11.102245) and search for all objects within a radius of at most 1 kilometer. This results in a Solr filter query like:

    {!geofilt sfield=geo pt=49.447206,11.102245 d=1}

After only a short while (much less than a second) we get results; however, there are 630 of them, and only the first 10 are displayed (this can be configured quite easily). The results are not sorted by distance, as that is currently a quite memory-intensive operation in Solr and not advisable for the large amount of data in our index. It is, however, easy to reduce the distance and rerun the search.

Faceting is one of the most interesting aspects of searching with Solr: additional search criteria are simulated, and for each of them the number of results that would be returned is shown. If, for example, we facet on the field amenity, the result is the following:

    <lst name="amenity">
      <int name="recycling">22</int>
      <int name="restaurant">22</int>
      <int name="telephone">18</int>
      <int name="pub">16</int>
      <int name="kindergarten">13</int>
      <int name="post_box">13</int>
      <int name="fast_food">9</int>
      <int name="school">9</int>
      <int name="vending_machine">8</int>
      <int name="place_of_worship">7</int>
      <int name="biergarten">5</int>
      <int name="cafe">5</int>
      <int name="university">4</int>
      <int name="bank">3</int>
      <int name="fuel">3</int>
      <int name="pharmacy">3</int>
      <int name="doctors">2</int>
      <int name="hospital">2</int>
    </lst>

You can facet not only on amenity but on any other indexed field, e.g. landuse. It will also become possible to facet functionally, e.g. to calculate the sum of surface areas (planned for the future). The resulting options for qualifying the data are quite extensive and can be used for exploration as well as for statistics. Faceting of this kind is not available in classic databases, which is another reason why we chose Solr for our solution.
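Put together as a minimal SolrJ sketch (the URL and core name are again assumptions), the radius search with faceting looks roughly like this:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ProximitySearch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/osm");

            SolrQuery query = new SolrQuery("*:*");
            // Everything within 1 km of the Nuremberg office.
            query.addFilterQuery("{!geofilt sfield=geo pt=49.447206,11.102245 d=1}");
            query.setFacet(true);
            query.addFacetField("amenity");

            QueryResponse response = server.query(query);
            System.out.println("hits: " + response.getResults().getNumFound());
            for (FacetField.Count c : response.getFacetField("amenity").getValues()) {
                System.out.println(c.getName() + ": " + c.getCount());
            }
            server.shutdown();
        }
    }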