This blog post will discuss creating geo-enabled applications with MongoDB, GeoJSON and MySQL.
Introduction
Recently I published a blog post about the new GIS features in MySQL 5.7. Today I’ve looked into how to use MongoDB (I’ve tested with 3.0 and 3.2, with 3.2 being much faster) for the same purpose. I will also talk about GIS in MySQL and MongoDB at Percona Live next week (together with my colleague Michael Benshoof).
MongoDB and GIS
MongoDB has a very useful feature called “geoNear.” There are other MongoDB spatial functions available to calculate the distance on a sphere (like the Earth), i.e.
$nearSphere
,
$centerSphere
,
$near
– but all of them have restrictions. The most important one is that they do not support sharding. The
geoNear
command in MongodDB, on the other hand, supports sharding. I will use
geoNear
in this post.
For this test, I exported Open Street Map data from MySQL to MongoDB (see the “Creating GEO-enabled applications with MySQL 5.6” post for more details on how to load this data to MySQL).
- Export the data to JSON. In MySQL 5.7, we can
use JSON_OBJECT to generate the JSON file:
mysql> SELECT JSON_OBJECT('name', replace(name, '"', ''), 'other_tags', replace(other_tags, '"', ''), 'geometry', st_asgeojson(shape)) as j FROM `points` INTO OUTFILE '/var/lib/mysql-files/points.json'; Query OK, 13660667 rows affected (4 min 1.35 sec)
- Use
mongoimport
to import JSON into MongoDB (I’m using 24 threads, -j 24, to use parallel import):
mongoimport --db osm --collection points -j 24 --file /var/lib/mysql-files/points.json 2016-04-11T22:38:10.029+0000 connected to: localhost 2016-04-11T22:38:13.026+0000 [........................] osm.points 31.8 MB/2.2 GB (1.4%) 2016-04-11T22:38:16.026+0000 [........................] osm.points 31.8 MB/2.2 GB (1.4%) 2016-04-11T22:38:19.026+0000 [........................] osm.points 31.8 MB/2.2 GB (1.4%) … 2016-04-11T23:12:13.447+0000 [########################] osm.points 2.2 GB/2.2 GB (100.0%) 2016-04-11T23:12:15.614+0000 imported 13660667 documents
- Create a 2d index:
mongo > use osm switched to db osm > db.points.createIndex({ geometry : "2dsphere" } ) { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }
Another option would be using the osm2mongo Ruby script, which will convert the osm file and load it directly to MongoDB.
Now I can use the
geoNear
command to find all the restaurants near my location:
> db.runCommand( { geoNear: "points", near: { type: "Point" , coordinates: [ -78.9064543, 35.9975194 ]}, spherical: true, ... query: { name: { $exists: true, $ne:null}, "other_tags": { $in: [ /.*amenity=>restaurant.*/, /.*amenity=>cafe.*/ ] } }, "limit": 5, "maxDistance": 10000 } ) { "results" : [ { "dis" : 127.30183814835166, "obj" : { "_id" : ObjectId("570329164f45f7f0d66f8f13"), "name" : "Pop's", "geometry" : { "type" : "Point", "coordinates" : [ -78.9071795, 35.998501 ] }, "other_tags" : "addr:city=>Durham,addr:country=>US,addr:housenumber=>605,addr:street=>West Main Street,amenity=>restaurant,building=>yes" } }, { "dis" : 240.82201047521244, "obj" : { "_id" : ObjectId("570329df4f45f7f0d68c16cb"), "name" : "toast", "geometry" : { "type" : "Point", "coordinates" : [ -78.9039761, 35.9967069 ] }, "other_tags" : "addr:full=>345 West Main Street, Durham, NC 27701, US,amenity=>restaurant,website=>http://toast-fivepoints.com/" } }, ... }
MongoDB 3.0 vs 3.2 with geoNear
MongoDB 3.2 features Geospatial Optimization:
MongoDB 3.2 introduces version 3 of 2dsphere indexes, which index GeoJSON geometries at a finer gradation. The new version improves performance of 2dsphere index queries over smaller regions. In addition, for both 2d indexes and 2dsphere indexes, the performance of geoNear queries has been improved for dense datasets.
I’ve tested the performance of the above
geoNear
query with MongoDB 3.0 and MongoDB 3.2 (both the old and new versions of 2dsphere index). All the results statistics are for a
"limit": 5
and
"maxDistance": 10000
.
MongoDB 3.0, index version 2:
> db.points.getIndexes() ... { "v" : 1, "key" : { "geometry" : "2dsphere" }, "name" : "geometry_2dsphere", "ns" : "osm.points", "2dsphereIndexVersion" : 2 } ] "stats" : { "nscanned" : 1728, "objectsLoaded" : 1139, "avgDistance" : 235.76379903759667, "maxDistance" : 280.2681226202938, "time" : 12 },
MongoDB 3.2, index version 2:
> db.points.getIndexes() [ ... { "v" : 1, "key" : { "geometry" : "2dsphere" }, "name" : "geometry_2dsphere", "ns" : "osm.points", "2dsphereIndexVersion" : 2 } ] ... "stats" : { "nscanned" : 513, "objectsLoaded" : 535, "avgDistance" : 235.76379903759667, "maxDistance" : 280.2681226202938, "time" : 5 },
What is interesting here is that even with the
"2dsphereIndexVersion" : 2
, MongoDB 3.2 performs much faster and scans a much smaller number of documents.
MongoDB 3.2, index version 3:
> db.points.dropIndex("geometry_2dsphere") { "nIndexesWas" : 2, "ok" : 1 } > db.points.createIndex({ geometry : "2dsphere" } ) { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 } > db.points.getIndexes() [ { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "osm.points" }, { "v" : 1, "key" : { "geometry" : "2dsphere" }, "name" : "geometry_2dsphere", "ns" : "osm.points", "2dsphereIndexVersion" : 3 } ] "stats" : { "nscanned" : 144, "objectsLoaded" : 101, "avgDistance" : 235.76379903759667, "maxDistance" : 280.2681226202938, "time" : 1 },
That is significantly faster, 1ms for five results!
MySQL and GeoJSON revisited
To compare to the performance of the above query, I’ve created a similar query in MySQL. First of all, we will need to use the good old bounding rectangle (envelope) trick to only include the points in the 10 miles radius (or so). If we don’t, MySQL will not be able to use spatial (RTREE) index. I’ve created the following function to generate the envelope:
DELIMITER // CREATE DEFINER = current_user() FUNCTION create_envelope(lat decimal(20, 14), lon decimal(20, 14), dist int) RETURNS geometry DETERMINISTIC begin declare point_text varchar(255); declare l varchar(255); declare p geometry; declare env geometry; declare rlon1 double; declare rlon2 double; declare rlat1 double; declare rlat2 double; set point_text = concat('POINT(', lon, ' ', lat, ')'); set p = ST_GeomFromText(point_text, 1); set rlon1 = lon-dist/abs(cos(radians(lat))*69); set rlon2 = lon+dist/abs(cos(radians(lat))*69); set rlat1 = lat-(dist/69); set rlat2 = lat+(dist/69); set l = concat('LineString(', rlon1, ' ', rlat1, ',', rlon2 , ' ', rlat2, ')'); set env= ST_Envelope(ST_GeomFromText(l, 1)); return env; end // DELIMITER ; mysql> set @lat= 35.9974043; Query OK, 0 rows affected (0.00 sec) mysql> set @lon = -78.9045615; Query OK, 0 rows affected (0.00 sec) mysql> select st_astext(create_envelope(@lat, @lon, 10)); +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | st_astext(create_envelope(@lat, @lon, 10)) | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | POLYGON((-79.08369589058249 35.852476764,-78.72542710941751 35.852476764,-78.72542710941751 36.142331836,-79.08369589058249 36.142331836,-79.08369589058249 35.852476764)) | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
Then we can use the following query (an update of the GeoJSON query from my previous post):
set @lat= 35.9974043; set @lon = -78.9045615; set @p = ST_GeomFromText(concat('POINT(', @lon, ' ', @lat, ')'), 1); set group_concat_max_len = 100000000; SELECT CONCAT('{ "type": "FeatureCollection", "features": [ ', GROUP_CONCAT('{ "type": "Feature", "geometry": ', ST_AsGeoJSON(shape), ', "properties": {"distance":', st_distance_sphere(shape, @p) , ', "name":"', name , '"} }' order by st_distance_sphere(shape, @p)), '] }') as j FROM points_new WHERE st_within(shape, create_envelope(@lat, @lon, 10)) and (other_tags like '%"amenity"=>"cafe"%' or other_tags like '%"amenity"=>"restaurant"%') and name is not null and st_distance_sphere(shape, @p) < 1000; ... 1 row in set (0.04 sec)
The time is slower: 40ms in MySQL compared to 1ms – 12ms in MongoDB. The box is AWS EC2 t2.medium.
To recap the difference between MongoDB
geoNear
and MySQL
st_distance_sphere
:
- MongoDB
geoNear
uses 2dsphere index, so it is fast; however, it can’t just calculate the distance between two arbitrary points - MySQL st_distance_sphere is a helper function and
will only calculate the distance between two points; it will
not use an index – we will have to use the
create_envelope
function to restrict the search so MySQL will use an index
Time-wise, this is not an apples to apples comparison as the query is quite different and uses a different technique.
Visualizing the results
Results for GeoJSON for Google Maps API:
{ "type": "FeatureCollection", "features": [ { "type": "Feature", "geometry": {"type": "Point", "coordinates": [-78.9036457, 35.997125]}, "properties": {"distance":87.67869122893659, "name":"Pizzeria Toro"} },{ "type": "Feature", "geometry": {"type": "Point", "coordinates": [-78.9039761, 35.9967069]}, "properties": {"distance":93.80064086575564, "name":"toast"} },{ "type": "Feature", "geometry": {"type": "Point", "coordinates": [-78.9031929, 35.9962871]}, "properties": {"distance":174.8300018385443, "name":"Dame's Chicken and Waffles"} }, ... }
Now we can add those on a map:
Back to MongoDB: pluses and minuses
MongoDB uses Google’s S2 library to perform GIS calculations. The
geoNear
command is fast and easy to use for finding points of interests near you (which is the most common operation). However, full GIS support does not natively exist.
Another issue I came across when creating a 2dsphere index: MongoDB is very strict when checking the lines and polygons. For example:
> db.lines.createIndex({ geometry : "2dsphere" } ) { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "errmsg" : "exception: Can't extract geo keys: { _id: ObjectId('570308864f45f7f0d6dfbed2'), name: "75 North", geometry: { type: "LineString", coordinates: [ [ -85.808852, 41.245582 ], [ -85.808852, 41.245582 ] ] }, other_tags: "tiger:cfcc=>A41,tiger:county=>Kosciusko, IN,tiger:name_base=>75,tiger:name_direction_suffix=>N,tiger:reviewed=>no" } GeoJSON LineString must have at least 2 vertices: [ [ -85.808852, 41.245582 ], [ -85.808852, 41.245582 ] ]", "code" : 16755, "ok" : 0 }
MongoDB complains about this: type: “LineString”, coordinates: [ [ -85.808852, 41.245582 ], [ -85.808852, 41.245582 ] ]
This is a “bad” line string as the starting point and ending point are the same. I had to remove the bad data from my MongoDB imported dataset, which was tricky by itself. MongoDB (as opposed to MySQL) does not have a native way to compare the values inside the JSON, so I had to use $where construct – which is slow and acquires a global lock:
> db.lines.remove({"geometry.type": "LineString", "geometry.coordinates": {$size:2}, $where: "this.geometry.coordinates[0][0] == this.geometry.coordinates[1][0] && this.geometry.coordinates[0][1] == this.geometry.coordinates[1][1]" }) WriteResult({ "nRemoved" : 22 })
After that, I was able to add the 2dsphere index.
Conclusion
MongoDB looks good, is pretty fast and easy for geo-proximity search queries – until you go outside of the one function and need full GIS support (which does not natively exist). It may be trickier to implement other GIS functions like
st_contains or st_within
.
Update: as pointed out, MongoDB actually supports $geoWithin and $geoIntersects GIS functions.
Update 2: I was asked about MySQL and GeoJSON: why not to use the ST_MakeEnvelope function. One of the issues with
ST_MakeEnvelope
is that it only works with SRID 0 (it requires point geometry arguments with an SRID of 0) and OSM data is stored with SRID 1. But also I will need to “add” 10 miles to my point. The only way to do that is to calculate the new point, 10 miles apart from “my” point/location. I have to use a custom function to manipulate the lat/lon pair.
The explain plan for the MySQL GeoJSON query shows that MySQL uses SHAPE (Spatial) index:
mysql> explain SELECT CONCAT('{ "type": "FeatureCollection", "features": [ ', GROUP_CONCAT('{ "type": "Feature", "geometry": ', ST_AsGeoJSON(shape), ', "properties": {"distance":', st_distance_sphere(shape, @p) , ', "name":"', name , '"} }' order by st_distance_sphere(shape, @p)), '] }') as j FROM points_new WHERE st_within(shape, create_envelope(@lat, @lon, 10)) and (other_tags like '%"amenity"=>"cafe"%' or other_tags like '%"amenity"=>"restaurant"%') and name is not null and st_distance_sphere(shape, @p) < 1000; *************************** 1. row *************************** id: 1 select_type: SIMPLE table: points_new partitions: NULL type: range possible_keys: SHAPE key: SHAPE key_len: 34 ref: NULL rows: 665 filtered: 18.89 Extra: Using where 1 row in set, 1 warning (0.00 sec)
And if I remove “st_within(shape, create_envelope(@lat, @lon, 10))” from the query it will show the full table scan:
mysql> explain SELECT CONCAT('{ "type": "FeatureCollection", "features": [ ', GROUP_CONCAT('{ "type": "Feature", "geometry": ', ST_AsGeoJSON(shape), ', "properties": {"distance":', st_distance_sphere(shape, @p) , ', "name":"', name , '"} }' order by st_distance_sphere(shape, @p)), '] }') as j FROM points_new WHERE (other_tags like '%"amenity"=>"cafe"%' or other_tags like '%"amenity"=>"restaurant"%') and name is not null and st_distance_sphere(shape, @p) < 1000; *************************** 1. row *************************** id: 1 select_type: SIMPLE table: points_new partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 11368798 filtered: 18.89 Extra: Using where 1 row in set, 1 warning (0.00 sec)