Reverse Geocoding using MySQL GIS

Normal geocoding is the process of taking an address and converting it into latitude and longitude. Reverse geocoding goes the other direction; given a latitude and longitude we want to know the closest street address.

Google Maps API, Yahoo Maps, Microsoft, GeoNames.Org and others can provide both these services, but they either cost, or have restrictive terms of service that makes them unavailable for high volume applications. See Geo::Coder::US (for US addresses only). This uses a local data source, and requires you to download the data from the US Census manually. It can import a format known as Tiger/Line into a local BerkeleyDB file, then uses that to calculate lat/lng values from street addresses. However, it provides no method for reverse lookups.

This is because the parsed data is simply a tree of values with the following form:

 /zip/street name/type//


Each key contains a list of numbers which are actually the spatial representation of the street, broken down into segments for each block. It is a series of line segments, which are called "shape files," because they describe the streets in spatial terms. For example:


/94702/Ashby/Ave// : $VAR1 = [
# record 1
37848907, # start lat (from) $value/1_000_000 = the decimal form.
122297583, # start lng
400, # even num start (range)
598, # even num end
0, # zero street marker
401, # odd num start
599, # odd num end
#record 2 .. N


# ends with final point
37853426,
122279480
];
] )


For forward geocoding, you simply start with a known address, then use that as a key to find the appropriate spatial data for that street. The lat/lon is an interpolated value.

For reverse geocoding, it's impossible to start with a lat/lng and return the street without iterating for each and every street and doing comparisons to find the "closest." To be able to do this efficiently, we need a spatial index on this data. CPAN has a module for R-Tree indexes, but mysql also provides a convenient way to store all this shape data, and do spatial queries.

This example uses a modified version of MySQL, which includes improved GIS support

Building A Reverse Geocoding DB

Get the Data
Here's a quick PERL script to spider the Census Bureau site and download all the data. It's split by State and county so there are a lot of files:


use HTML::Parser;
my $base_url = 'http://www2.census.gov/geo/tiger/tigerua';

package IdentityParse;
use base "HTML::Parser";
use Data::Dumper;
use LWP::Simple;

my @urls;
my $burl;

sub start {
my ($self, $tag ,$attr, $attrseq, $origtext) = @_;
unless ($tag =~ /^a$/) {
return;
}

if (defined $attr->{'href'}) {
if ($attr->{'href'} =~ /[A-Z]\/$/) { # directories: add to list
push @urls, "$burl/$attr->{'href'}";
}
if ($attr->{'href'} =~ /\.ZIP$/i ) { # zip file, download it
if (! -e $attr->{'href'}) {
# we don't have it yet!
`wget '$burl/$attr->{href}'`;
}
}
}
}


my $p = new IdentityParse;

push @urls, $base_url;
while (@urls) {
my $url = shift @urls;
my $content = get $url;
$burl = $url;
$p->parse($content);
}



Convert Using Geo::Coder::US

* Install Geo::Coder::US

This comes with a series of support scripts, among these is import_tiger_zip.pl which will import all the files into a BerkeleyDB file.

find /path/to/tigerline/data/ -name "*ZIP"  \
-exec ./import_tiger_zip.pl geocoder.db {} \;



The result is:

-rw-rw-r-- 1 root root 724M Apr 24 12:15 geocoder.db


Import into MySQL

Once you have the geocoder.db output, this script can be used to generate sql statements for inserting this into a spatially aware mysql database.


for i in `cat z2`; do perl ./bdb_mysql_export.pl \
geocoder.db $i >> geocoder.sql; done



#!/usr/bin/perl -w

use Geo::Coder::US;
use DB_File;
use Data::Dumper;
use strict;
use vars qw(%db *db);

my $filename = shift @ARGV or die "Usage: $0 \n";

Geo::Coder::US->set_db($filename);
my @streets;
my %streets;
for my $arg (@ARGV) {
my $val;
my $path = "/$arg/";
my $zip = $arg;
my ( $key, $value );
$Geo::Coder::US::DBO->seq( $key = $path, $value, R_CURSOR );
while ( $key and $value and $key =~ /^$path/i ) {
$streets{$key} =
[ split( ", ", join( ", ", unpack( "w*", $value ) ) ) ];
$Geo::Coder::US::DBO->seq( $key, $value, R_NEXT );
}

}
my $DEBUG = 0;

#print Dumper \%streets;
#my @data = values %streets;
foreach my $k (keys %streets) {
my @data = @{ $streets{$k} } ;
# if (lc($k) =~ /fairoaks/) { $DEBUG = 1; } else { $DEBUG = 0; };
# print "Doing street: $k\n" if ($DEBUG);
# print Dumper(\@data) if ($DEBUG);
# next if (!$DEBUG);
my $first = 1;
my (@from, @to, @range, @best, $matched, @range2);
shift @data;
while (@data) {
@from = splice( @data, 0, 2 ) if $data[0] > 1_000_000;
while (@data and $data[0] < 1_000_000) {
shift @data if not $data[0]; # skip street-side zero marker
@range = splice( @data, 0, 2 );
shift @data while @data and $data[0] < 1_000_000;
}
last unless @data;
@to = splice( @data, 0, 2 );

print "From : " . Dumper(\@from)."\n" if ($DEBUG);;
print "To : ". Dumper(\@to)."\n" if ($DEBUG);;
print "Range1: ". Dumper(\@range). "\n" if ($DEBUG);;
print "Range2: ". Dumper(\@range2) ."\n" if ($DEBUG);;
my %found;
@found{qw{ zip street type prefix suffix }} = split "/", substr($k, 1), 5;
@found{qw{ frlat frlong tolat tolong }} = map( $_ / 1_000_000, @from, @to );
@found{qw{ fradd toadd }} = @range;
$found{$_} *= -1 for qw/frlong tolong/;
print "Found :". Dumper(\%found)."\n" if ($DEBUG);
if ( defined($found{fradd})) {
$found{street} =~ s/'/''/g;
next if (!defined($found{tolong}));

my $q = "INSERT INTO reverse_geocode (zip, street, type, prefix, from_addr, to_addr, line_segment) VALUES ('$found{zip}','$found{street}','$found{type}', '$found{prefix}',$found{fradd}, $found{toadd}, GeomFromText('LineString($found{frlat} $found{frlong}, $found{tolat} $found{tolong})') );";
print $q ."\n"; # if ($DEBUG);
}
if (defined($data[0]) && $data[0] > 1_000_000) {
@to = splice(@data, 0,2 );
}
}
}



Here's the basic table structure:


CREATE TABLE `reverse_geocode` (
`zip` varchar(5) NOT NULL DEFAULT '0',
`street` varchar(64) NOT NULL DEFAULT '',
`type` varchar(8) NOT NULL DEFAULT '',
`from_addr` int(11) NOT NULL DEFAULT '0',
`to_addr` int(11) NOT NULL DEFAULT '0',
`line_segment` geometry NOT NULL,
`prefix` varchar(4) NOT NULL DEFAULT '',
SPATIAL KEY `line_segment` (`line_segment`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;




And the .sql file can be imported with something like:

 mysql -u -p geocode <>



The result is

mysql> show table status like 'reverse_geocode' \G
*************************** 1. row ***************************
Name: reverse_geocode
Engine: MyISAM
Version: 10
Row_format: Dynamic
Rows: 16005729
Avg_row_length: 81
Data_length: 1306599484
Max_data_length: 281474976710655
Index_length: 1021292544
Data_free: 0
Auto_increment: NULL
Create_time: 2009-04-27 13:08:31
Update_time: 2009-04-27 14:34:36
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)



Query the Data

Simply use the MBRContains fuction, and pass it a bounding box to cull results too far away. Sort by distance and take the closest result.


mysql> SET @center = GeomFromText('POINT(37.372241 -122.021671)');
Query OK, 0 rows affected (0.00 sec)

mysql> SET @radius = 0.005;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @bbox = GeomFromText(CONCAT('POLYGON((',
-> X(@center) - @radius, ' ', Y(@center) - @radius, ',',
-> X(@center) + @radius, ' ', Y(@center) - @radius, ',',
-> X(@center) + @radius, ' ', Y(@center) + @radius, ',',
-> X(@center) - @radius, ' ', Y(@center) + @radius, ',',
-> X(@center) - @radius, ' ', Y(@center) - @radius, '))')
-> );
Query OK, 0 rows affected (0.00 sec)

mysql> select zip, prefix, street, type, from_addr, to_addr, astext(line_segment), Distance(@center,line_segment) as dist
FROM reverse_geocode where MBRContains(@bbox, line_segment) order by dist limit 1;
+-------+--------+----------+------+-----------+---------+---------------------------------------------------------+------+
| zip | prefix | street | type | from_addr | to_addr | astext(line_segment) | dist |
+-------+--------+----------+------+-----------+---------+---------------------------------------------------------+------+
| 94086 | S | Fairoaks | Ave | 323 | 331 | LINESTRING(37.375141 -122.020671,37.372241 -122.021671) | 0 |
+-------+--------+----------+------+-----------+---------+---------------------------------------------------------+------+
1 row in set (0.00 sec)


Now for my project, this was good enough; knowing the closest block of the street produced the results I wanted.

For more accuracy, there are still a few issue that need to be solved:

1. For the even/odd side of each block the same line segment is used. In this case I only keep one side of the street, however the extraction script would have to be modified a little bit if you wanted both sides. It's probably most effective to add two more from/to address columns in the table to store the even/odd numbers.

2. You would have to use some basic geometry to figure out which side of the street the point is on, then use that to choose the even or odd numbers.

3. Using some more geometry, you could also find an approximate address along the line. Essentially you would interpolate a value, which may or may not be a valid address.