Normal geocoding is the process of taking an address and
converting it into latitude and longitude. Reverse geocoding goes
the other direction; given a latitude and longitude we want to
know the closest street address.
Google Maps API, Yahoo Maps, Microsoft, GeoNames.Org and
others can provide both these services, but they either cost, or
have restrictive terms of service that makes them unavailable for
high volume applications. See Geo::Coder::US (for US addresses only). This
uses a local data source, and requires you to download the data
from the US Census manually. It can import a format known as
Tiger/Line into a local BerkeleyDB file, then uses that to calculate
lat/lng values from street addresses. However, it provides no
method for reverse lookups.
This is because the parsed data is simply a tree of values with
the following form:
/zip/street name/type//
Each key contains a list of numbers which are actually the
spatial representation of the street, broken down into segments
for each block. It is a series of line segments, which are called
"shape files," because they describe the streets in spatial
terms. For example:
/94702/Ashby/Ave// : $VAR1 = [
# record 1
37848907, # start lat (from) $value/1_000_000 = the decimal form.
122297583, # start lng
400, # even num start (range)
598, # even num end
0, # zero street marker
401, # odd num start
599, # odd num end
#record 2 .. N
# ends with final point
37853426,
122279480
];
] )
For forward geocoding, you simply start with a known address,
then use that as a key to find the appropriate spatial data for
that street. The lat/lon is an interpolated value.
For reverse geocoding, it's impossible to start with a lat/lng
and return the street without iterating for each and every street
and doing comparisons to find the "closest." To be able to do
this efficiently, we need a spatial index on this data. CPAN has
a module for R-Tree indexes, but mysql also provides a
convenient way to store all this shape data, and do spatial
queries.
This example uses a modified version of MySQL,
which includes improved GIS support
Building A Reverse Geocoding DB
Get the Data
Here's a quick PERL script to spider the Census Bureau site and
download all the data. It's split by State and county so there
are a lot of files:
use HTML::Parser;
my $base_url = 'http://www2.census.gov/geo/tiger/tigerua';
package IdentityParse;
use base "HTML::Parser";
use Data::Dumper;
use LWP::Simple;
my @urls;
my $burl;
sub start {
my ($self, $tag ,$attr, $attrseq, $origtext) = @_;
unless ($tag =~ /^a$/) {
return;
}
if (defined $attr->{'href'}) {
if ($attr->{'href'} =~ /[A-Z]\/$/) { # directories: add to list
push @urls, "$burl/$attr->{'href'}";
}
if ($attr->{'href'} =~ /\.ZIP$/i ) { # zip file, download it
if (! -e $attr->{'href'}) {
# we don't have it yet!
`wget '$burl/$attr->{href}'`;
}
}
}
}
my $p = new IdentityParse;
push @urls, $base_url;
while (@urls) {
my $url = shift @urls;
my $content = get $url;
$burl = $url;
$p->parse($content);
}
Convert Using Geo::Coder::US
* Install Geo::Coder::US
This comes with a series of support scripts, among these is
import_tiger_zip.pl which will import all the files into a
BerkeleyDB file.
find /path/to/tigerline/data/ -name "*ZIP" \
-exec ./import_tiger_zip.pl geocoder.db {} \;
The result is:
-rw-rw-r-- 1 root root 724M Apr 24 12:15 geocoder.db
Import into MySQL
Once you have the geocoder.db output, this script can be used to
generate sql statements for inserting this into a spatially aware
mysql database.
for i in `cat z2`; do perl ./bdb_mysql_export.pl \
geocoder.db $i >> geocoder.sql; done
#!/usr/bin/perl -w
use Geo::Coder::US;
use DB_File;
use Data::Dumper;
use strict;
use vars qw(%db *db);
my $filename = shift @ARGV or die "Usage: $0 \n";
Geo::Coder::US->set_db($filename);
my @streets;
my %streets;
for my $arg (@ARGV) {
my $val;
my $path = "/$arg/";
my $zip = $arg;
my ( $key, $value );
$Geo::Coder::US::DBO->seq( $key = $path, $value, R_CURSOR );
while ( $key and $value and $key =~ /^$path/i ) {
$streets{$key} =
[ split( ", ", join( ", ", unpack( "w*", $value ) ) ) ];
$Geo::Coder::US::DBO->seq( $key, $value, R_NEXT );
}
}
my $DEBUG = 0;
#print Dumper \%streets;
#my @data = values %streets;
foreach my $k (keys %streets) {
my @data = @{ $streets{$k} } ;
# if (lc($k) =~ /fairoaks/) { $DEBUG = 1; } else { $DEBUG = 0; };
# print "Doing street: $k\n" if ($DEBUG);
# print Dumper(\@data) if ($DEBUG);
# next if (!$DEBUG);
my $first = 1;
my (@from, @to, @range, @best, $matched, @range2);
shift @data;
while (@data) {
@from = splice( @data, 0, 2 ) if $data[0] > 1_000_000;
while (@data and $data[0] < 1_000_000) {
shift @data if not $data[0]; # skip street-side zero marker
@range = splice( @data, 0, 2 );
shift @data while @data and $data[0] < 1_000_000;
}
last unless @data;
@to = splice( @data, 0, 2 );
print "From : " . Dumper(\@from)."\n" if ($DEBUG);;
print "To : ". Dumper(\@to)."\n" if ($DEBUG);;
print "Range1: ". Dumper(\@range). "\n" if ($DEBUG);;
print "Range2: ". Dumper(\@range2) ."\n" if ($DEBUG);;
my %found;
@found{qw{ zip street type prefix suffix }} = split "/", substr($k, 1), 5;
@found{qw{ frlat frlong tolat tolong }} = map( $_ / 1_000_000, @from, @to );
@found{qw{ fradd toadd }} = @range;
$found{$_} *= -1 for qw/frlong tolong/;
print "Found :". Dumper(\%found)."\n" if ($DEBUG);
if ( defined($found{fradd})) {
$found{street} =~ s/'/''/g;
next if (!defined($found{tolong}));
my $q = "INSERT INTO reverse_geocode (zip, street, type, prefix, from_addr, to_addr, line_segment) VALUES ('$found{zip}','$found{street}','$found{type}', '$found{prefix}',$found{fradd}, $found{toadd}, GeomFromText('LineString($found{frlat} $found{frlong}, $found{tolat} $found{tolong})') );";
print $q ."\n"; # if ($DEBUG);
}
if (defined($data[0]) && $data[0] > 1_000_000) {
@to = splice(@data, 0,2 );
}
}
}
Here's the basic table structure:
CREATE TABLE `reverse_geocode` (
`zip` varchar(5) NOT NULL DEFAULT '0',
`street` varchar(64) NOT NULL DEFAULT '',
`type` varchar(8) NOT NULL DEFAULT '',
`from_addr` int(11) NOT NULL DEFAULT '0',
`to_addr` int(11) NOT NULL DEFAULT '0',
`line_segment` geometry NOT NULL,
`prefix` varchar(4) NOT NULL DEFAULT '',
SPATIAL KEY `line_segment` (`line_segment`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
And the .sql file can be imported with something like:
mysql -u -p geocode <>
The result is
mysql> show table status like 'reverse_geocode' \G
*************************** 1. row ***************************
Name: reverse_geocode
Engine: MyISAM
Version: 10
Row_format: Dynamic
Rows: 16005729
Avg_row_length: 81
Data_length: 1306599484
Max_data_length: 281474976710655
Index_length: 1021292544
Data_free: 0
Auto_increment: NULL
Create_time: 2009-04-27 13:08:31
Update_time: 2009-04-27 14:34:36
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)
Query the Data
Simply use the MBRContains fuction, and pass it a bounding box to
cull results too far away. Sort by distance and take the closest
result.
mysql> SET @center = GeomFromText('POINT(37.372241 -122.021671)');
Query OK, 0 rows affected (0.00 sec)
mysql> SET @radius = 0.005;
Query OK, 0 rows affected (0.00 sec)
mysql> SET @bbox = GeomFromText(CONCAT('POLYGON((',
-> X(@center) - @radius, ' ', Y(@center) - @radius, ',',
-> X(@center) + @radius, ' ', Y(@center) - @radius, ',',
-> X(@center) + @radius, ' ', Y(@center) + @radius, ',',
-> X(@center) - @radius, ' ', Y(@center) + @radius, ',',
-> X(@center) - @radius, ' ', Y(@center) - @radius, '))')
-> );
Query OK, 0 rows affected (0.00 sec)
mysql> select zip, prefix, street, type, from_addr, to_addr, astext(line_segment), Distance(@center,line_segment) as dist
FROM reverse_geocode where MBRContains(@bbox, line_segment) order by dist limit 1;
+-------+--------+----------+------+-----------+---------+---------------------------------------------------------+------+
| zip | prefix | street | type | from_addr | to_addr | astext(line_segment) | dist |
+-------+--------+----------+------+-----------+---------+---------------------------------------------------------+------+
| 94086 | S | Fairoaks | Ave | 323 | 331 | LINESTRING(37.375141 -122.020671,37.372241 -122.021671) | 0 |
+-------+--------+----------+------+-----------+---------+---------------------------------------------------------+------+
1 row in set (0.00 sec)
Now for my project, this was good enough; knowing the closest
block of the street produced the results I wanted.
For more accuracy, there are still a few issue that need to be
solved:
1. For the even/odd side of each block the same line segment is
used. In this case I only keep one side of the street, however
the extraction script would have to be modified a little bit if
you wanted both sides. It's probably most effective to add two
more from/to address columns in the table to store the even/odd
numbers.
2. You would have to use some basic geometry to figure out which
side of the street the point is on, then use that to choose the
even or odd numbers.
3. Using some more geometry, you could also find an approximate
address along the line. Essentially you would interpolate a
value, which may or may not be a valid address.