Efficiently Querying for Nearby Things

2013-05-06

It’s a fairly common use case to have a latitude and longitude and want to find the closest objects to a given point. While there are heavyweight solutions: MySQL Spatial Extensions, PostGIS, they can be more trouble than they’re worth especially if you’re making use of an ORM.

Euclidean Distance (naive, flawed, but simple)

The most naive approach is to calculate the Euclidean distance between the objects and point. It works OK so long as the distances are small and you’re closer to the equator than you are to the poles. You will likely recognize the math from high school geometry class, a^2 + b^2 = c^2. A solution in sql (MySQL) might look something like the following.

CREATE TABLE Restaurant (
    id INTEGER AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) NOT NULL,
    lat DOUBLE NOT NULL,
    lon DOUBLE NOT NULL
);

set @point_lat = 37.757001;
set @point_lon = -122.420588;
set @radius = 100;
SELECT * from (SELECT id, name, lat, lon, 
               sqrt(pow(@point_lat - lat, 2) + pow(@point_lon - lon, 2)) 
               * 110574.61087757687 as distance
               FROM Restaurant) i
    WHERE distance < @radius ORDER BY distance ASC;

We’re computing the square root of the differences between the two points squared. If it’s less than our search radius we let it through and sort the passing Restaurants by their distance away from the search point. Note the use of a sub-select. Depending on the level of strictness you have MySQL running with (keep your eye out for an upcoming post) you cannot use an aggregate in a where clause. WHERE is applied before computing aggregates. Alternatively you could make use of HAVING, but I have chosen this route. The other thing that needs explaining is the number 110574.61087757687. It’s the distance in meters of a single degree of latitude. You’ll see it used throughout as we’ll be working in meters.

Performance (too much math)

There are two problems here, the first one is an issue of performance. As written the above query will run 2 subtractions, take two things to the power of 2, and then do a square root on every single row in the Restaurant table. As soon as you have a city’s worth of rows that’s going to be a performance problem and it’ll only get worse from there.

+----+-------------+------------+------+---------------+------+---------+------+-------+-----------------------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows  | Extra                       |
+----+-------------+------------+------+---------------+------+---------+------+-------+-----------------------------+
|  1 | PRIMARY     | <derived2> | ALL  | NULL          | NULL | NULL    | NULL | 35906 | Using where; Using filesort |
|  2 | DERIVED     | Restaurant | ALL  | NULL          | NULL | NULL    | NULL | 32160 |                             |
+----+-------------+------------+------+---------------+------+---------+------+-------+-----------------------------+
2 rows in set (0.08 sec)

There’s a relatively easy solution. We need to consider fewer rows and avoid doing as much computation as possible. The trick is to create a bounding box and only consider rows that fall inside of it. Since we are searching within a radius we know that anything that falls outside of a box with a width of 2 * radius centered around our point will fall outside of a circle of the same size.

set @r = (@radius / 110574.61087757687);
set @lat_min = @point_lat - @r;
set @lat_max = @point_lat + @r;
set @r = (@radius / 110574.61087757687);
set @lon_min = @point_lon - @r;
set @lon_max = @point_lon + @r;

SELECT * from (SELECT id, name, lat, lon, 
               sqrt(pow(@point_lat - lat, 2) + pow(@point_lon - lon, 2)) 
               * 110574.61087757687 as distance
               FROM Restaurant 
               WHERE lat BETWEEN @lat_min AND @lat_max AND
                     lon BETWEEN @lon_min AND @lon_max) i
    WHERE distance < @radius ORDER BY distance ASC;

We’ve added a where clause to the inner select that will filter out rows that have no chance at being inside our radius so when our search point is in San Francisco and we’re looking for stuff within 1000m we won’t bother doing the math for things in NYC, or even on the other side of town.

That’s not quite enough though. If we use explain plan we’ll see that a table scan is occurring to do the bounding box check. We’ll need an index to address this.

... explain query ...

+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-----------------------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra                       |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-----------------------------+
|  1 | PRIMARY     | <derived2> | ALL  | NULL          | NULL | NULL    | NULL |   122 |   100.00 | Using where; Using filesort |
|  2 | DERIVED     | Restaurant | ALL  | NULL          | NULL | NULL    | NULL | 32160 |   100.00 | Using where                 |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-----------------------------+
2 rows in set, 1 warning (0.05 sec)

CREATE INDEX Restaurant_lat_lon ON Restaurant(lat, lon);

... explain query ...

+----+-------------+------------+-------+--------------------+--------------------+---------+------+------+----------+-----------------------------+
| id | select_type | table      | type  | possible_keys      | key                | key_len | ref  | rows | filtered | Extra                       |
+----+-------------+------------+-------+--------------------+--------------------+---------+------+------+----------+-----------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL               | NULL               | NULL    | NULL |  122 |   100.00 | Using where; Using filesort |
|  2 | DERIVED     | Restaurant | range | Restaurant_lat_lon | Restaurant_lat_lon | 16      | NULL | 1043 |   100.00 | Using where                 |
+----+-------------+------------+-------+--------------------+--------------------+---------+------+------+----------+-----------------------------+
2 rows in set, 1 warning (0.01 sec)

Even in my tiny test dataset there’s over an order of magnitude reduction in the number of rows considered and it sped things up around 5x (though admittedly both are fast and two runs is hardly a scientific comparison.)

Correctness (now that’s much better)

So now that things are fast and we’re avoiding doing math as much as possible we can do a much better job at accurately computing the distance between points. The Earth is not flat, it’s a sphere. While a degree of latitude is constant anywhere it occurs, the same can’t be said for longitude. A degree of longitude is much smaller near the poles than the equator.

Enter the Haversine Formula which accurately computes great-circle distances, the distance between two points on a sphere, and while the Earth isn’t a perfect sphere it’s much closer to one than a plane. I won’t bother explaining the math, wikipedia already does so and I’d just mess it up. I will however translate it in to SQL for you. We’ll also adjust our bonding box calculation to account for longitudinal variation.

set @r = (@radius / 110574.61087757687);
set @lat_min = @point_lat - @r;
set @lat_max = @point_lat + @r;
set @r = (@radius / abs(cos(radians(@point_lat)) * 110574.61087757687));
set @lon_min = @point_lon - @r;
set @lon_max = @point_lon + @r;

SELECT * from (SELECT id, name, lat, lon, 12756200 *
               asin(sqrt(pow(sin(radians(@point_lat - lat) * 0.5), 2) +
                         cos(radians(@point_lat)) *
                         cos(radians(lat)) *
                         pow(sin(radians(@point_lon - lon) * 0.5), 2)))
               as distance FROM Restaurant
               WHERE lat BETWEEN @lat_min AND @lat_max AND
                     lon BETWEEN @lon_min AND @lon_max) i
    WHERE distance < @radius ORDER BY distance ASC;

If you look closely you’ll notice a new magic number in there, 12756200 which is the diameter of the earth in meters. This query will perform roughly the same as the previous performance enhanced version even though it’s doing more complicated math. If we weren’t filtering with a bounding box we’d see a much more apparent slow down.

I plan to follow this post up next with with one on how I implemented this solution with a Django model focusing on optimizing the queries run to fetch related data. (found here.)

Note: the foundation for my final solution was a MySQL presentation on the subject, but I found it to be lacking in explanation and there were quite a few things that could (needed to really) be optimized. If you’d like to take a look at that document it can be found here.