XOR Media

High Performance Web - Reducing Database Round Trips

Background

There are two main sources of latency in the backend of web applications: rendering (HTML templating or data serialization) and IO (database or external service calls.) Today we’ll look at the latter and more specifically focus getting rid of extraneous database round trips. The fastest query possible is one that doesn’t happen.

Before we get in to the details I’ll mention the importance of schema, but won’t spend time on it here. No single factor will affect the overall responsiveness of your application, and more specifically your ability to tune it, than an effective data(base) schema. This is something we’ll return to with detail in the future, but I bring it up now as schema issues can often limit your options.

I’ll also mention caching. In an optimal situation the things we’ll talk about will help to reduce the amount of time required to serve a result that can’t be cached or is a cache miss, but just as reducing the number of database round trips by optimizing querying helps, avoiding them altogether via caching is a big win.

A final note, I’ll discuss this topic in the context of Python (Django,) but it applies equally well in any language and ORM. I cut my teeth with Java (Hibernate) and have worked with Perl (DBIx::Class,) Ruby (Rails,) and several others since.

The N+1 Selects Problem

The most common and egregious round-trip problem is the n+1 selects problem. In it’s simplest form it involves an object with children, a one-to-many relationship. A minimal example follows.

from django.db import models


class State(models.Model):
    name = models.CharField(max_length=64)
    country = models.ForeignKey(Country, related_name='states')

    class Meta:
        ordering = ('name',)


class City(models.Model):
    name = models.CharField(max_length=64)
    state = models.ForeignKey(State, related_name='cities')

    class Meta:
        ordering = ('name',)

We have states which have zero or more cities and our use case is to print out a nested list of states and their cities.

Alaska
    Anchorage
    Fairbanks
    Willow
California
    Berkeley
    Monterey
    Palo Alto
    San Diego
    San Francisco
    Santa Cruz
Kentucky
    Albany
    Monticello
    Lexington
    Louisville
    Somerset
    Stamping Ground

The code to accomplish this looks something like:

from django.shortcuts import render_to_response
from django.template.context import RequestContext
from locations.models import State

def list_locations(request):
    data = {'states': State.objects.all()}
    return render_to_response('list_locations.html', data,
                              RequestContext(request))
...
<ul>

</ul>
...

If we run the above code to generate an HTML page and look at the database queries, using django-debug-toolbar, we’ll see a single query to list the states and then a query for each of the states to retrieve the cities. With 3 states that’s not a huge deal, but if it were all 50 we’d be looking at 1 query, the +1, to get the states and 50, the n, to get all of the cities.

2N+1 (no that’s not a thing)

Before we fix the N+1 queries I want to add an additional piece of information for each state, it’s country. This one will go in the other direction on a one-to-many relationship. Each state will have exactly one country.

Alaska (United States)
...
...

class Country(models.Model):
    name = models.CharField(max_length=64)

class State(models.Model):
    name = models.CharField(max_length=64)
    country = models.ForeignKey(Country, related_name='states')

...
...
<li> ()
...

If we take a look at the SQL tab of django-debug-toolbar we’ll see there are now queries happening to pull in the country object for each state. Also note that we’re retrieving the same country over and over, since it’s shared by all of the states.

2N+1 queries, not
good

Now we have a couple interesting problems to solve, conveniently one for each of Django ORM’s primary solutions.

select_related

states = State.objects.select_related('country').all()

select_related works by joining in SQL between the primary object (state) and the other object (country.) In this case that will get rid of the extra queries to fetch the country objects for each state. So if a database round-trip takes 20ms to transit the network, run, and return we’ll see a N * 20ms speedup. For any reasonable size of N that’ll be noticeable.

The new query to retrieve the state objects will look something like the following.

SELECT ... FROM "locations_state"
    INNER JOIN "locations_country" ON
        ("locations_state"."country_id" = "locations_country"."id")
    ORDER BY "locations_state"."name" ASC
...

The above query replaces the original state fetching query and rids us of the secondary queries to retrieve the country objects. There is however a potential downside to solving the problem this way, namely we’re getting back the same country object repeatedly and thus having to send it’s columns through the ORM code over and over and creating lots of duplicate objects. We’ll come back to this in a moment.

One last thing before we move on, in Django’s ORM select_related does not work when there are multiple objects on the other end of an association. We can make use of it to grab the country for a state, but if we were to add ‘cities’ to the select_related call, nothing would happen. Other ORMs (Hibernate for example) don’t share this limitation, but you must be very careful when making use of it in those situations. In those cases the join will duplicate the primary object’s data for each secondary object. That can quickly balloon out of control and leave the ORM processing lots of data/rows.

Given all of this, one of the best uses of select_related is when you’re fetching a single object and want to pull back (a) related object in a single shot. It’ll do so in a single round-trip to the database and won’t end up duplicating too much data, any in the case of Django ORM’s one-to-one only support.

prefetch_related

states = State.objects.prefetch_related('country', 'cities').all()

In contrast prefetch_related works by collecting all of the ids for the related objects, fetches them in a single batch, and then transparently attaches them to the objects. The best part is that it can be used on one-to-many relationships, which is what the state to cities relationship is in our example.

The SQL will now look something like the following.

SELECT ... FROM "locations_state" ORDER BY "locations_state"."name" ASC
SELECT ... FROM "locations_country" WHERE "locations_country"."id" IN (1)
SELECT ... FROM "locations_city"
    WHERE "locations_city"."state_id" IN (1, 2, 3)
    ORDER BY "locations_city"."name" ASC

So we’ve gone from 2N+1 queries to 3. Getting rid of N is a big deal. 3 * 20ms is way less than (2 * 50 + 1) * 20ms or even (50 + 1) * 20ms in the case of using select_related alone.

The above example uses prefetch on both country and cities. As mentioned in the previous section all of the states share a single country and when using select_related that means we pulled back and processed its data N times along side each state object. In contrast using prefetch_related we’ll only pull back the country object once. This however comes at the cost of an extra db round trip so it may be preferable to use a combination, you’ll have to test in your specific circumstances. However, in this case using both select_related and prefetch_related takes us down to 2 * 20ms, it’ll probably be faster than 3 queries, but there are a lot of potential factors.

states = State.objects.select_related('country') \
    .prefetch_related('cities').all()

2 queries, pretty
good

How Deep

What about when you want to span multiple levels? Both select_related and prefetch_related can traverse relationships using double-underscores. When you make use of the ability the intermediate objects will be included as well. This can be extremely powerful, but a little hard to grok in more complex object models.

# only works when there's a single object at each step
city = City.objects.select_related('state__country').all()[0]
# 1 query, no further db queries
print('{0} - {1} - {2}'.format(city.name, city.state.name,
                               city.state.country.name)

# works for both single and multiple object relationships
countries = Country.objects.prefetch_related('states__cities')
# 3 queries, no further db queries
for country in countries:
    for state in country.states:
        for city in state.cities:
            print('{0} - {1} - {2}'.format(city.name, city.state.name,
                                           city.state.country.name)

prefetch_related on raw queries

One final Django specific note. In last week’s post on efficiently querying for nearby things I worked through a complicated SQL query to find the closest objects to a given lat/lon point. The best way to go about doing this sort of query with Django is to use a raw sql query. The issue is that raw query sets don’t support prefetch_related, which is a bummer. There is however a work-around, you can directly make use of prefetch_related_objects, which Django uses under the hood to power prefetch_related.

from django.db.models.query import prefetch_related_objects

# prefetch_related_objects requires a list, it won't work on a QuerySet so
# we need to convert with list()
cities = list(City.objects.raw('<sql-query-for-nearby-cities>'))
prefetch_related_objects(cities, ('state__country',))
# 3 queries, no further db queries
for city in cities:
    print('{0} - {1} - {2}'.format(city.name, city.state.name,
                                   city.state.country.name)

That’s pretty cool.