XOR Media

Coding, Operations, Etc.

High Performance Web: Reducing Database Round Trips

Posted on Mon 13 May 2013 by

Background

There are two main sources of latency in the backend of web applications: rendering (HTML templating or data serialization) and IO (database or external service calls.) Today we'll look at the latter and more specifically focus getting rid of extraneous database round trips. The fastest query possible is one that doesn't happen.

Before we get in to the details I'll mention the importance of schema, but won't spend time on it here. No single factor will affect the overall responsiveness of your application, and more specifically your ability to tune it, than an effective data(base) schema. This is something we'll return to with detail in the future, but I bring it up now as schema issues can often limit your options.

I'll also mention caching. In an optimal situation the things we'll talk about will help to reduce the amount of time required to serve a result that can't be cached or is a cache miss, but just as reducing the number of database round trips by optimizing querying helps, avoiding them altogether via caching is a big win.

A final note, I'll discuss this topic in the context of Python (Django,) but it applies equally well in any language and ORM. I cut my teeth with Java (Hibernate) and have worked with Perl (DBIx::Class,) Ruby (Rails,) and several others since.

The N+1 Selects Problem

The most common and egregious round-trip problem is the n+1 selects problem. In it's simplest form it involves an object with children, a one-to-many relationship. A minimal example follows.

from django.db import models


class State(models.Model):
    name = models.CharField(max_length=64)
    country = models.ForeignKey(Country, related_name='states')

    class Meta:
        ordering = ('name',)


class City(models.Model):
    name = models.CharField(max_length=64)
    state = models.ForeignKey(State, related_name='cities')

    class Meta:
        ordering = ('name',)

We have states which have zero or more cities and our use case is to print out a nested list of states and their cities.

Alaska
    Anchorage
    Fairbanks
    Willow
California
    Berkeley
    Monterey
    Palo Alto
    San Diego
    San Francisco
    Santa Cruz
Kentucky
    Albany
    Monticello
    Lexington
    Louisville
    Somerset
    Stamping Ground

The code to accomplish this looks something like:

from django.shortcuts import render_to_response
from django.template.context import RequestContext
from locations.models import State

def list_locations(request):
    data = {'states': State.objects.all()}
    return render_to_response('list_locations.html', data,
                              RequestContext(request))
...
<ul>
{% for state in states %}
<li>{{ state.name }}
    <ul>
        {% for city in state.cities.all %}
        <li>{{ city.name }}</li>
        {% endfor %}
    </ul>
</li>
{% endfor %}
</ul>
...

If we run the above code to generate an HTML page and look at the database queries, using django-debug-toolbar, we'll see a single query to list the states and then a query for each of the states to retrieve the cities. With 3 states that's not a huge deal, but if it were all 50 we'd be looking at 1 query, the +1, to get the states and 50, the n, to get all of the cities.

2N+1 (no that's not a thing)

Before we fix the N+1 queries I want to add an additional piece of information for each state, it's country. This one will go in the other direction on a one-to-many relationship. Each state will have exactly one country.

Alaska (United States)
...
...

class Country(models.Model):
    name = models.CharField(max_length=64)

class State(models.Model):
    name = models.CharField(max_length=64)
    country = models.ForeignKey(Country, related_name='states')

...
...
<li>{{ state.name }} ({{ state.country.name }})
...

If we take a look at the SQL tab of django-debug-toolbar we'll see there are now queries happening to pull in the country object for each state. Also note that we're retrieving the same country over and over, since it's shared by all of the states.

2N+1 queries, not good

Now we have a couple interesting problems to solve, conveniently one for each of Django ORM's primary solutions.

How Deep

What about when you want to span multiple levels? Both select_related and prefetch_related can traverse relationships using double-underscores. When you make use of the ability the intermediate objects will be included as well. This can be extremely powerful, but a little hard to grok in more complex object models.

# only works when there's a single object at each step
city = City.objects.select_related('state__country').all()[0]
# 1 query, no further db queries
print('{0} - {1} - {2}'.format(city.name, city.state.name,
                               city.state.country.name)

# works for both single and multiple object relationships
countries = Country.objects.prefetch_related('states__cities')
# 3 queries, no further db queries
for country in countries:
    for state in country.states:
        for city in state.cities:
            print('{0} - {1} - {2}'.format(city.name, city.state.name,
                                           city.state.country.name)
In code, tagged: db, examples, performance, and sql.

About the Author

Ross McFarland Ross McFarland | | |

Ross is a 13 year veteran of the software industry with experience spanning low-level signal processing, web and mobile user interfaces, and high-scale distributed web services including Amazon.com's Digital Media Group. He has made extensive contributions to open source highlighted by his time as a primary maintainer of Gtk2-Perl and author of requests-futures and python-asynchttp librarys. (more)


Comments