Django/Postgres: Optimize COUNT(*) by replacing with an estimate

Every week I take a look at a report generated by the excellent pgFouine tool for our production Postgres database. This week, there was a surprising query on the top of the slow runners list:

SELECT COUNT(*) FROM website_job;

This query was taking an average of 3 seconds! Granted, this is a table with 500,000 rows in it, but still. Naively, you might expect that Postgres could answer this directly out of some kind of meta data about the table. However, slow count operations are a well known issue.

PostgreSQL must walk through all rows, in some sense. This normally results in a sequential scan reading information about every row in the table.

It turns out that Postgres can quickly count result sets that use an index, such as:

-- assuming is_open is indexed
SELECT COUNT(*) FROM website_job WHERE is_open=1;

But what about the vanilla case? The documentation suggests that you keep track of the count of records yourself, either with triggers or in your application code. However, they also throw out some SQL that can be used to estimate the number of records in a table. For many purposes, that may be sufficient. Here is a Django helper method that does just that:

from django.db import connection

def estimate_count_fast(type):
    ''' postgres really sucks at full table counts, this is a faster version
    see: http://wiki.postgresql.org/wiki/Slow_Counting '''
    cursor = connection.cursor()
    cursor.execute("select reltuples from pg_class where relname='website_%s';" % type)
    row = cursor.fetchone()
    return int(row[0])

>estimate_count_fast('job')
512387


I'm currently working at NerdWallet, a startup in San Francisco trying to bring clarity to all of life's financial decisions. We're hiring like crazy. Hit me up on Twitter, I would love to talk.

Follow @chase_seibert on Twitter