Website analytics tools are expected to do a lot. They need to allow you generate reports on metrics like sessions, pages/time per session, bounce rates, referrals... etc. Not only that, but they need to be able to segment those reports by logged in versus anonymous users, china versus the US, etc...

As a developer, the more you play with existing analytics tools, the more you appreciate not having to implement all this crap yourself. Just recently I was trying to track hits from social networks. My application would blast a tweet out to twitter, and then try to determine how many hits were coming back.

Determining raw hits is easy. But I had no idea how many automated bots would hit the URLS I was posting. I had nearly 1,2000 hits in just 24 hours to the same post! Here is a short list of bot user agents collected in just one day.

  • LinkedInBot
  • TwitterJobSearch.com
  • PycURL
  • labs.topsy.com
  • postrank.com
  • voyager/1.0
  • JS-Kit URL Resolver
  • Twitterbot
  • mxbot
  • urllib
  • ysearch
  • Twingly
  • TweetmemeBot
  • OneRiot
  • Googlebot
  • inagist.com
  • Jakarta Commons
  • facebookexternalhit
  • NjuiceBot
  • Yahoo! Slurp

Needless to say, tying to determine "real" hits by user agent was going to be both arduous and error prone.

Enter Google Analytics, the gold standard of web analytics packages. My marketing department had already requesting we include it into the application anyway.

It works via JavaScript, which is perfect for thwarting bots. The bots in question don't appear to execute JavaScript when they scrape a page, or if they do, Google Analytics is going above and beyond to filter them out.

<script type="text/javascript">

 var _gaq = _gaq || [];
 _gaq.push(['_setAccount', 'UA-XXXXXXX-X']);

 // custom variables
 _gaq.push(['_setCustomVar', 1, 'Current User', 'joe.example', 1]);
 _gaq.push(['_setCustomVar', 2, 'Current User ID', '75', 1]);
 _gaq.push(['_setCustomVar', 3, 'Record Owner ID', '101', 3]);
 _gaq.push(['_setCustomVar', 4, 'Record Type', 'Job', 3]);
 _gaq.push(['_setCustomVar', 5, 'Record ID', '87', 3]);
 _gaq.push(['_trackPageview']);

 (function() {
  var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
  ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
  var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
 })();

</script>

As you can seen, as long as I was including Google Analytics, I decided to go all out and leverage their custom variables. Basically, that allows you to slice and dice all their reports by your own custom segments. In this case, I'm allowing our marketing team to report on metrics by user and page type.

But I also needed to include some metrics right in the product itself. To my delight, it's not only possible to pull down analytics data via the Google Analytics API, it's also surprisingly easy thanks to an official Python library. Not only that, but the API is extremely flexible.

from django.conf import settings

import gdata.analytics.client
from gdata.sample_util import CLIENT_LOGIN, SettingsUtil

gdata_client = gdata.analytics.client.AnalyticsClient(
    source=settings.GOOGLE_ANALYTICS_APP_NAME
    )

def _login():

    settings_util = SettingsUtil(prefs={
        "email": settings.GOOGLE_ANALYTICS_USER_EMAIL,
        "password": settings.GOOGLE_ANALYTICS_USER_PASS,
    })
    settings_util.authorize_client(
        gdata_client,
        service=gdata_client.auth_service,
        auth_type=CLIENT_LOGIN,
        source=settings.GOOGLE_ANALYTICS_APP_NAME,
        scopes=['https://www.google.com/analytics/feeds/']
        )

def get_views(year, week):

    _login()
    data_query = gdata.analytics.client.DataFeedQuery({
        'ids': settings.GOOGLE_ANALYTICS_TABLE_ID,
        'start-date': '2010-10-01',
        'end-date': '2100-01-01',
        'dimensions': 'ga:customVarValue3,ga:customVarValue4,ga:week',
        'metrics': 'ga:pageviews',
        'filters': 'ga:customVarValue4==Job,ga:customVarValue4==Profile;ga:week==%s;ga:year==%s' % (week, year),
        'max-results': "10000"
        })

    return gdata_client.GetDataFeed(data_query)

Here, I'm getting page view data for October, and creating a pivot table right in the API results by record owner and record type (either Job or Profile). I'm also grouping the results by week. The API will return up to 10,000 results, but I could easily break it into smaller chunks if necessary.

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xmlns:dxp='http://schemas.google.com/analytics/2009' xmlns:openSearch='http://a9.com/-/spec/opensearch/1.1/' xmlns:gd='http://schemas.google.com/g/2005' gd:etag='W/&quot;D04HQn4_fCp7I2A9Wx5VFUs.&quot;' gd:kind='analytics#data'>
 <id>http://www.google.com/analytics/feeds/data?ids=ga:35391211&amp;dimensions=ga:customVarValue3,ga:customVarValue4,ga:week&amp;metrics=ga:visits&amp;filters=ga:customVarValue4%3D%3DJob,ga:customVarValue4%3D%3DProfile&amp;start-date=2010-10-01&amp;end-date=2010-10-31</id>
 <updated>2010-10-08T12:25:33.044-07:00</updated>
 <title>Google Analytics Data for Profile XXXXXXXX</title>
 <link rel='self' type='application/atom+xml' href='https://www.google.com/analytics/feeds/data?max-results=5&amp;sort=-ga%3Avisits&amp;end-date=2010-10-31&amp;start-date=2010-10-01&amp;metrics=ga%3Avisits&amp;ids=ga%3A35391211&amp;dimensions=ga%3AcustomVarValue3%2Cga%3AcustomVarValue4%2Cga%3Aweek&amp;filters=ga%3AcustomVarValue4%3D%3DJob%2Cga%3AcustomVarValue4%3D%3DProfile'/>
 <link rel='next' type='application/atom+xml' href='https://www.google.com/analytics/feeds/data?start-index=6&amp;max-results=5&amp;sort=-ga%3Avisits&amp;end-date=2010-10-31&amp;start-date=2010-10-01&amp;metrics=ga%3Avisits&amp;ids=ga%3A35391211&amp;dimensions=ga%3AcustomVarValue3%2Cga%3AcustomVarValue4%2Cga%3Aweek&amp;filters=ga%3AcustomVarValue4%3D%3DJob%2Cga%3AcustomVarValue4%3D%3DProfile'/>
 <author>
  <name>Google Analytics</name>
 </author>
 <generator version='1.0'>Google Analytics</generator>
 <openSearch:totalResults>11</openSearch:totalResults>
 <openSearch:startIndex>1</openSearch:startIndex>
 <openSearch:itemsPerPage>5</openSearch:itemsPerPage>
 <dxp:aggregates>
  <dxp:metric confidenceInterval='0.0' name='ga:visits' type='integer' value='35'/>
 </dxp:aggregates>
 <dxp:dataSource>
  <dxp:property name='ga:profileId' value='XXXXXXX'/>
  <dxp:property name='ga:webPropertyId' value='UA-XXXXXXXXX'/>
  <dxp:property name='ga:accountName' value='Talladega Alpha'/>
  <dxp:tableId>ga:XXXXXXX</dxp:tableId>
  <dxp:tableName>reachrecruiters.com</dxp:tableName>
 </dxp:dataSource>
 <dxp:endDate>2010-10-31</dxp:endDate>
 <dxp:startDate>2010-10-01</dxp:startDate>
 <entry gd:etag='W/&quot;CUUEQX47eSp7I2A9Wx5bFEU.&quot;' gd:kind='analytics#datarow'>
  <id>http://www.google.com/analytics/feeds/data?ids=ga:35391211&amp;ga:customVarValue3=137&amp;ga:customVarValue4=Job&amp;ga:week=41&amp;filters=ga:customVarValue4%3D%3DJob,ga:customVarValue4%3D%3DProfile&amp;start-date=2010-10-01&amp;end-date=2010-10-31</id>
  <updated>2010-10-30T17:00:00.001-07:00</updated>
  <title>ga:customVarValue3=137 | ga:customVarValue4=Job | ga:week=41</title>
  <link rel='alternate' type='text/html' href='http://www.google.com/analytics'/>
  <dxp:dimension name='ga:customVarValue3' value='137'/>
  <dxp:dimension name='ga:customVarValue4' value='Job'/>
  <dxp:dimension name='ga:week' value='41'/>
  <dxp:metric confidenceInterval='0.0' name='ga:visits' type='integer' value='15'/>
 </entry>
 <entry gd:etag='W/&quot;CUUEQX47eSp7I2A9Wx5bFEU.&quot;' gd:kind='analytics#datarow'>
  <id>http://www.google.com/analytics/feeds/data?ids=ga:35391211&amp;ga:customVarValue3=138&amp;ga:customVarValue4=Job&amp;ga:week=41&amp;filters=ga:customVarValue4%3D%3DJob,ga:customVarValue4%3D%3DProfile&amp;start-date=2010-10-01&amp;end-date=2010-10-31</id>
  <updated>2010-10-30T17:00:00.001-07:00</updated>
  <title>ga:customVarValue3=138 | ga:customVarValue4=Job | ga:week=41</title>
  <link rel='alternate' type='text/html' href='http://www.google.com/analytics'/>
  <dxp:dimension name='ga:customVarValue3' value='138'/>
  <dxp:dimension name='ga:customVarValue4' value='Job'/>
  <dxp:dimension name='ga:week' value='41'/>
  <dxp:metric confidenceInterval='0.0' name='ga:visits' type='integer' value='7'/>
 </entry>
 <entry gd:etag='W/&quot;CUUEQX47eSp7I2A9Wx5bFEU.&quot;' gd:kind='analytics#datarow'>
  <id>http://www.google.com/analytics/feeds/data?ids=ga:35391211&amp;ga:customVarValue3=138&amp;ga:customVarValue4=Profile&amp;ga:week=41&amp;filters=ga:customVarValue4%3D%3DJob,ga:customVarValue4%3D%3DProfile&amp;start-date=2010-10-01&amp;end-date=2010-10-31</id>
  <updated>2010-10-30T17:00:00.001-07:00</updated>
  <title>ga:customVarValue3=138 | ga:customVarValue4=Profile | ga:week=41</title>
  <link rel='alternate' type='text/html' href='http://www.google.com/analytics'/>
  <dxp:dimension name='ga:customVarValue3' value='138'/>
  <dxp:dimension name='ga:customVarValue4' value='Profile'/>
  <dxp:dimension name='ga:week' value='41'/>
  <dxp:metric confidenceInterval='0.0' name='ga:visits' type='integer' value='3'/>
 </entry>
 <entry gd:etag='W/&quot;CUUEQX47eSp7I2A9Wx5bFEU.&quot;' gd:kind='analytics#datarow'>
  <id>http://www.google.com/analytics/feeds/data?ids=ga:35391211&amp;ga:customVarValue3=119&amp;ga:customVarValue4=Job&amp;ga:week=41&amp;filters=ga:customVarValue4%3D%3DJob,ga:customVarValue4%3D%3DProfile&amp;start-date=2010-10-01&amp;end-date=2010-10-31</id>
  <updated>2010-10-30T17:00:00.001-07:00</updated>
  <title>ga:customVarValue3=119 | ga:customVarValue4=Job | ga:week=41</title>
  <link rel='alternate' type='text/html' href='http://www.google.com/analytics'/>
  <dxp:dimension name='ga:customVarValue3' value='119'/>
  <dxp:dimension name='ga:customVarValue4' value='Job'/>
  <dxp:dimension name='ga:week' value='41'/>
  <dxp:metric confidenceInterval='0.0' name='ga:visits' type='integer' value='2'/>
 </entry>
 <entry gd:etag='W/&quot;CUUEQX47eSp7I2A9Wx5bFEU.&quot;' gd:kind='analytics#datarow'>
  <id>http://www.google.com/analytics/feeds/data?ids=ga:35391211&amp;ga:customVarValue3=44&amp;ga:customVarValue4=Job&amp;ga:week=41&amp;filters=ga:customVarValue4%3D%3DJob,ga:customVarValue4%3D%3DProfile&amp;start-date=2010-10-01&amp;end-date=2010-10-31</id>
  <updated>2010-10-30T17:00:00.001-07:00</updated>
  <title>ga:customVarValue3=44 | ga:customVarValue4=Job | ga:week=41</title>
  <link rel='alternate' type='text/html' href='http://www.google.com/analytics'/>
  <dxp:dimension name='ga:customVarValue3' value='44'/>
  <dxp:dimension name='ga:customVarValue4' value='Job'/>
  <dxp:dimension name='ga:week' value='41'/>
  <dxp:metric confidenceInterval='0.0' name='ga:visits' type='integer' value='2'/>
 </entry>
</feed>

The gdata library parses this for you into native nested dictionaries. From there, it's trivial to cache it somewhere like a database for future reporting.