The backend workings of Politwitter have changed and evolved since its launch in April 2009. The APIs that Politwitter uses to aggregate data are always changing, so it's always a bit of work keeping up with them. And with the ever-growing amount of data and the increase in traffic volume, I've had to change things over time to handle the capacity.
Politwitter doesn't load any social media data like tweets and Facebook posts on demand; that would be very slow, and with Twitter it would very quickly hit their rate limits. Instead, all of this content is fetched by background processes and stored locally in the Politwitter database, so front-end page loads never have to wait on an external fetch. This also lets me generate all the fun stats on this data and provide a permanent historical archive of elected officials' social media usage.
At first I used the Twitter REST API to fetch tweets with background processes that ran every couple of minutes. It was always a challenge dealing with Twitter's rate limiting, so in the fall of 2009 I rewrote the backend to use the new Twitter Streaming API Alpha. Since then Politwitter gets 99% of its Twitter data in realtime through the streaming API, and rate limits are no longer an issue. But I still use the REST API for "backup scrapes" to pick up any tweets the streaming API might have missed, and if the streaming process crashes for some reason I can still recover the missed tweets for the archive.
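The key to letting the streaming feed and the REST backup scrapes write to the same archive is deduplicating on the tweet id, so overlapping results never create duplicates. Here is a minimal sketch of that idea in Python; the function names and the in-memory store are illustrative, not Politwitter's actual code (the streaming API delivers tweets as newline-delimited JSON, which is what this parses):

```python
import json

def ingest(lines, seen_ids, archive):
    """Parse newline-delimited JSON tweets, skipping ids already stored."""
    added = 0
    for line in lines:
        line = line.strip()
        if not line:              # streaming APIs send blank keep-alive lines
            continue
        tweet = json.loads(line)
        if tweet["id"] in seen_ids:
            continue              # already captured by the other feed
        seen_ids.add(tweet["id"])
        archive.append(tweet)
        added += 1
    return added

# Both the streaming process and the REST scraper call the same ingest():
seen, archive = set(), []
stream = ['{"id": 1, "text": "first"}', "", '{"id": 2, "text": "second"}']
backup = ['{"id": 2, "text": "second"}', '{"id": 3, "text": "missed"}']
ingest(stream, seen, archive)     # realtime feed
ingest(backup, seen, archive)     # backfill only adds what was missed
```

In the real system the dedupe would be a unique key on the tweet id in the database rather than an in-memory set, but the pattern is the same.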
Here is a short video of the two background streaming processes and one REST scraper process running.
Politwitter displays a lot of content and statistics on every page and has many detailed statistics pages. Since the database has grown so large, most of this data is processed in the background on a periodic basis rather than on demand.
For example, the "Today's Federal Stats" block on the homepage takes a long time to generate and would push page load times past a minute if computed on demand. So a background process runs every 10 minutes, generates this data and stores it in the database. These snapshots are kept over time so trends can be produced and historical stats retrieved. The same applies to a lot of the stats, like follower counts and sentiment.
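The snapshot pattern can be sketched like this. This is a hypothetical illustration using SQLite (the site runs MySQL); the table and column names are made up, not Politwitter's actual schema. The point is that the expensive aggregate runs once in the background, and the homepage only ever reads the latest pre-computed row:

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, party TEXT)")
db.execute("""CREATE TABLE stats_snapshot
              (taken_at REAL, party TEXT, tweet_count INTEGER)""")
db.executemany("INSERT INTO tweets (party) VALUES (?)",
               [("Liberal",), ("NDP",), ("Liberal",)])

def take_snapshot(db):
    """Background job: run the expensive aggregate once, store the result."""
    now = time.time()
    rows = db.execute(
        "SELECT party, COUNT(*) FROM tweets GROUP BY party").fetchall()
    db.executemany("INSERT INTO stats_snapshot VALUES (?, ?, ?)",
                   [(now, party, n) for party, n in rows])

def latest_stats(db):
    """What the homepage block reads: just the newest snapshot."""
    return db.execute("""SELECT party, tweet_count FROM stats_snapshot
                         WHERE taken_at = (SELECT MAX(taken_at)
                                           FROM stats_snapshot)
                         ORDER BY party""").fetchall()

take_snapshot(db)   # in production this runs every 10 minutes via cron
```

Because every snapshot is timestamped and kept, the same table also serves the historical trend charts.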
Content that isn't generated in the background is still almost always cached; there are over 20,000 cache files stored. Almost all of the "side blocks" are cached for 5-10 minutes, and some pages, like the statistics pages, are fully cached.
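A file cache with a per-entry lifetime is simple to sketch. This is a minimal illustration of the general technique, not Politwitter's actual caching code; the key naming, directory, and 300-second default are assumptions:

```python
import os
import tempfile
import time

CACHE_DIR = tempfile.mkdtemp()          # stand-in for the real cache directory

def cache_path(key):
    return os.path.join(CACHE_DIR, key + ".cache")

def cache_set(key, html):
    """Write a rendered fragment (e.g. a side block) to its cache file."""
    with open(cache_path(key), "w") as f:
        f.write(html)

def cache_get(key, max_age=300):
    """Return the cached fragment, or None if missing or too old."""
    path = cache_path(key)
    try:
        if time.time() - os.path.getmtime(path) > max_age:
            return None                 # stale: caller regenerates the block
        with open(path) as f:
            return f.read()
    except OSError:
        return None                     # never cached

cache_set("top_mps_block", "<ul>...</ul>")
fresh = cache_get("top_mps_block")      # served straight from the cache file
# Backdate the file to simulate an entry older than its lifetime:
old = time.time() - 600
os.utime(cache_path("top_mps_block"), (old, old))
stale = cache_get("top_mps_block")      # stale entry, so the caller regenerates
```

Using the file's mtime as the freshness check means expiry needs no extra bookkeeping: a fragment is simply regenerated the first time it is requested after its lifetime has passed.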
All of the other social media content besides Twitter is fetched on a periodic basis, as those services don't have an equivalent streaming API; Facebook posts, for example, are fetched every hour. There are a lot of these background processes running 24/7 to keep Politwitter updated, generate stats and keep page load times to a minimum.
I have a backend page that monitors how long these various background processes take to run, so I can watch for potential problems.
Even with all of this background pre-processing and caching, there are still many pages on Politwitter that are slow to load, and the site can slow down under high traffic. I have done a lot of database query optimization, but the tweets database has grown so large that it's a constant challenge.
The tweets database is now over 5.3 million tweets!
The major issue I have with the database is table locking. The main tweets MySQL table uses the MyISAM storage engine, which makes most reads fast and writes very fast. But MyISAM only supports table-level locking: one long-running query locks the entire table, so a backlog of queries can build up behind it. If a page is taking a long time to load on Politwitter, this is most likely the reason.
In the winter of 2011 I greatly improved this situation by adding a second, staging tweets table that the streaming API writes to; a second process then moves tweets from the staging table into the main tweets table. This lets a high volume of incoming tweets arrive without slowing down the site, and separating the writes from the reads greatly reduces the table locking. It also means the streaming API won't drop tweets because a write was blocked by a table lock. During the Royal Wedding I tested this at over 3,000 tweets a minute.
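The staging-table pattern can be sketched as follows. This uses SQLite so the example is runnable; the real system is MySQL/MyISAM, and the table names here are illustrative. The fast path only ever inserts into the small staging table, while a second process batch-moves rows into the main table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets_staging (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")

def stream_write(db, tweet_id, body):
    """Fast path: the streaming process inserts into staging only.
    The small staging table is never hit by slow read queries, so this
    insert is never blocked by a lock on the big main table."""
    db.execute("INSERT INTO tweets_staging VALUES (?, ?)", (tweet_id, body))

def drain_staging(db):
    """Second process: batch-move all staged rows into the main table."""
    db.execute("INSERT OR IGNORE INTO tweets SELECT * FROM tweets_staging")
    db.execute("DELETE FROM tweets_staging")
    db.commit()

for i, text in enumerate(["a", "b", "c"], start=1):
    stream_write(db, i, text)
drain_staging(db)   # runs periodically in the background
```

The batch move touches the main table in one short burst instead of one lock acquisition per incoming tweet, which is what keeps a tweet spike from stalling the rest of the site.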
But since the main tweets table has grown so massive, some of the read queries take longer to execute and table locking is often an issue, especially when bots from Google or Bing crawl the site and hit odd URLs that trigger complex queries. I've tried to reduce some of these queries and block bots from those pages, but table locking is still my evil nemesis.
I have investigated converting the tweets table to the InnoDB engine, which has row-level locking and would potentially resolve the table locking issue, but it's not that simple. The overall performance of InnoDB compared to MyISAM in this use case is slower, and many of the queries I use are VERY slow in InnoDB. Queries using functions like COUNT() are deadly slow in InnoDB, which, unlike MyISAM, doesn't keep a cached row count, and many of the index optimizations I've made for MyISAM work differently in InnoDB.
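One common workaround for slow COUNT() queries on a big InnoDB table (not something the post says Politwitter does, just a technique worth naming) is a small summary table that is incremented on every insert, so the count becomes a single-row lookup. A minimal sketch in SQLite, with illustrative names; the MySQL equivalent of the upsert is INSERT ... ON DUPLICATE KEY UPDATE:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, user TEXT)")
db.execute("CREATE TABLE tweet_counts (user TEXT PRIMARY KEY, n INTEGER)")

def add_tweet(db, tweet_id, user):
    """Insert the tweet and bump the per-user counter in one step."""
    db.execute("INSERT INTO tweets VALUES (?, ?)", (tweet_id, user))
    db.execute("""INSERT INTO tweet_counts VALUES (?, 1)
                  ON CONFLICT(user) DO UPDATE SET n = n + 1""", (user,))

def tweet_count(db, user):
    """Single-row lookup instead of scanning the tweets table."""
    row = db.execute("SELECT n FROM tweet_counts WHERE user = ?",
                     (user,)).fetchone()
    return row[0] if row else 0

add_tweet(db, 1, "mp_a")
add_tweet(db, 2, "mp_a")
add_tweet(db, 3, "mp_b")
```

The trade-off is an extra write per insert and the need to keep the counter in sync, which is why converting means auditing every query rather than just flipping the storage engine.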
I'm planning on taking another look at converting to InnoDB, which basically means checking every query on the site and rewriting or finding alternatives for many of them. This is very time consuming, but it's something that will probably have to happen as the database and the site's traffic continue to grow. The site now averages over 2,000 pageviews per day.
I have considered other database solutions like MongoDB, but that would require major reworking of the Politwitter code base. If InnoDB turns out not to be viable, this may be the next step.
If you want to help with the continued development and maintenance of Politwitter, you can donate here.