Welcome to SMERSH: Building a New, Better, Election Forecasting Model
Regular PBers will know that five years ago I built VIPA, a model that attempted to look at proportionate swings on a party-by-party basis to model results. This model was – to be quite frank – stolen by Nate Silver, and he used it with much ballyhoo. While VIPA did a better job of predicting the 2010 election than UNS (it was noticeably more pessimistic on the LibDems than UNS, for example), it was not perfect.
And so, 11 months away from the 2015 general election, I’ve created a wholly new model for predicting elections.
This post is the preview: essentially it will describe how Smithson’s Marvellous Election Results Heuristics System (or SMERSH) works. You’ll have to wait for the pieces over the next few weeks to see what SMERSH says about the prospects for the Conservatives, Labour, the LibDems and UKIP.
The basic principle of SMERSH is that UNS doesn’t work well when parties either come from nowhere (like UKIP) or look like they’re going to lose half their support or more (the LibDems). To take the LibDems, you cannot simply subtract 14 percentage points from the party’s share in each constituency in the country, because that would end up with a large number of constituencies where the LibDems have negative vote shares. Something which, I am led to believe, is not actually possible.
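To make the negative-share problem concrete, here is a minimal sketch of a uniform swing applied to four made-up seats. The seat names and starting shares are invented for illustration; only the 14-point drop comes from the text above.

```python
# Hypothetical LibDem shares (percent) in four invented seats.
previous_shares = {"Seat A": 45.0, "Seat B": 23.0, "Seat C": 12.0, "Seat D": 6.0}
NATIONAL_DROP = 14.0  # percentage points, the figure used in the text


def uniform_swing(shares, drop):
    """Subtract the same number of points from every seat (UNS)."""
    return {seat: share - drop for seat, share in shares.items()}


uns_result = uniform_swing(previous_shares, NATIONAL_DROP)
# Any seat that started below the national drop ends up below zero.
negative_seats = [seat for seat, share in uns_result.items() if share < 0]

print(uns_result)
print("Seats with an impossible negative share:", negative_seats)
```

Any seat that started below 14% goes negative, which is exactly where a party that is losing more than half its support has most of its seats.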
SMERSH version 1.0 (which was built to model LibDem losses more than anything else) basically said:
For every vote that a party loses, let us choose completely randomly which seat in the UK (ex-NI) it is lost from – with the important note being that seats with more votes are more likely to have votes taken away than seats with few votes. See this (incredibly simplified) example:
Seat 1: 200 votes
Seat 2: 100 votes
Seat 3: 500 votes
Seat 4: 50 votes
Seat 5: 150 votes
Total: 1,000
Now SMERSH says: Party A has lost 20% of its support (200 votes). So, 200 times we must roll the dice to see where the vote is lost. So, if the first time it says 131, then the vote goes from Seat 1 (which holds votes 1 to 200). The second time it’s 422, and therefore Seat 3 (votes 301 to 800) loses the vote. Etc.
This – in essence – results in a proportionate swing model, which, unfortunately, didn’t match reality as well as I would have liked.
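The dice-rolling step can be sketched in a few lines of Python. This is an illustration of the idea and not the SMERSH code itself: the function names `seat_for_roll` and `remove_votes` are made up, and I have assumed that the weights are recalculated after each vote is removed, which the description above does not spell out.

```python
import random


def seat_for_roll(seat_votes, roll):
    """Map a roll between 1 and the total vote count to the seat holding that vote."""
    total = sum(seat_votes.values())
    if not 1 <= roll <= total:
        raise ValueError("roll must be between 1 and the total number of votes")
    running = 0
    for seat, votes in seat_votes.items():
        running += votes
        if roll <= running:
            return seat


def remove_votes(seat_votes, votes_lost, rng):
    """Remove votes one at a time; bigger seats are proportionally more likely to lose one."""
    remaining = dict(seat_votes)
    for _ in range(votes_lost):
        # Assumption: re-weight on the votes still left after each removal.
        roll = rng.randint(1, sum(remaining.values()))
        remaining[seat_for_roll(remaining, roll)] -= 1
    return remaining


seats = {"Seat 1": 200, "Seat 2": 100, "Seat 3": 500, "Seat 4": 50, "Seat 5": 150}
after = remove_votes(seats, 200, random.Random(2015))
print(after, "total:", sum(after.values()))
```

Because each remaining vote is equally likely to be the one removed, every seat loses roughly 20% of its votes on average, which is why the method collapses to a proportionate swing.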
So, I sat down, and decided that we needed to do seat by seat modelling based around the principles of SMERSH 1.0 to calculate curves that looked (roughly) like this:
The way SMERSH 2.0 worked was to take the data I did have: i.e. vote shares by constituency for 2001, 2005 and 2010, plus local election results in 2011, 2012, 2013 and 2014. It would then programmatically place these, and then attempt to fit a curve around the data points I knew.
This almost worked: it failed completely in places where there had been counter-national swings, because my line-fitting algorithm genuinely believed that in certain places (such as the LibDems in Oxford West and Abingdon) there was an inverse relationship between national and local vote share.
So, now we come to SMERSH 3.0. This is a cruder method: essentially a giant Monte Carlo simulation that assumes the line must always have a positive gradient, and which looks to minimize deviance from known points (i.e. three sets of general election results, the last four years’ local election results, and the 2014 European election results). My model forecasts the absolute number of votes per party on a constituency basis, which does lead to some slightly anomalous turnout results in places where UKIP is strong (and conversely, where UKIP is weak). The end result is forecasts for every party, in every seat in the UK (ex-NI), which: (a) add up to the national share, (b) reflect areas of local strength and weakness, and (c) give a joyously different set of results to traditional models.
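To give a flavour of the positive-gradient constraint, here is a toy random search for a single seat. It is emphatically not the SMERSH 3.0 code: the data points, the sampling ranges, the straight-line form and the name `fit_positive_line` are all assumptions made for the example. It samples candidate lines whose slope is never negative and keeps the one with the smallest squared deviation from the known points.

```python
import random


def fit_positive_line(points, trials, rng):
    """Random search for local = intercept + slope * national, with slope >= 0."""
    best = None
    for _ in range(trials):
        slope = rng.uniform(0.0, 3.0)         # never negative, by construction
        intercept = rng.uniform(-30.0, 30.0)  # arbitrary illustrative range
        error = sum((intercept + slope * nat - loc) ** 2 for nat, loc in points)
        if best is None or error < best[0]:
            best = (error, slope, intercept)
    return best


# Hypothetical (national share %, local share %) pairs with a counter-national
# swing: the local share falls as the national share rises.
points = [(18.0, 40.0), (22.0, 38.0), (23.0, 36.0)]
error, slope, intercept = fit_positive_line(points, 20000, random.Random(1))
print(f"slope={slope:.3f} intercept={intercept:.2f} squared error={error:.2f}")
```

On data like this an unconstrained fit would happily return a negative slope, which is the SMERSH 2.0 failure described above. With the constraint, the search is pushed towards a nearly flat line instead.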
Over the next few weeks I’m going to drip-feed you guys the results of SMERSH 3.0. (And Nate Silver, please don’t steal my idea again. Please.)
I will also – when I’ve cleaned up the code considerably (which may be some time) – release the Python source code of my simulation. (As an aside: this is seriously computationally intensive. On my home computer, the simulation took about 50-60 hours to produce just a few hundred consistent and correct Monte Carlo results. Big kudos to PythonAnywhere, which allowed me to harness the equivalent of 100 servers to get results back before the 2015 elections.)