15 Aug Peasants don’t blog, but does it matter? (blog by Jacqueline Hicks)
Whenever I see another piece of research using social media data, I must admit I roll my eyes. Social media will only ever capture the voices of the urban young – the type of people who have less difficulty in getting their voices heard than the poor and relatively less educated rural population. The shorthand way I use to express this is “peasants don’t blog” and I fret about the future of the social sciences where funding and focus will increasingly leave out the voices of this already disenfranchised majority.
Having followed this year’s Indonesian presidential elections with much interest, I came across an Indonesian company called PoliticaWave which monitors several kinds of social media – Facebook, Twitter, blogs, forums, comments on news sites and YouTube. Their main customers are probably businesses wanting to know how they are perceived online, but the company has also predicted several local and national elections with amazing accuracy. Three days before the presidential elections, PoliticaWave predicted the Jokowi victory to within 0.8% of the final result. Similarly, a few days before the Jakarta Governor election in September 2012, their prediction was within just over 1% of the result.
How is this possible, given the large demographic bias of those who use social media?
When I looked for others doing election predictions with social media data in other countries, I came across a review article comparing such studies (Metaxas et al., 2011). I was struck by one passage:
the authors are not aware of any publications or claims that, using social media data, someone was able to propose a method that would predict correctly and consistently the results of elections before the elections happened. What has happened, however, is that on several occasions, post processing of social media data has resulted in claims that they might had been able to make correct electoral predictions. 
PoliticaWave looks to have cracked it. They have trained their method on 12 regional elections so far, as well as the April 2014 legislative elections, and although there have been a few misses, their predictions remain very accurate.
They have several measures: “share of awareness”, “share of exposure” and “share of netizen” which are based on things like the number of times a candidate is mentioned and a technique known in natural language processing as sentiment analysis (searching for positive or negative words around a candidate’s name). But it’s unclear to me how they use these measures to obtain their predictions of results – remember, they are not just predicting who will win, but the exact percentage of a win.
It may be that they found one of these measures tends to correspond to the actual final results, but the news stories about their predictions all cite the use of different measures in different elections. It’s a bit of a mystery. I emailed PoliticaWave for clarification but received no answer.
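To make the two ingredients concrete, here is a minimal Python sketch of a mention share and a word-window sentiment score. It is not PoliticaWave's method, which as far as I can tell is unpublished. The candidate names ("alpha", "beta"), the example posts, the tiny word lists and the function names are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, tiny word lists for illustration only;
# a real sentiment lexicon would be far larger and language-specific.
POSITIVE = {"good", "honest", "great", "support"}
NEGATIVE = {"bad", "corrupt", "weak", "liar"}


def tokenize(text):
    """Lowercase a post and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def mention_share(posts, candidates):
    """Return each candidate's fraction of all candidate mentions."""
    counts = Counter()
    for post in posts:
        for token in tokenize(post):
            if token in candidates:
                counts[token] += 1
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in candidates}


def window_sentiment(posts, candidate, window=3):
    """Positive minus negative word count within `window` tokens of each mention."""
    score = 0
    for post in posts:
        tokens = tokenize(post)
        for i, token in enumerate(tokens):
            if token != candidate:
                continue
            nearby = tokens[max(0, i - window): i + window + 1]
            score += sum(w in POSITIVE for w in nearby)
            score -= sum(w in NEGATIVE for w in nearby)
    return score


# Made-up posts about two made-up candidates.
posts = [
    "Alpha is honest and good",
    "I support alpha",
    "beta is corrupt",
    "alpha or beta, who knows",
]
candidates = ["alpha", "beta"]

shares = mention_share(posts, candidates)
for name in candidates:
    print(name, round(shares[name], 2), window_sentiment(posts, name))
```

Even in this toy form the gap is visible: the code yields a share of mentions and a sentiment tally per candidate, but nothing in it says how either number should be turned into a predicted vote percentage.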
To be sure, election prediction is among the least complex things that a social scientist could hope to glean from social media data. Nevertheless, it remains an intriguing proposition that, if the right proxy measure can be found, the demographic bias of social media could be adjusted for in other types of research using such data, so that findings speak for whole populations.
Metaxas, P., Mustafaraj, E. and Gayo-Avello, D. (2011). ‘How (Not) to Predict Elections.’ SocialCom/PASSAT 2011.