18 Dec Blog: The hidden costs of computational approaches
A little more than three years ago Ridho Reinanda joined KITLV as a researcher in the Elite Network Shifts project. Since then he has found that some work is unavoidable when applying computational approaches in the humanities and social sciences. The list below is not exhaustive, and it mainly applies to projects that involve some natural language processing.
My role, as a resident computer scientist, is to develop automated methods for working with a collection of newspaper articles. Although I have worked in various domains before, I had little experience in the humanities or social sciences. Over time, I have observed three things that are important to keep in mind when applying computational approaches in these fields.
The first is data preparation. To apply a method, the data need to be prepared in a particular format. Since we were working with unstructured news articles, we had to clean the articles, identify the segments on each page, and extract titles, leads, and main text. This preparatory step is important because it determines the range of analyses that can be performed, for example whether the analysis can be done at the document level or at the sentence level.
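A minimal sketch of what such a preparation step might look like, under simplifying assumptions that are not taken from the project itself: paragraphs are separated by blank lines, the first paragraph is the title, and the second is the lead. The function name `split_article` and the naive sentence splitter are hypothetical, and real newspaper pages would need far more cleaning.

```python
import re


def split_article(raw_text):
    """Split a cleaned article into title, lead, body paragraphs and sentences.

    Assumption (for illustration only): paragraphs are separated by blank
    lines, the first paragraph is the title and the second is the lead.
    """
    # Break on blank lines and collapse internal whitespace in each paragraph.
    paragraphs = [
        " ".join(p.split())
        for p in re.split(r"\n\s*\n", raw_text)
        if p.strip()
    ]
    title = paragraphs[0] if paragraphs else ""
    lead = paragraphs[1] if len(paragraphs) > 1 else ""
    body = paragraphs[2:]
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [
        s for p in body for s in re.split(r"(?<=[.!?])\s+", p) if s
    ]
    return {"title": title, "lead": lead, "body": body, "sentences": sentences}
```

Keeping both `body` and `sentences` is what allows an analysis to run at either the document level or the sentence level later on.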
The next point deals with annotations. Machine learning can help to automatically identify the key persons or ‘talking heads’ in an article. This requires annotations for the ‘training’ of the computational model, and the quality of the automated method depends on this ‘training data’. In some cases one can work with publicly available data and annotations. We, however, were working on a new task, in Bahasa Indonesia in addition to English, for which such data and annotations were not available. We had to generate a reasonable number of annotations ourselves, which required a substantial amount of manual work. This work is often not foreseen, because most people expect that the computer will learn by itself.
The third, and maybe most interesting, observation deals with the ‘limitations of methods’. The social sciences and humanities strive for 100% accuracy of data, which is understandable if you intend to read all the articles there are. In computational approaches, however, we usually do not work with 100% accuracy. We tend not to see this as a problem, because our samples are so large that 100% accuracy becomes irrelevant. My humanities colleagues started to question how this limited accuracy would affect their analysis.
This last observation in particular showed that there was still a distance between the humanities and computer science. We had to make important decisions about which limitations of the computational methods were acceptable to all of us.
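One common way to make such limitations concrete is to compare the output of an automated method with a hand-annotated sample and report precision and recall. The sketch below is only an illustration of that idea, not the evaluation used in the project; the function name `precision_recall` and the example names are hypothetical.

```python
def precision_recall(predicted, gold):
    """Compare automatically identified key persons with manual annotations.

    precision: share of predicted persons that the annotators also marked.
    recall: share of annotated persons that the method managed to find.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: persons found by the method vs. persons annotated by hand.
predicted_persons = {"Person A", "Person B", "Person C", "Person D"}
annotated_persons = {"Person A", "Person B", "Person C", "Person E", "Person F"}
precision, recall = precision_recall(predicted_persons, annotated_persons)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Numbers like these give both sides something specific to discuss: how many wrongly identified persons, and how many missed ones, an analysis can tolerate.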
In our project we made our choices, and we will present them at upcoming conferences. These observations have left me optimistic about the potential for collaboration between computer scientists and researchers in the humanities and social sciences. Such collaboration can be fruitful for both sides. The Elite Network Shifts project has taught us many lessons.