Ngrams – Computational Analysis of Google Ngrams Data

I recently gave an invited talk for the student linguistics group at York St John University in York. The talk was broadly on the analysis of Ngrams data and was therefore something of a summary of about five papers that I have been involved with over a number of years. It was nice to do a sort of overview of a fairly long project, and I was able to give the students a sense of how a program of research evolves as you get further into the empirical evidence and have to construct and test theories about what you think is going on.

The work is an analysis of Google Ngrams data, in which we attempt to explain the observed changes in word frequencies, including the links between events and the words that we use in our language. We also investigated the explanatory power of the neutral model and how well it fits changing patterns of word frequencies. I have linked the slides below so you can see what was presented, and below them are the references that show the evolution of the work. I think I will do a podcast on this work in the future, as I think it's an interesting story.

The Expression of Emotions in 20th Century Books

A new paper is out (in PLoS One, so free to all), led by Alberto Acerbi (Bristol Uni) and co-authored by Vasileios Lampos (Sheffield Uni), myself (Durham Uni) and R. Alexander Bentley (Bristol Uni). It's a really fun paper looking at the changing pattern in the use of emotion words in the English language during the 20th century. We make use of Google’s Ngram data. Google scanned approximately 4% of all books and generated a dataset of yearly word frequencies. We mined this dataset to extract the changing frequencies of emotion words throughout the 20th century.

In the data we can see the frequency of words expressing emotions such as anger, fear, joy, sadness, and disgust changing in line with historical events. Large social and cultural events like World War II, the roaring 20s and the swinging 60s all show up as changes in word frequencies. Interestingly, World War I doesn’t seem to appear in the data, although the Great Depression in the 1930s does. We also expected, due largely to cultural stereotyping, that US books would be more emotional than UK books. This is supported by the data, but the split occurs much more recently than we thought it might. Generally, throughout the 20th century the frequency of emotion words has been declining, with one exception: fear. Could that be linked to the climate of fear that developed during the latter half of the 20th century?

Figure 2. Decrease in the use of emotion-related words through time.
Difference between z-scores of the six emotions and of a random sample of stems (see Methods) for years from 1900 to 2000 (raw data and smoothed trend). Red: the trend for Fear (raw data and smoothed trend), the emotion with the highest final value. Blue: the trend for Disgust (raw data and smoothed trend), the emotion with the lowest final value. Values are smoothed using Friedman’s ‘super smoother’ through R function supsmu().
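
For anyone who wants a feel for what the figure is plotting, here is a minimal Python sketch of the idea, not the code we used for the paper. It assumes the z-scores are taken across the yearly time series, and it uses a plain centred moving average as a stand-in for Friedman's super smoother (supsmu() is an R function and is not reproduced here). The function names and the toy numbers are made up purely for illustration; the Methods section of the paper is the authority on the real procedure.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise a series: subtract its mean, divide by its standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("series is constant, so z-scores are undefined")
    return [(v - mu) / sigma for v in values]


def emotion_minus_random(emotion_freq, random_freq):
    """Yearly difference between the z-scored emotion series and the
    z-scored random-stem baseline (hypothetical helper, for illustration)."""
    if len(emotion_freq) != len(random_freq):
        raise ValueError("both series must cover the same years")
    z_emotion = z_scores(emotion_freq)
    z_random = z_scores(random_freq)
    return [e - r for e, r in zip(z_emotion, z_random)]


def moving_average(values, window=3):
    """Centred moving average with a shrinking window at the edges.
    A simple stand-in for smoothing, NOT Friedman's super smoother."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed


# Toy data only: invented yearly frequencies, not values from the Ngram corpus.
years = [1900, 1901, 1902, 1903, 1904]
fear_freq = [5.0, 4.0, 3.0, 2.0, 1.0]
random_freq = [1.0, 2.0, 3.0, 4.0, 5.0]

diff = emotion_minus_random(fear_freq, random_freq)
trend = moving_average(diff, window=3)
```

Subtracting the random-stem baseline is what lets a trend be read as a change in emotion words specifically, rather than a drift that affects all words in the corpus.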

The paper has been really well received in the media. Alberto was interviewed by Adam Rutherford for BBC Radio 4's Material World, and Alex and I were interviewed for NPR.


Word Diffusion in Climate Science

Our new data mining and modelling paper is out today: “Word Diffusion in Climate Science”. It investigates the diffusion of climate science words in the Google Ngrams dataset. We make the observation that there is often a disconnect between the findings of science and the impact they have in the public domain. This disconnect is particularly significant when it is important that the science reaches the public. Our hypothesis is that important keywords used in the climate science discourse follow “boom and bust” fashion cycles in public usage. If these cycles are linked to the science leaving the public eye, then perhaps scientists need to think about what they can do to ensure important findings reach as many people as possible.

Durham University press release (including a rather-too-big-for-my-liking picture of me).