Ngrams – Computational Analysis of Google Ngrams Data

I recently gave an invited talk for the student linguistics group at York St John University in York. The talk was broadly on the analysis of Ngrams data and was therefore something of a summary of about five papers that I have been involved with over a number of years. It was nice to do a sort of overview of a fairly long project, and I was able to give the students a sense of how a program of research evolves as you get further into the empirical evidence and have to construct and test theories about what it is that you think is going on.

The work is an analysis of Google Ngrams data, in which we attempt to explain the observed changes in word frequencies, including the links between events and the words that we use in our language. We also investigated the explanatory power of the neutral model and how well it fits changing patterns of word frequencies. I have linked the slides below so you can see what was presented, and the references below them show the evolution of the work. I think I will do a podcast on this work in the future, as I think it’s an interesting story.
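For readers who have not met the neutral model, here is a minimal sketch of the idea. It assumes the random-copying formulation common in the cultural evolution literature: in each generation every word token is either copied at random from the previous generation or, with a small probability `mu`, replaced by a newly invented word. The function name, parameters and values are illustrative assumptions, not the code or settings used in the papers.

```python
import random
from collections import Counter


def neutral_model(pop_size, mu, generations, seed=0):
    """Simulate random copying with innovation; return the final word counts."""
    rng = random.Random(seed)
    population = list(range(pop_size))  # start with every token a distinct word
    next_word = pop_size                # id to give the next newly invented word
    for _ in range(generations):
        new_population = []
        for _ in range(pop_size):
            if rng.random() < mu:
                # Innovation: a brand-new word enters the vocabulary.
                new_population.append(next_word)
                next_word += 1
            else:
                # Copying: reuse a word picked uniformly from the last generation.
                new_population.append(rng.choice(population))
        population = new_population
    return Counter(population)


# Illustrative parameter values only.
counts = neutral_model(pop_size=1000, mu=0.01, generations=200)
top_five = counts.most_common(5)
```

Under this model no word is intrinsically better than any other, so any rise or fall in frequency comes from chance copying alone. That makes it a useful baseline to compare real frequency changes against.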

Chip – the Algorithmic Savings App

While aimlessly surfing, I came across an app that offers an algorithmic savings plan. I think that is the best way to describe it, or at least it’s a way to describe it. The point of Chip is that when you sign up you get a savings account, held by Barclays Bank PLC, and the app figures out how much money you could save (and not miss), and when. So every now and then Chip determines, via the magic of algorithms, an amount of money that you could save and not miss too much.

On the default savings rate, the amount seems to be similar to the cost of a large latte and a chocolate bar. Chip then congratulates you on your saving, and you can back out if money is short. If you leave it to its own devices, that sum of money disappears from your nominated current account and reappears in your new savings account.

Am I Using it?

Yes, I have signed up for the app. I thought that in general this isn’t a bad way to save. I have standing orders for saving a modest sum of money every month, but I always thought I could do a little more. What Chip does is allow that to happen in a flexible way: there is no need to commit to a particular amount at the start of each month, and no need to remember to log in to internet banking to do it manually. Chip does it for you, and if you are a bit short one month you can stop the transfer. Great if your income is irregular and saving a fixed sum might be tricky.

Similarly, should you suddenly find you need money, you can easily get the funds out of the savings account. To my mind this makes it a sort of slush fund, which you can dip into should you need to, or are tempted to. I still think longer-term savings are also a good idea: put a little away somewhere harder to get at, and also make sure you have a pension as soon as you can!

Saving Made Entertaining?

Chip makes saving about as entertaining as it probably could be. You get congratulatory memes when you save, and Chip is well chipper, encouraging you along your savings journey. The chipperness might drive some users slightly mad, but I think they got the balance about right. It does seem to work, or at least it does for me: after 103 days using the app I have saved slightly over £200. Which, although not a massive sum, is £200 more than I otherwise would have. I set a goal, rather arbitrarily, of £1500. Weirdly Chip seems to report that I am always about 95 weeks away from my goal… but whatever, I can see the amount saved go up and the amount left go down. That’s progress.
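As a rough check on that 95-week figure, a straight-line projection from the numbers above lands in a similar place. The sketch below assumes the average daily rate so far simply continues, and uses £200 as a round figure for the amount saved. How Chip actually calculates its estimate is not known here, so this is only back-of-envelope arithmetic.

```python
def weeks_to_goal(saved, days_elapsed, goal):
    """Project weeks remaining, assuming the average daily saving rate continues."""
    daily_rate = saved / days_elapsed  # pounds saved per day so far
    remaining = goal - saved           # pounds still to save
    return remaining / daily_rate / 7


weeks = weeks_to_goal(saved=200.0, days_elapsed=103, goal=1500.0)
print(f"{weeks:.1f} weeks")  # prints "95.6 weeks"
```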

What About My Data?

In order to do all this, Chip needs read-only access to your bank account. That is not data that should be handed over lightly. Sure, your bank knows it, but your day-to-day transactions are very personal data. They provide a lot of information about how and where you spend your money, and thus, in a way, who you are. Chip is regulated by the ICO and they encrypt the data.

Chip has a data control licence – you’ll find us on the ICO register – and we always act in full compliance with the Data Protection Act. Your online banking login details are protected using 256-bit encryption and Chip does not store your data.

Chip FAQ

This was the part of the process that made me wince a little. However, if they are going to calculate a savings rate then they need (at least some of) this information. So, if you want in, this is the price you pay. I wanted to have a more detailed look at what data of mine they use and how they use it. So I asked them a few questions, but they are yet to reply…

Donald Trump Twitter Word Map

I used twitter4j and R to make a word map of Donald Trump’s tweets. I thought it would be interesting to see what his most used words are. The program downloads 3000 of his most recent tweets. Unfortunately it cannot download the full text of extended mode tweets, only the first 140 characters. It wasn’t that interesting in the end.

Trump word cloud.
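The counting step behind a word map like this can be sketched in a few lines. The version below is in Python rather than the twitter4j and R setup mentioned above, and it is an illustration rather than the original program. The stop-word list, function name and sample strings are made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "for", "on", "rt"}


def word_frequencies(tweets):
    """Count how often each word appears across a list of tweet strings."""
    counts = Counter()
    for text in tweets:
        text = re.sub(r"https?://\S+", " ", text.lower())  # drop links
        words = re.findall(r"[a-z']+", text)
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 1)
    return counts


# Made-up example strings, not real tweets.
sample = [
    "Great rally tonight, great crowd!",
    "The crowd was great https://example.com/photo",
]
freqs = word_frequencies(sample)
```

The resulting counts are what a word cloud tool would then scale into font sizes.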

Updated Daniel Morgan Network

I have processed more of the Daniel Morgan data, and thus have an updated network of the data. Below is a visualisation produced by extracting the network structure from Neo4J using R and iGraph, then saving the network as a gexf file and importing it into Gephi. The network is more complete and now also has edge labels.

Daniel Morgan murder data
Updated version of the Daniel Morgan data.
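For anyone curious what the gexf step involves, here is a minimal sketch of writing a small labelled network to GEXF using only the Python standard library. It is not the R and iGraph pipeline described above. The function name is hypothetical and the node and edge data are placeholders, not taken from the Daniel Morgan data.

```python
import xml.etree.ElementTree as ET


def to_gexf(nodes, edges):
    """Build a minimal GEXF document string from nodes and labelled edges.

    nodes: dict mapping node id -> display label
    edges: list of (source id, target id, edge label) tuples
    """
    root = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(root, "graph", mode="static", defaultedgetype="directed")
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes.items():
        ET.SubElement(nodes_el, "node", id=str(node_id), label=label)
    edges_el = ET.SubElement(graph, "edges")
    for i, (source, target, label) in enumerate(edges):
        ET.SubElement(edges_el, "edge", id=str(i), source=str(source),
                      target=str(target), label=label)
    return ET.tostring(root, encoding="unicode")


# Placeholder data for illustration only.
xml_text = to_gexf(
    nodes={1: "Person A", 2: "Organisation B"},
    edges=[(1, 2, "WORKED_FOR")],
)
```

Writing `xml_text` to a file with a `.gexf` extension should give something Gephi can open, with the edge labels available for display.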

33c3 – Syrian Archive

One of the most interesting and important projects reported on at 33c3 was the Syrian Archive project. This is an immensely important project that is in part documenting the Syrian conflict, including the human cost, but is also trying to help work towards a lasting peace in Syria. A major component of this work involves the curation of documentary evidence.

This includes evidence gathering and documentation of incidents; the acknowledgement that war crimes and human rights violations have been committed by all sides; the identification of perpetrators to end the cycle of impunity and the development of a process of justice and reconciliation.

syrianarchive.org

The project, which started in 2014, collects data, often in the form of images or video, from citizen journalists on the ground in Syria. The goal is to create an evidence-based tool that can be used by journalists, human rights defenders (HRDs) and lawyers. The collected data is securely stored on backed-up servers, reducing the potential for loss of evidence. The project also builds meta-data for the evidence, which is often lost (particularly if the video is uploaded to social media services, which often strip out the meta-data). Meta-data is often extremely important for the verification of the evidence, as it helps to locate an incident temporally and spatially.

They also work to ensure the integrity of the data, including by producing a hash code of the data at the point of upload. This means that any later tampering with the evidence can be detected. All this is done through a range of simple tools. The result is a verifiable, searchable, and secure data repository that is accessible to anyone. The archive also allows evidence to be cross-referenced across multiple sources and multiple platforms, helping to verify the claims.
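The hashing idea can be illustrated with a short sketch. The talk, as reported here, does not say which hash function the project uses, so SHA-256 is an assumption, and the function names and sample bytes are made up for the example.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, recorded_digest: str) -> bool:
    """Check that the data still matches the digest recorded at upload."""
    return sha256_of(data) == recorded_digest


# Stand-in for the bytes of an uploaded video file.
original = b"example video bytes"
digest_at_upload = sha256_of(original)
```

If even a single byte of the file changes later, the recomputed digest no longer matches the recorded one, which is what makes tampering detectable.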

This work is of great value because, in wars, all sides often seek to hide the full extent of their impact on the civilian population. The database has already proved instrumental in determining the facts around an air strike that wrongly hit a mosque in Syria. Claims and counter claims cast doubt on the real events, with the Russian ministry of defence claiming that the mosque was still intact, but witnesses claiming it had been destroyed. The data set allowed investigators to verify that a mosque had been hit, and that only the name of the mosque had been incorrectly reported, leading to the confusion. The actual incident and the claimed incident can both be recorded in the database. The archive also allows the use of tactics or weapons, such as chemical weapons, to be tracked across multiple events.

The openness is key to this project, and links with some of my own research. We live in a world where different interested parties will make claims and counter claims about news or events. This makes it hard to determine which claim is best supported by the evidence on the ground. What this archive, and others like it, do is allow anyone to make an assessment of the evidence available, perhaps enabling them to understand the events in question better.

The talk was presented by Jeff Deutch and Hadi Al-Khatib. Thanks to them for letting me look at the slides again for reference. The videoed talk is linked below.