I used twitter4j and R to make a word map of Donald Trump’s tweets, to see what his most used words are. The program downloads 3,000 of his most recent tweets; unfortunately it cannot download extended-mode tweets in full, only the first 140 characters. It wasn’t that interesting in the end.
The released Panama data comes in the form of a Neo4j database, or the files you can build one with, which seems to me a little tricky to do much with. There is no detail beyond the attributes of the different entities, so that limits us to looking at the relationships alone, and it is hard to judge the significance of the relationships without the context… that said, it’s a fun data set to play with.
I decided to draw out some graphs of how things are connected via other things. Below is one of Officers connected to other Officers via *something* else, generated in R using igraph from the Neo4j data set. This produces a few clusters, each containing a relatively small number of nodes connected to others. The query that produces the graph is:

MATCH (n:Officers)-[:`officer of`]->(o)<-[:`officer of`]-(m:Officers) WHERE NOT id(n)=id(m) AND id(n)<id(m) RETURN n.name AS Officer1, m.name AS Officer2, count(o) AS Weight
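The aggregation that query performs, counting the entities each pair of officers shares, can be sketched in plain Python. The officer and entity names below are made up for illustration; the real data comes from the `officer of` relationships in the Neo4j database.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (officer, entity) rows, standing in for the
# `officer of` relationships in the Panama data set.
officer_of = [
    ("Alice", "Shell Co A"),
    ("Bob",   "Shell Co A"),
    ("Carol", "Shell Co A"),
    ("Alice", "Shell Co B"),
    ("Bob",   "Shell Co B"),
]

# Group officers by the entity they are an officer of.
by_entity = {}
for officer, entity in officer_of:
    by_entity.setdefault(entity, set()).add(officer)

# For each unordered officer pair, count how many entities they share.
# Sorting the pair mirrors the query's id(n) < id(m) trick, which
# stops each pair being counted twice.
weights = Counter()
for officers in by_entity.values():
    for a, b in combinations(sorted(officers), 2):
        weights[(a, b)] += 1

print(weights[("Alice", "Bob")])  # Alice and Bob share 2 entities
```

Each weighted pair then becomes an edge in the igraph network, with the weight controlling edge thickness.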
After listening to Untold, the Daniel Morgan podcast, I became really interested in the murder investigation. To help me follow it I started building a network of all the key people, organisations, and events in the case. The networks this produces can be seen here, and you can keep up to date with the progress on the network here.
There is an updated network image here.
The story is a compelling one; I suggest you either listen to the podcast or read the book. Very briefly, it looks into the murder of Daniel Morgan, the subsequent investigations into the murder, and the police handling of it. The book builds a powerful account of decades of struggle by the Morgan family to get justice, and the difficulty they have had in discovering the truth.
The network is not complete; at the time of writing I have only put in the ‘easy’ bits. The network stores objects as the nodes: people, companies, organisations. The lines, or edges, store the relationships between the objects, e.g. Alistair Morgan is ‘brother_of’ Daniel Morgan. The visualisation is produced using Alchemy, and the data is stored in Neo4j. I intend to continue to develop the network, and the visualisation, which needs things like edge labels. Once the network is more complete it would be interesting to see if there is any useful analysis that can be done on it. It would also be interesting to expand the data to include other related cases, such as the Stephen Lawrence murder and the Leveson Inquiry, which will likely form a part of Algorithmic Indexing in the future.
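The node-and-edge model can be sketched as a minimal in-memory structure. This is illustrative only (the real data lives in Neo4j and is rendered with Alchemy), and the node types here are my own labels, not a fixed schema:

```python
# Minimal property graph: nodes keyed by name, edges stored as
# (source, relationship, target) triples.
nodes = {
    "Daniel Morgan":       {"type": "person"},
    "Alistair Morgan":     {"type": "person"},
    "Metropolitan Police": {"type": "organisation"},
}

edges = [
    ("Alistair Morgan", "brother_of", "Daniel Morgan"),
]

def relationships(name):
    """Return every edge that touches the named node."""
    return [e for e in edges if name in (e[0], e[2])]

print(relationships("Daniel Morgan"))
```

Edge labels like ‘brother_of’ are exactly what the visualisation still needs to display.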
Here is a picture of the network in Neo4J:
Another highlight from 33C3 was Julia Reda’s talk about the proposed EU copyright law, Copywrongs 2.0. I say highlight only because it was an interesting and compelling talk; the law itself is an absolute lowlight. To say that the proposed law is not fit for purpose is an understatement, and there is a question as to whether its purpose has less to do with protecting creators and more to do with protecting an industry struggling with an outdated business model.
The reform is a final parting shot by the outgoing EU commissioner Günther Oettinger. His proposed reform to EU copyright threatens freedom of expression by making simple things like linking to content (a central tenet of the internet) a breach of copyright. This is obviously madness.
The proposals seem to be the product of some intensive lobbying by what are often referred to as ‘old media’. Some news publishers, mostly those struggling to adapt their business models to the 21st century, want to charge search engines and social networks for the links displayed in searches or embedded in users’ posts: essentially charging for the traffic sent their way. The other culprit is the music industry, struggling in the world of YouTube. Personally, I don’t want to see the newspaper industry disappear, especially in the world we live in today, but this isn’t the answer.
So what does the proposed law prohibit? As written, sharing small sections of news articles, e.g. on a blog or a personal website (such as this one), without a license from the publisher will be an infringement, for as long as 20 years after the article was originally published. This is crazy: the point of sharing a snippet is to drive traffic to the original story, so the newspaper industry seems to be shooting itself in the foot.
As it stands the EU Commission has not proposed any exceptions based on the size of the snippet, or for individuals, or for non-commercial purposes, and providing a link to the source isn’t enough. This essentially means you have to have a license to reference or attribute a quote. What this means for newspapers quoting each other, or for academic work, I don’t know.
Not only could you not link on social media, it would also seem that indexing the web in general would require licensing, and thus be essentially impossible. In fact, any and every site in existence would have to have ways of filtering out copyright infringements.
What about collaboration? The effect such a law would have on sites that foster collaboration is also not clear, but it is likely to be bad. For example, GitHub would have to put in place filtering technology to search for source code that someone wants kept off the site, even if that code was written under an open source license. Also in trouble would be Wikipedia, and anyone using data from the web for training AI or similar.
So what is Günther Oettinger trying to do? Does he just have no understanding of the internet, and, it would seem, of copyright? He is known to be in favour of big business, and seems to be close to the publishing industry. At best it’s a misguided attempt at protecting an outmoded business model. What happens now is down to people doing a bit of lobbying of our own. Is there any point in Brits getting involved? Yes: for one, there is a chance that the UK will mirror some EU laws, at least initially, and we don’t want this one. Also, we can do our bit to help out our EU neighbours.
One of the most interesting and important projects reported on at 33C3 was the Syrian Archive project. It is an immensely important project that is in part documenting the Syrian conflict, including the human cost, but is also trying to help work towards a lasting peace in Syria. A major component of this work involves the curation of documentary evidence.
This includes evidence gathering and documentation of incidents; the acknowledgement that war crimes and human rights violations have been committed by all sides; the identification of perpetrators to end the cycle of impunity and the development of a process of justice and reconciliation.
The project, which started in 2014, collects data, often in the form of images or video, from citizen journalists on the ground in Syria. The goal is to create an evidence-based tool that can be used by journalists, human rights defenders (HRDs), and lawyers. The collected data is then securely stored on backed-up servers, reducing the potential for loss of evidence. The project also rebuilds metadata for the evidence, which is often lost (particularly if the video is uploaded to social media services, which often strip out the metadata). Metadata is often extremely important for the verification of evidence, as it helps to locate an incident temporally and spatially.
They also work to ensure the integrity of the data, including by producing a cryptographic hash of each file at the point of upload. This means the evidence cannot be tampered with at some later point without detection. All this is done through a range of simple tools. The result is a verifiable, searchable, and secure data repository that is accessible to anyone. The archive also allows evidence to be cross-referenced across multiple sources and multiple platforms, helping to verify the claims.
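The integrity check works like this sketch, which uses SHA-256; the talk did not specify the project’s exact algorithm or tooling, so treat the details as illustrative:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Hex digest recorded at upload time. Any later change to the
    file yields a different digest, so tampering is detectable."""
    return hashlib.sha256(data).hexdigest()

# A stand-in for an uploaded piece of video evidence.
original = b"video evidence bytes"
recorded = fingerprint(original)

# Later verification: recompute the digest and compare it with the
# one recorded at upload.
print(fingerprint(original) == recorded)            # True: untouched
print(fingerprint(b"tampered bytes") == recorded)   # False: altered
```

Note the hash proves the file has not changed since upload; it cannot, on its own, prove the recording was authentic to begin with, which is why the metadata and cross-referencing matter.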
This work is of great value, as in war all sides often seek to hide the full extent of their impact on the civilian population. The database has already proved instrumental in determining the facts around an air strike that wrongly hit a mosque in Syria. Claims and counter-claims cast doubt on the real events, with the Russian ministry of defence claiming that the mosque was still intact, but witnesses claiming it had been destroyed. The data set allowed investigators to verify that a mosque had been hit, and that only the name of the mosque was incorrectly reported, leading to the confusion. Both the actual incident and the claimed incident can be recorded in the database. The archive also allows the use of particular tactics or weapons, such as chemical weapons, to be tracked across multiple events.
The openness is key to this project, and links with some of my own research. We live in a world where different interested parties will make claims and counter claims about news or events. This makes it hard to determine which claim is best supported by the evidence on the ground. What this archive, and others like it, do is allow anyone to make an assessment of the evidence available, perhaps enabling them to understand the events in question better.
The talk was presented by Jeff Deutch and Hadi Al-Khatib; thanks to them for letting me look at the slides again for reference. The videoed talk is linked below.
The people over at The International Consortium of Investigative Journalists have updated the released Panama data. It’s not clear to me whether it is more data than they had already released, or whether this time it is simply a ready-made Neo4j database. They provide two versions of the database, Windows and Mac. It’s easy to get it to work in Linux: just copy the graph.db directory out of the archive into the databases directory of your Neo4j install.
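The Linux install step amounts to a directory copy. Here is a sketch; the `data/databases/graph.db` layout matches Neo4j 3.x installs, and all paths are illustrative, so check them against your own setup:

```python
import pathlib
import shutil
import tempfile

def install_graph_db(extracted_archive, neo4j_home):
    """Copy graph.db out of the extracted archive into Neo4j's
    databases directory (Neo4j 3.x layout; paths illustrative)."""
    src = pathlib.Path(extracted_archive) / "graph.db"
    dest = pathlib.Path(neo4j_home) / "data" / "databases" / "graph.db"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dest)
    return dest

# Demo against throwaway directories so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    archive = pathlib.Path(tmp) / "archive"
    (archive / "graph.db").mkdir(parents=True)
    (archive / "graph.db" / "neostore").write_text("stub")
    home = pathlib.Path(tmp) / "neo4j"
    installed = install_graph_db(archive, home)
    ok = installed.exists() and (installed / "neostore").read_text() == "stub"
print(ok)  # True
```

Stop the Neo4j service before copying, and restart it afterwards so it picks up the new database.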
I made a quick query to look for officers with the same address. It seems there are some; it would need something more sophisticated to dig any deeper.
MATCH (n:Officer)--(a:Address)--(m:Officer) RETURN n, a, m LIMIT 25