Computing Veracity – the Fourth Challenge of Big Data


PHEME builds technology for assessing the veracity of claims made online.

Why is this important?

The London Eye was on fire during the 2011 England riots! Or was it? Social networks are rife with lies and deception, half-truths and facts. But irrespective of a meme’s truthfulness, the rapid spread of such information through social networks and other online media can have immediate and far-reaching consequences. In such cases, large amounts of user-generated content need to be analysed quickly, yet it is not currently possible to carry out such complex analyses in real time.

With partners from seven different countries, the project will combine big data analytics with advanced linguistic and visual methods. The results will be suitable for direct application in medical information systems and digital journalism.

Veracity: The Fourth Challenge of Big Data

Social media poses three major computational challenges, dubbed by Gartner the 3Vs of big data: volume, velocity, and variety.

PHEME will focus on a fourth crucial, but hitherto largely unstudied, challenge: veracity.

We coined the term phemes to describe memes which are enhanced with truthfulness information. The term also refers to Pheme – the Greek goddess of fame and rumours.

Identifying Phemes (Rumorous Memes)

We are concentrating on identifying four types of phemes and modelling their spread across social networks and online media: speculation, controversy, misinformation, and disinformation. However, it is particularly difficult to assess whether a piece of information falls into one of these categories in the context of social media. The quality of the information here is highly dependent on its social context and, up to now, it has proven very challenging to identify and interpret this context automatically.

An Interdisciplinary Approach

PHEME has partners from the fields of natural language processing and text mining, web science, social network analysis, and information visualization. Together, we will use three factors to analyse veracity. The first is the information inherent in a document itself, that is, lexical, semantic and syntactic information. The second is cross-referencing against data sources that are assessed as particularly trustworthy; in the case of medical information, for example, this is PubMed, the biggest online database in the world for original medical publications. We will also harness knowledge from Linked Open Data, through the expertise of Ontotext and their highly scalable OWLIM platform (now GraphDB). The third is the diffusion of a piece of information: who receives what information, how, and when is it transmitted to whom?
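As a rough illustration of the diffusion factor only, the question of who receives a message and when can be modelled as a list of timed hops between accounts. The sketch below is not PHEME's algorithm; the ShareEvent record, its field names and the reached_accounts function are hypothetical names chosen for this example, and the time unit is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShareEvent:
    """One hop of a message: who passed it to whom, and when (hypothetical schema)."""
    sender: str
    receiver: str
    timestamp: float  # assumed unit: seconds since the original post


def reached_accounts(events, origin):
    """Return {account: earliest time the message reached it}, starting at origin.

    Events are processed in time order, and a hop only counts if the sender
    had already received the message when the hop happened.
    """
    reached = {origin: 0.0}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.sender in reached:
            # Keep the earliest arrival time for each receiver.
            reached.setdefault(event.receiver, event.timestamp)
    return reached


events = [
    ShareEvent("a", "b", 10.0),
    ShareEvent("b", "c", 20.0),
    ShareEvent("d", "e", 5.0),   # ignored: "d" did not hold the message yet
    ShareEvent("c", "d", 30.0),
]
spread = reached_accounts(events, origin="a")
```

Here spread maps each account the message reached to its arrival time; account "e" is absent because the only hop towards it happened before its sender had received the message.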

We aim to release many of the veracity intelligence algorithms as open source, building on Sheffield’s GATE text mining platform and Saarland’s expertise in cross-media linking and contradiction detection. This will be complemented by a human-analysed rumour dataset, to be produced by the University of Warwick. Rumour analysis and categorisation will build on the expertise of Prof. Rob Procter from the award-winning Guardian/LSE project on Reading the Riots.

For digital journalists, there will be an open-source dashboard based on Ushahidi’s SwiftRiver media filtering and verification platform.

The results from the automatic models will be presented in an interactive visual analytics dashboard. This will present rumour diffusion patterns, message types (neutral, confirming, denying, or questioning a rumour), geospatial projections of author distribution and sphere of influence, etc. This dashboard will build on the Weblyzard Media Watch platform – a spin-out of MODUL University.
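To make the four message types concrete, the sketch below tallies them for the messages about a single rumour. It assumes each message has already been labelled, whether by a classifier or by a human annotator; the Stance enum and the stance_summary function are hypothetical names for this example and are not part of any PHEME or Weblyzard API.

```python
from collections import Counter
from enum import Enum


class Stance(Enum):
    """The four message types named in the text."""
    NEUTRAL = "neutral"
    CONFIRMING = "confirming"
    DENYING = "denying"
    QUESTIONING = "questioning"


def stance_summary(stances):
    """Return the share of each stance among the messages about one rumour."""
    counts = Counter(stances)
    total = sum(counts.values())
    # Report every stance, including those with no messages.
    return {s: (counts[s] / total if total else 0.0) for s in Stance}


labels = [Stance.CONFIRMING, Stance.DENYING, Stance.DENYING, Stance.QUESTIONING]
summary = stance_summary(labels)
```

A dashboard could plot such shares over time to show, for example, when denials start to outweigh confirmations.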

“Rumour intelligence”, that is, the ability to identify rumours in near real time, is a big data challenge, for which ATOS – Spain will be building the computational platform.

The project results will be tested, inter alia, in the area of medical information systems. More specifically, the Institute of Psychiatry will look at controversies around new recreational drugs in online discussions and then find out how quickly these feature in patients’ medical records and discussions with doctors. For digital journalism, results will be tested with the international service of the Swiss Broadcasting Corporation (SBC) and Ushahidi’s SwiftRiver media filtering and verification platform. The new technology will enable the veracity of user-generated content to be verified – an activity that is largely carried out manually to date, requiring significant resources.

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 611233.

The project started on 1 January 2014 and runs for 36 months.
