Computing Veracity – the Fourth Challenge of Big Data

Software & Data Downloads

Graph visualisation

Graphyte is a flexible graph visualisation library for investigating the evolution of online dialogues. It is built with an emphasis on customisation and modularity, and includes an Application Programming Interface (API) for creating a pipeline of interconnecting modules.

Social media thread processing

The conversation collection script allows the user to collect the set of tweets replying to a specific tweet, forming a conversation or a thread.
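As an illustration of what forming a thread involves, the sketch below groups already-collected tweets into a conversation by following reply links. It is not the project's collection script: it assumes each tweet is a dictionary with an `id` and an `in_reply_to_status_id` field (as in Twitter's classic JSON format), and `build_thread` is a hypothetical helper name.

```python
from collections import defaultdict

def build_thread(source_id, tweets):
    """Return the ids of all tweets in the conversation rooted at source_id."""
    # Map each parent tweet id to the ids of its direct replies.
    replies = defaultdict(list)
    for tweet in tweets:
        parent = tweet.get("in_reply_to_status_id")
        if parent is not None:
            replies[parent].append(tweet["id"])

    # Walk the reply tree breadth-first, starting from the source tweet.
    thread, queue = [], [source_id]
    while queue:
        current = queue.pop(0)
        thread.append(current)
        queue.extend(replies[current])
    return thread

# Toy data, invented for illustration only.
tweets = [
    {"id": 1, "in_reply_to_status_id": None},
    {"id": 2, "in_reply_to_status_id": 1},
    {"id": 3, "in_reply_to_status_id": 2},
    {"id": 4, "in_reply_to_status_id": 1},
    {"id": 5, "in_reply_to_status_id": 99},  # belongs to another conversation
]
print(build_thread(1, tweets))
```

Replies to replies are included, so the result covers the whole conversation tree rather than only the direct responses to the source tweet.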

Rumour categorisation

An important part of tracking the spread of rumours online is detecting if a message supports, denies, or queries the claim. Our code for this is available on GitHub.

Multilingual preprocessing


Social media processing for English is published in TwitIE, an integral part of GATE.

Entity disambiguation is provided with YODIE.

We’re also making a generic entity recognition package available.


The Bulgarian pipeline is available as a web service. It can handle text directly, or plain text file uploads. To process text directly, call as follows:

curl -X POST --data-binary "Момичето яде сладолед."

To process a file, use:

curl -X POST --data @test.txt
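The same kind of call can be made from Python. The sketch below is only an illustration: the service address is not given on this page, so `SERVICE_URL` is a hypothetical placeholder, and the `Content-Type` header is an assumption about what the service accepts. The helper names `build_request` and `process_text` are likewise invented here.

```python
import urllib.request

# Hypothetical placeholder: the real service URL is not given on this page.
SERVICE_URL = "https://example.invalid/bulgarian-pipeline"

def build_request(text, url=SERVICE_URL):
    """Build a POST request carrying UTF-8 encoded text as the raw body."""
    body = text.encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # Assumed content type; the service may expect something else.
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )

def process_text(text, url=SERVICE_URL):
    """Send the text and return the response body as a string."""
    with urllib.request.urlopen(build_request(text, url)) as response:
        return response.read().decode("utf-8")

# Building the request does not contact the network; process_text() does.
request = build_request("Момичето яде сладолед.")
print(request.get_method(), len(request.data), "bytes")
```

To send a file, read its contents and pass them to `process_text`. The sketch sends the bytes unchanged, as curl's `--data-binary` does, whereas curl's `--data @file` strips newlines from the file before sending.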


Rumour analyses: journalism use case

This is a dataset collected and annotated within the journalism use case. It was created for the analysis of social media rumours and contains Twitter conversations, each initiated by a rumourous tweet and including the tweets that respond to it. The rumours are associated with nine different breaking news stories. The tweets have been annotated for support, certainty, and evidentiality. This dataset is associated with the D2.4 deliverable.

Rumour analyses: medical use case

An NLP algorithm was developed using 2,400 annotated tweets (the training set). The rules created to identify linguistic patterns indicating a positive reference to mephedrone were then tested in GATE on another 2,400 annotated tweets (the gold standard set). The application was then run over the complete dataset of 145,578 tweets retrieved between 2009 and 2014, of which 7,044 were identified as true instances of mephedrone.

PHEME RTE dataset

For the specific purpose of Natural Language Processing-based information verification, we have built a new Recognizing Textual Entailment (RTE) resource from Twitter data. The PHEME RTE dataset is compiled from naturally occurring contradictions in manually labeled claims in tweets related to crisis events and, to our knowledge, is the first resource for 3-way judgement RTE in the social media and verification domain. From about 500 English tweets related to 70 unique claims, we created 5.4k RTE pairs. The RTE pairs are built by a semi-automatic method that is portable across languages and domains, but requires event and claim annotations. The resource, its creation method and pilot RTE evaluation are explained in the following paper:
Piroska Lendvai, Isabelle Augenstein, Kalina Bontcheva, Thierry Declerck (2016). Monolingual Social Media Datasets for Detecting Contradiction and Entailment. Proc. of LREC 2016.

Temporal models of events

Code for Hawkes process models of the intensity of event discussion over time.
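For readers unfamiliar with Hawkes processes, the sketch below shows the conditional intensity in its common textbook form with an exponential kernel: a baseline rate `mu` plus a contribution `alpha * exp(-beta * (t - t_i))` from every earlier event at time `t_i`. The kernel choice, the function name and the parameter values are assumptions made for illustration; the released code may use a different formulation.

```python
import math

def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity at time t given the earlier event times."""
    # Each past event raises the intensity by alpha, decaying at rate beta.
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t
    )
    return mu + excitation

# Toy posting times (arbitrary units), invented for illustration only.
posts = [1.0, 1.5, 4.0]
for t in (0.5, 2.0, 6.0):
    print(t, round(hawkes_intensity(t, posts, mu=0.2, alpha=0.8, beta=1.0), 3))
```

Before any event the intensity equals the baseline. It jumps after a burst of posts and then decays back towards the baseline, which is what makes this family of models a natural fit for bursty discussion.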

Entity recognition

A generic entity recognition toolkit in Python 3, originally designed for named entity recognition but extended to other tasks such as event annotation and timex recognition. This tool relies on Brown clusters and structured prediction, and with default parameters achieved third place in the 2015 W-NUT untyped chunking evaluation.
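Brown clusters assign each word a bit-string path in a binary hierarchy, and a common way to use them in a sequence tagger is to emit prefixes of that path as features, so that words sharing a prefix share a feature. The sketch below illustrates that general idea only. The mapping, the prefix lengths and the `brown_features` helper are hypothetical and are not taken from the toolkit.

```python
# Hypothetical toy mapping from word to Brown cluster bit-string path.
BROWN_PATHS = {
    "london": "110100",
    "paris": "110101",
    "monday": "011010",
}

def brown_features(token, paths=BROWN_PATHS, prefix_lengths=(2, 4, 6)):
    """Return prefix features of the token's Brown cluster path."""
    path = paths.get(token.lower())
    if path is None:
        # Words missing from the clustering get a single fallback feature.
        return ["brown=UNK"]
    # Only emit prefixes no longer than the path, to avoid duplicate features.
    return [f"brown{n}={path[:n]}" for n in prefix_lengths if n <= len(path)]

print(brown_features("London"))
print(brown_features("Paris"))
```

In this toy mapping the two city names share their shorter prefixes and differ only at full length, which lets a model generalise from one to the other.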


Capturean collects social media data from various sources and feeds it forward for processing. It can handle multiple requests from multiple sites: register a filter, or selector, for the data you want, and Capturean handles the rest. It comprises:

  • Capture (Capturean) software (with all necessary modules)
  • message format translator (for emitting Pheme-compatible messages)
  • Kafka monitor
  • MODUL Dashboard adapter

Available on the Pheme GitHub repository: Capturean

Stance detection

WP4 included the production of stance classification software. Our state-of-the-art approach is provided at:
