Datasets

As a result of our projects and research, we often generate or collect datasets that can be useful for research purposes. We always try to release such datasets with an open license and publish them with a data descriptor, so that other researchers can re-use these data. Here you can find some of the datasets generated by the team.

Recent Dataset Products

Botbusters - Analysis of the 2019 Spanish General Election

This dataset contains data collected from Twitter during the observation period (October 4th, 2019 to November 11th, 2019), including anonymized tweets and user data. It was used to analyze the presence and behavior of political social bots on Twitter in the context of the November 2019 Spanish general election. The users involved were classified as social bots or humans after examining their interactions from both a quantitative perspective (amount of traffic generated and existing relations) and a qualitative perspective (the user's political affinity and sentiment towards the most important parties).

Release Date: March 2020

Related Publications

“Spotting Political Social Bots in Twitter: A Use Case of the 2019 Spanish General Election”

Software accessibility

IEEE DataPort
GitHub

Authors

Javier Pastor-Galindo, Mattia Zago, Pantaleone Nespoli, Sergio López Bernal, Alberto Huertas Celdrán, Manuel Gil Pérez, José A. Ruipérez-Valiente, Gregorio Martínez Pérez, Félix Gómez Mármol.

BEHACOM - A Dataset Modelling Users' Behaviour in Computers

This dataset captures the behaviour of twelve users interacting with their computers over fifty-five consecutive days, without pre-established indications or restrictions. For each user, the BEHACOM dataset contains a set of features that model, in one-minute time windows, the usage of computer resources such as CPU and memory, as well as the activities registered by applications. The data were collected following a privacy-preserving approach.
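To make the idea of one-minute time windows concrete, the following is a minimal sketch of how raw resource samples could be grouped into per-minute features. It is not the BEHACOM collection pipeline: the function name `one_minute_windows`, the input tuple layout, and the output feature names are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def one_minute_windows(samples):
    """Group (timestamp_seconds, cpu_percent, mem_percent) samples by minute.

    Hypothetical helper: the tuple layout and the feature names below are
    illustrative assumptions, not the actual BEHACOM schema.
    """
    buckets = defaultdict(list)
    for ts, cpu, mem in samples:
        # Integer division by 60 maps each timestamp to its minute index.
        buckets[int(ts // 60)].append((cpu, mem))

    return {
        minute: {
            "cpu_mean": mean(c for c, _ in values),
            "mem_mean": mean(m for _, m in values),
            "n_samples": len(values),
        }
        for minute, values in sorted(buckets.items())
    }


samples = [(0, 10.0, 50.0), (30, 30.0, 70.0), (61, 20.0, 40.0)]
windows = one_minute_windows(samples)
```

Here the first two samples fall into minute 0 and the third into minute 1, so each window summarizes only the activity observed within its own sixty seconds.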

Release Date: April 2020

Related Publications

Software accessibility

IEEE DataPort
GitHub

Authors

Pedro M. Sánchez Sánchez, José María Jorquera Valero, Mattia Zago, Alberto Huertas Celdrán, Lorenzo Fernández Maimó, Eduardo López Bernal, Sergio López Bernal, Javier Martínez Valverde, Pantaleone Nespoli, Javier Pastor-Galindo, Ángel Luis Perales Gómez, Manuel Gil Pérez, Gregorio Martínez Pérez

UMUDGA - University of Murcia Domain Generation Algorithm Dataset

This dataset provides a collection of over 30 million manually labeled, algorithmically generated domain names, together with a feature set that is ready to use for machine learning (ML) analysis. It covers 50 selected malware families. Each family is available both as a list of domains, generated by executing the malware's domain generation algorithm (DGA) in a controlled environment with fixed parameters, and as a collection of features extracted from a combination of statistical and natural language processing metrics.
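As an illustration of what statistical metrics over a domain name can look like, here is a minimal sketch that computes a few simple character-level features. It is not the UMUDGA feature set: the function name `domain_features` and the four features chosen (length, Shannon entropy, vowel ratio, digit ratio) are assumptions made only for this example.

```python
import math
from collections import Counter


def domain_features(domain: str) -> dict:
    """Compute simple character statistics for the first label of a domain.

    Hypothetical helper: these features are illustrative and are not taken
    from the UMUDGA feature definitions.
    """
    label = domain.lower().split(".")[0]
    n = len(label)
    if n == 0:
        return {"length": 0, "entropy": 0.0, "vowel_ratio": 0.0, "digit_ratio": 0.0}

    counts = Counter(label)
    # Shannon entropy of the character distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    return {
        "length": n,
        "entropy": entropy,
        "vowel_ratio": sum(ch in "aeiou" for ch in label) / n,
        "digit_ratio": sum(ch.isdigit() for ch in label) / n,
    }


features = domain_features("aabb.com")
```

Algorithmically generated names often have a more uniform character distribution than human-chosen ones, which is why entropy-style metrics are a common starting point for this kind of analysis.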

Release Date: February 2020

Related Publications

“UMUDGA: A Dataset for Profiling Algorithmically Generated Domain Names in Botnet Detection”
“UMUDGA: A Dataset for Profiling DGA-Based Botnet”

Software accessibility

Mendeley Data
GitHub

Authors

Mattia Zago, Manuel Gil Pérez, Gregorio Martínez Pérez

ReCAN - Dataset for Reverse Engineering of Controller Area Networks

This dataset contains data obtained from the Controller Area Network (CAN) buses of two personal vehicles and three commercial trucks, for a total of 36 million data frames. It is composed of two complementary parts: the raw data extracted from the vehicles and the decoded data obtained from the actual sensors' data. Sufficiently motivated actors can intercept, interact with, and recognize vehicle data using consumer-grade technology, once again refuting the security-through-obscurity paradigm that automotive manufacturers use as a primary defensive countermeasure.

Release Date: January 2020

Related Publications

“ReCAN – Dataset for Reverse Engineering of Controller Area Networks”

Software accessibility

Mendeley Data
GitHub

Authors

Mattia Zago, Stefano Longari, Andrea Tricarico, Michele Carminati, Manuel Gil Pérez, Gregorio Martínez Pérez, Stefano Zanero

Contact Us

If you are interested in collaborating or would like to know more about our R&D experience in the cybersecurity and data science fields, please contact us.

CyberDataLab UMU. Facultad de Informática, Campus de Espinardo, s/n, 30100 Murcia (Spain).
