PSDE: Big Data Tools for Social Science

In recent decades, social science has become a data science. More powerful statistical tools have enabled researchers to improve our understanding of human interactions by inferring causality in observational studies and by finding hidden patterns, general relationships and unknown correlations between individuals and groups of individuals.

In the emerging field of Big Data, social media and Internet search companies collect, store and analyze large amounts of data containing social relationships and social interactions, with the goal of profiting from that data through techniques such as targeted advertising. Big data from social media is typically unstructured, rapidly changing, and often contradictory, making it difficult to analyze. Although there are no universal analytical tools that allow us to make sense of social media data, large-scale graph-structured computation is central to its analysis. The graph-processing tools used by companies such as Google include Pregel and GraphLab. However, the natural graphs commonly found in the real world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions and limit their performance and scalability. Even if these tools were available to social science researchers, the researchers typically do not have the database skills required to analyze the data and its impact.
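The effect of a skewed degree distribution on a vertex-partitioned computation can be illustrated with a small sketch. It is not taken from Pregel, GraphLab, or this project's tooling: the function names, the preferential-attachment generator, the modulo partitioning scheme, and the graph size are all assumptions chosen for illustration, using only the Python standard library.

```python
import random
from collections import Counter
from statistics import median


def preferential_attachment_edges(n_vertices, edges_per_vertex, seed=0):
    """Grow a graph in which new vertices prefer well-connected ones.

    Hypothetical helper: a simple preferential-attachment process, used
    here only to obtain a graph with a heavy-tailed degree distribution.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each vertex appears in `endpoints` once per incident edge, so a
    # uniform draw selects a vertex with probability proportional to degree.
    endpoints = [0, 1]
    for v in range(2, n_vertices):
        targets = set()
        while len(targets) < min(edges_per_vertex, v):
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((v, t))
            endpoints.extend((v, t))
    return edges


def degree_counts(edges):
    """Return a Counter mapping each vertex to its degree."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


def worker_loads(degrees, n_workers):
    """Assign vertices to workers by id modulo n_workers (an assumption)
    and sum the degrees each worker must handle."""
    loads = [0] * n_workers
    for vertex, degree in degrees.items():
        loads[vertex % n_workers] += degree
    return loads


if __name__ == "__main__":
    edges = preferential_attachment_edges(2000, 2)
    degrees = degree_counts(edges)
    loads = worker_loads(degrees, 16)
    print("median degree:", median(degrees.values()))
    print("max degree:", max(degrees.values()))
    print("max/mean worker load:", max(loads) / (sum(loads) / len(loads)))
```

In this sketch a handful of hub vertices carry far more edges than the typical vertex, so whichever worker receives a hub does more work than its peers even though every worker holds the same number of vertices.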

In this project, we are working on:

  • developing a domain-specific programming language that lets social scientists specify reproducible Big Data experiments on top of our Hadoop Open Platform (Hop);
  • investigating the use of Stratosphere, a data-intensive computational framework for processing large graphs that runs on Hop, and how it can adapt to the properties of natural graphs commonly found in social science. This task is being done in cooperation with Spotify.

The tools produced by this project will enable providers of social media and social scientists to collaborate on new data-driven studies and, ultimately, to better understand collective behavior.



Kamal Hakimzadeh, Hooman Peiro Sajjad, and Jim Dowling, Scaling HDFS with a Strongly Consistent Relational Model for Metadata, The 14th IFIP International Conference on Distributed Applications and Interoperable Systems (DAIS’14).
Fatemeh Rahimian, Amir H. Payberah, Sarunas Girdzijauskas, Mark Jelasity, and Seif Haridi, Ja-Be-Ja: A Distributed Algorithm for Balanced Graph Partitioning, The 7th IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO’13), Best Paper Award.


Salman Niazi