Computational Social Science 1 (Network)
This article is divided into two parts, each highlighting a core methodological domain within Computational Social Science (CSS). Part I (which you are now reading) explores network analysis as a fundamental pillar of CSS, detailing how relational data and complex network structures have become central to understanding social systems. Part II will shift focus to social simulation, examining how agent-based models and computational experiments help researchers investigate social-scientific questions.
The Ambiguity Between Computational Social Science and Data Science
When we try to define Computational Social Science, a central theme commonly emerges: the application of advanced computational techniques to large-scale datasets in order to understand and analyze human behavior. As scholars like Chris Bail have noted, what sets Computational Social Science apart is not just that it involves analyzing social phenomena, but that it does so with computational methods and extensive data.
Consider the online data science platform Kaggle. A simple keyword search for “social” yields more than 13,000 datasets, each associated with numerous pieces of user-generated code and commentary. This raises an intriguing question: Are these activities and datasets examples of Computational Social Science, or are they simply instances of Data Science applied to social data? The distinction isn’t clear-cut, and here we encounter a fundamental ambiguity. Does choosing a social dataset automatically make the endeavor “Computational Social Science,” or is it still “Data Science,” merely focusing on a particular data type (social data)? The boundaries between these domains are blurred.
But this kind of conceptual fuzziness is not unprecedented. In academia, it is not hard to find fields that resemble each other so closely that their distinctions are more historical and institutional than substantive. Consider the relationship between Statistics and Econometrics. Both deal with data analysis, modeling, and inference, and at a foundational level they share methods and aims. Yet Econometrics emerged with a specific focus on economic and social phenomena, using regression techniques tailored for these domains. Econometrics has thus established itself as a distinct field, though at the introductory level, it can be hard to see why we need a separate name for something that looks so much like Statistics. Adding Machine Learning into this mix makes the landscape even more complex.
By analogy, Computational Social Science and Data Science may share a similar relationship. Is there a clear boundary separating them? Perhaps not. Does it even matter? After all, the contributors on Kaggle—who work on thousands of social datasets—might not think of themselves as doing “Computational Social Science.” They might simply consider themselves data scientists, or they might be indifferent to these labels altogether.
Specialization as a Path to Distinction
If Computational Social Science is akin to Econometrics in the sense that it seeks to carve out its own territory within a broader domain, then where does that specialization lie? Just as Econometrics specialized in regression modeling for economic phenomena, what might Computational Social Science specialize in?
A plausible answer is network analysis. Networks—be they of friendships, communications, collaborations, or online interactions—have expanded the horizons of social science research. They provide structures through which we can study relationships and behaviors at a scale previously unimaginable. The digital revolution, driven by social media and internet technology, has enabled the collection of massive network datasets, sometimes encompassing thousands or even millions of nodes and connections. This revolution has turned what were once small sociograms—simple maps of relationships in small groups—into the study of complex networks representing online communities, infrastructure, financial systems, or interconnected organizations or nations.
A wealth of resources exemplifies this shift. The Stanford Large Network Dataset Collection, for instance, curates extensive network datasets culled from social media platforms, communication channels, and other relational data sources. Within the field of Computational Social Science, network analysis has become a major pillar. A rough indicator: a Google Scholar search for the keyword “computational social science,” restricted to review articles published since 2000, yields about 1,510 results. Of these, roughly 37% also include the term “network analysis,” underscoring the centrality of networks in the computational study of social phenomena. Managing and extracting insights from such vast data requires computationally advanced techniques.
Centrality and Structure: Finding the Key Players and Patterns in the Networks
A hallmark of network analysis is its focus on structure and position. Centrality measures like degree, eigenvector, and betweenness quantify how important or influential a node is, whether by virtue of its direct connections or its role as a bridge between otherwise disconnected parts of the network. Beyond identifying influential nodes, researchers examine transitivity, reciprocity, and community structures to understand how dense clusters form, how social capital circulates, and how hierarchies are established and maintained. Each of these metrics and patterns provides empirical grounding for social theory, connecting observed structural features to concepts like power, cohesion, inequality, and collective action.
Much as in political theory, where identifying and analyzing the role of the center is crucial, the concept of centrality is critical in network analysis. Some actors (nodes) in a network wield greater influence or importance than others, and various metrics capture different nuances of this importance.
- Degree Centrality:
At its most basic, a node’s importance can be gauged by how many connections (edges) it has. The degree centrality of node i is the number of edges attached to i, that is, the sum of row i of the adjacency matrix A.
$$ \mbox{Degree centrality}_i = \sum_j A_{ij}, \qquad A: \mbox{adjacency matrix} $$
- Eigenvector Centrality:
Not all connections are created equal. If your neighbors (alters) are themselves well-connected and influential, your importance (centrality) increases. Eigenvector centrality captures this idea: each node's score is proportional to the sum of its neighbors' scores, so being connected to important alters boosts your own centrality:
$$ \mbox{Eigenvector centrality}_i = \frac{1}{\lambda}\sum_j A_{ij}~\mbox{Eigenvector centrality}_j, \qquad \lambda: \mbox{largest eigenvalue of } A $$
- Betweenness Centrality:
Instead of counting edges, betweenness centrality focuses on paths. Consider any two nodes s and t. There may be multiple shortest paths (geodesic paths) between them. If a particular node i frequently appears on these shortest paths, i effectively acts as a bridge or broker within the network. A high betweenness centrality signifies a strategic position: actor i can control information flow or act as a gatekeeper, a concept closely related to Ronald Burt’s idea of structural holes.
$$ \mbox{Betweenness centrality}_i = \sum_{s,t}\frac{n_{st}^i}{g_{st}}, \qquad n_{st}^i: \mbox{number of geodesic paths from } s \mbox{ to } t \mbox{ that pass through } i, \quad g_{st}: \mbox{total number of geodesic paths from } s \mbox{ to } t $$
Beyond these, there are numerous other metrics—Katz centrality, hub and authority scores, closeness centrality, transitivity, and reciprocity measures—that help researchers identify influential actors, detect communities, and unravel latent hierarchies in social networks.
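The three centrality definitions above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the toy edge list and all function names are hypothetical, eigenvector centrality is approximated by power iteration on A + I (which has the same leading eigenvector as A but avoids oscillation), and betweenness follows Brandes' algorithm with endpoints excluded and each unordered pair counted once.

```python
from collections import deque

# Hypothetical undirected toy network: a chain a-b-c attached to a triangle c-d-e.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "e")]

def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_centrality(adj):
    """Row sums of the adjacency matrix: the number of neighbors of each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}

def eigenvector_centrality(adj, iterations=200):
    """Power iteration on x <- (A + I) x, rescaled so the largest score is 1."""
    x = {node: 1.0 for node in adj}
    for _ in range(iterations):
        new = {node: x[node] + sum(x[n] for n in adj[node]) for node in adj}
        scale = max(new.values())
        x = {node: value / scale for node, value in new.items()}
    return x

def betweenness_centrality(adj):
    """Brandes' algorithm for unweighted graphs (endpoints s and t excluded)."""
    score = {node: 0.0 for node in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {node: 0 for node in adj}
        sigma[s] = 1
        preds = {node: [] for node in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions from the farthest nodes back toward s.
        delta = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair was visited from both ends.
    return {node: value / 2 for node, value in score.items()}

adj = build_adjacency(EDGES)
degree = degree_centrality(adj)
eigen = eigenvector_centrality(adj)
between = betweenness_centrality(adj)
for node in sorted(adj):
    print(node, degree[node], round(eigen[node], 3), between[node])
```

In this toy network, node c scores highest on all three measures, but the measures disagree about the rest: b has the same degree as d and e, yet only b lies on shortest paths between other nodes, while d and e score higher on eigenvector centrality because they sit in the triangle next to c.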
Statistical and Predictive Models of Networks
Statistical models such as Exponential Random Graph Models (ERGMs) explain how and why specific network structures form. By modeling the probability of edge formation based on structural tendencies, node attributes, and exogenous covariates, these approaches move beyond simple mapping. They enable hypothesis testing, causal inference, and forecasting.
Researchers consider the presence or absence of an edge between two nodes as a random variable and attempt to model the probability of an edge’s existence using network-level statistics.
Let Y_ij indicate whether there is an edge from node i to node j:
$$ Y_{ij} = \left \{ \begin{array}{ll} 1 & \mbox{if there is an edge from i to j} \\ 0 & \mbox{otherwise} \end{array} \right. $$
$$ Y = \begin{pmatrix} y_{1,1} & y_{1,2} & \cdots & y_{1,n} \\ y_{2,1} & y_{2,2} & \cdots & y_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n,1} & y_{n,2} & \cdots & y_{n,n} \end{pmatrix} $$
The objective is to characterize the distribution P(Y = y) of a random network Y on a given set of nodes, where y is one particular configuration of edges.
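$$ L(\theta) = P(Y = y \mid \theta) = \frac{\exp(\theta^{\top} g(y))}{\sum_{y' \in \mathcal{Y}} \exp(\theta^{\top} g(y'))}, \qquad \theta: \mbox{parameters}, \quad g(y): \mbox{network statistics}, \quad \mathcal{Y}: \mbox{all possible networks on the same nodes} $$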
The challenge in fitting ERGMs lies in the normalization constant (the denominator), which sums over the entire space of possible networks. That space grows extremely fast: an undirected network on just eight nodes already has 2^28, or roughly 268 million, possible configurations, and a directed one has 2^56. To address this computational complexity, Markov Chain Monte Carlo (MCMC) methods are employed to approximate these distributions and carry out inference.
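To make the size of that denominator concrete, the following sketch counts the possible graphs for a given number of nodes and computes the normalization constant exactly by brute force for a three-node directed network, which has only 64 configurations. It is an illustration under stated assumptions: the parameter values in THETA are hypothetical, the statistics are limited to edge and mutual-pair counts, and all function names are invented for this example.

```python
from itertools import product
from math import exp

def count_graphs(n, directed=False):
    """Number of labeled simple graphs on n nodes: one binary choice per dyad."""
    dyads = n * (n - 1) if directed else n * (n - 1) // 2
    return 2 ** dyads

def edges_and_mutual(y, n):
    """Network statistics g(y) for a directed 0/1 adjacency matrix y."""
    edges = sum(y[i][j] for i in range(n) for j in range(n) if i != j)
    mutual = sum(y[i][j] * y[j][i] for i in range(n) for j in range(i + 1, n))
    return edges, mutual

def all_directed_graphs(n):
    """Yield every directed graph on n nodes as an adjacency matrix."""
    dyads = [(i, j) for i in range(n) for j in range(n) if i != j]
    for bits in product((0, 1), repeat=len(dyads)):
        y = [[0] * n for _ in range(n)]
        for (i, j), bit in zip(dyads, bits):
            y[i][j] = bit
        yield y

def weight(y, theta, n):
    """Unnormalized ERGM weight exp(theta . g(y))."""
    return exp(sum(t * s for t, s in zip(theta, edges_and_mutual(y, n))))

def normalizing_constant(theta, n):
    """Brute-force denominator: sum of the weights of all possible graphs."""
    return sum(weight(y, theta, n) for y in all_directed_graphs(n))

def ergm_probability(y, theta, n):
    """Exact P(Y = y | theta); only feasible for very small n."""
    return weight(y, theta, n) / normalizing_constant(theta, n)

THETA = (-1.0, 2.0)  # hypothetical values: sparse ties, strong reciprocity
N = 3

print(count_graphs(8))                  # undirected graphs on eight nodes
print(count_graphs(8, directed=True))   # directed graphs on eight nodes
Z = normalizing_constant(THETA, N)
empty = [[0] * N for _ in range(N)]
print(Z, ergm_probability(empty, THETA, N))
```

The same enumeration on eight nodes would require visiting every one of the graphs counted by `count_graphs(8)`, which is why practical ERGM software samples networks with MCMC instead of summing over all of them.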
ERGMs model the probability of edges between nodes as a function of various predictors. Two classes of predictors are considered.
- Endogenous network processes: dependencies arising from the network structure itself.
  - Density and degree distribution: the number of edges and how they are distributed across nodes.
  - Reciprocity: the tendency for directed edges to be mutual. If Node i connects to Node j, how likely is Node j to reciprocate?
  - Triadic closure: the tendency for nodes to form triangles. If Node i is connected to Node j and Node k, how likely is it that j and k will connect?
- Exogenous covariates: factors that are not intrinsic to the network's structure but come from outside it, such as node attributes that might influence the probability of a connection. If nodes represent individuals in a social network, exogenous covariates could include demographic attributes, behavioral traits, and social status.
  - Homophily: the tendency of nodes with similar attributes to connect more frequently. In simpler terms, it’s the principle that “birds of a feather flock together.”
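Each predictor in the list above corresponds to a number that can be counted on an observed network. The sketch below computes four such statistics on a small directed network. The ties and the node attribute are hypothetical, the function names are invented for illustration, and the triangle count ignores tie direction for simplicity, so it is only loosely analogous to the terms an ERGM package would provide.

```python
from itertools import combinations

# Hypothetical directed toy network: (sender, receiver) pairs.
TIES = {(1, 2), (2, 1), (2, 3), (3, 1), (1, 3), (4, 3)}
# Hypothetical node attribute used to measure homophily.
GROUP = {1: "x", 2: "x", 3: "y", 4: "y"}

def count_edges(ties):
    """Density baseline: the number of directed ties."""
    return len(ties)

def count_mutual(ties):
    """Reciprocity: pairs of nodes with ties in both directions."""
    # Each mutual pair appears twice in the scan, once per direction.
    return sum(1 for (i, j) in ties if (j, i) in ties) // 2

def count_triangles(ties):
    """Triadic closure: closed triads, ignoring tie direction."""
    linked = {frozenset(t) for t in ties}
    nodes = sorted({n for t in ties for n in t})
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))}.issubset(linked)
    )

def count_matches(ties, attribute):
    """Homophily: ties whose sender and receiver share the attribute value."""
    return sum(1 for (i, j) in ties if attribute[i] == attribute[j])

stats = {
    "edges": count_edges(TIES),
    "mutual": count_mutual(TIES),
    "triangles": count_triangles(TIES),
    "matches": count_matches(TIES, GROUP),
}
print(stats)
```

An ERGM attaches one parameter to each of these counts; a positive parameter on the match count, for example, would indicate that same-group ties are more likely than the other terms alone would predict.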
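The ERGM can be read as a generalization of logistic regression: conditional on the rest of the network, the log-odds of a tie are a linear function of how much each network statistic would change if that tie were added. In practice, when specifying ERGMs in R, the predictors are expressed as an additive combination.
Each predictor corresponds to a network statistic (or a covariate effect), and each is multiplied by its respective parameter:
$$ \mbox{logit}\,P(Y_{ij}=1 \mid \mbox{rest of the network}) = \theta_1~\Delta\mbox{edges} + \theta_2~\Delta\mbox{mutual}, \qquad \Delta: \mbox{change in the statistic when the edge from i to j is added} $$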
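library(ergm)  # provides ergm()
# network_data is assumed to be a directed 'network' object
fit <- ergm(network_data ~ edges + mutual)
# edges : the number of edges (baseline tendency to form ties)
# mutual : the number of reciprocated pairs, a predictor for reciprocity
summary(fit)  # estimated parameters with standard errors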
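The logit expression for this two-term model can be evaluated by hand on a tiny example. In its conditional form, each statistic enters through its change: adding a tie always raises the edge count by one, and raises the mutual count by one only if the reverse tie already exists. The sketch below assumes hypothetical parameter values rather than fitted ones, and the function names are invented for illustration.

```python
from math import exp

def change_statistics(ties, i, j):
    """Change in (edges, mutual) when the tie i -> j goes from absent to present."""
    delta_edges = 1
    delta_mutual = 1 if (j, i) in ties else 0
    return delta_edges, delta_mutual

def tie_probability(ties, i, j, theta):
    """Conditional probability of the tie i -> j given all other ties."""
    deltas = change_statistics(ties, i, j)
    log_odds = sum(t * d for t, d in zip(theta, deltas))
    return 1 / (1 + exp(-log_odds))

THETA = (-2.0, 3.0)  # hypothetical (theta_edges, theta_mutual), not fitted values
TIES = {(2, 1)}      # node 2 has already sent a tie to node 1

# Reciprocating an existing tie: log-odds = -2 + 3 = 1.
p_reciprocating = tie_probability(TIES, 1, 2, THETA)
# Sending a tie that reciprocates nothing: log-odds = -2.
p_unreciprocated = tie_probability(TIES, 1, 3, THETA)
print(round(p_reciprocating, 3), round(p_unreciprocated, 3))
```

With a negative edge parameter and a positive mutual parameter, ties are rare overall but much more likely when they reciprocate an existing tie, which is how a fitted `mutual` coefficient is usually interpreted.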
As Computational Social Science continues to mature, network analysis stands as one of its central pillars. It exemplifies the core ideals of CSS: embracing new data sources and deploying computational methods to tackle complexity. Just as Econometrics carved out a niche within the broader discipline of Statistics by focusing on economic applications and specialized regression techniques, Computational Social Science is seeking to differentiate itself from general Data Science.
In Part II, I will consider another foundational approach within CSS: social simulation. While network analysis has greatly benefited from the emergence of rich relational data and advances in computational techniques, allowing for more data-driven analysis, social simulation follows a fundamentally model-driven approach. In many cases, we may observe macro-level patterns or phenomena, but obtaining the micro-level data that generates these patterns can be extremely challenging. This is where social simulation plays a critical role. By creating computational models that simulate individual behaviors and interactions, researchers can reproduce and study emergent macro-level phenomena. Social simulation thus acts as a bridge between the theoretical understanding of macro patterns and the practical challenges posed by incomplete or absent micro-level data.
References
McLevey et al. 2023. The Sage Handbook of Social Network Analysis. SAGE Publications.
Newman, Mark. 2018. Networks. Oxford University Press.
Easley et al. 2010. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press.
Cranmer et al. 2020. Inferential Network Analysis. Cambridge University Press.