An application of the Exponential Random Graph Model on the UK Faculty Dataset

“Fuzzy communities and the concept of bridgeness in complex networks” is a research paper published in 2007 by Tamás Nepusz, Andrea Petróczi, László Négyessy, and Ferenc Bazsó. The paper describes a method for identifying “fuzzy communities” in complex networks, which are groups of nodes that are densely connected to each other but have relatively few connections to nodes outside the group.

The paper introduces the concept of “bridgeness,” which measures how strongly a node is shared among communities and therefore how much it connects different parts of the network. In our social networks class, the closest concept from the centrality exercises is betweenness, although the two measures are not identical. Nodes with high bridgeness are more likely to belong to multiple communities and to play a role in connecting different parts of the network.

This technique was applied to a personal friendship network (UK Faculty), ignoring the directionality and the weight of the edges. The network was built from tie strength: faculty members were asked to fill out a questionnaire intended to measure the strength of their ties to one another.

A fuzzy community detection for three communities was performed on the graph. The dataset also contained explicit information about the expected community structure, since the authors knew which school in the Faculty each node belonged to. The authors therefore defuzzified the results by assigning every vertex to its dominant community. This defuzzification revealed that each crisp community consisted almost exclusively of members of a single school inside the Faculty. Further, the degree-corrected bridgeness scores showed that the highest-scoring individuals belonged to all three communities at the same time.
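The defuzzification step can be sketched in base R. The membership matrix below is hypothetical, and the bridgeness formula is written from memory of the paper's definition (one minus the rescaled distance of a vertex's membership vector from the uniform vector), so treat it as an assumption rather than the authors' exact implementation. The degree correction is not shown.

```r
# Hypothetical fuzzy membership matrix: rows are vertices, columns are
# communities, and each row sums to 1.
U <- rbind(
  c(1.00, 0.00, 0.00),   # belongs entirely to community 1
  c(0.70, 0.20, 0.10),   # mostly community 1
  c(1/3,  1/3,  1/3)     # shared equally among all three
)

# Defuzzification: assign each vertex to its dominant community.
dominant <- apply(U, 1, which.max)

# Bridgeness (assumed form): 0 for a vertex in a single community,
# 1 for a vertex shared equally among all communities.
bridgeness <- function(u) {
  k <- length(u)
  1 - sqrt(k / (k - 1) * sum((u - 1 / k)^2))
}
b <- apply(U, 1, bridgeness)

dominant
round(b, 3)
```

Under this form, the third vertex gets the highest score, which matches the idea that high-bridgeness individuals belong to all communities at once.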

In this project, we use ERGMs. Why ERGMs? Because they:

model the effects of node attributes (here, department) on the formation of links between people, and hence on the network as a whole.
also account for endogenous effects, such as edge density and clustering, through which the existing structure of the network can influence edge formation. ERGMs thus provide a unified framework for modeling both endogenous and exogenous effects on the formation of connections between nodes.

We fit the ERGM to the friendship network of a faculty of a UK university. The network statistics of interest are (1) the total number of edges and (2) the number of edges between people from the same department. The estimated model helps us understand how connections between faculty members form and what structures they produce in the friendship network. We answer the question: Are faculty members from the same department more likely to be friends?

H1: Faculty members coming from the same department are more likely to be friends.
H2: Faculty members coming from the same department, specifically department “X” are more likely to be friends.

H2.a: Faculty members coming from department 1 are more likely to be friends, or department 1 is a friendly department.
H2.b: Faculty members coming from department 2 are more likely to be friends, or department 2 is a friendly department.
H2.c: Faculty members coming from department 3 are more likely to be friends, or department 3 is a friendly department.
H2.d: Faculty members coming from department 4 are more likely to be friends, or department 4 is a friendly department.

Data: The personal friendship network of a faculty of a UK university, consisting of 81 vertices (individuals) and 817 directed and weighted connections. The school affiliation of each individual is stored as a vertex attribute. This data is distributed as the UKfaculty dataset in igraphdata, the companion data package to igraph.

In the Nodes dataset, there are 81 rows and 3 columns. The three attributes are: the ID, department and the degree.

Attributes	Nodes dataset
ID
department
degree
Rows	81
Columns	3

When we explored the network with centrality measures, the plots consistently showed three main components and one small component, with almost all arrows pointing to one of three nodes (even though the dataset reports directed = false).

Network statistics	Nodes
network_density	0.0125
network_reciprocity	0
network_transitivity	0
network_components	81

Network statistics	Edges
network_density	0.126
network_reciprocity	0.588
network_transitivity	0.473
network_components	2
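As a quick check on the edge-level density, the figure can be reproduced by hand from the counts given in the data description, assuming the network is treated as directed so that each ordered pair of distinct faculty members is one possible tie.

```r
n_nodes <- 81
n_edges <- 817

# Number of ordered pairs of distinct nodes (possible directed ties).
n_dyads <- n_nodes * (n_nodes - 1)

# Density is the share of possible ties that are present.
density <- n_edges / n_dyads
round(density, 3)
```

The same count of possible ties, 6480, appears as the null-deviance degrees of freedom in the model summaries.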

Attributes	Edges dataset
source
target
degree
Rows	817
Columns	2

## rename columns 
## turn the node and edge tables into a network object
colnames(edges) <- c('head', 'tail')

The Edges dataset has 817 rows and 2 columns. In the Edges dataset, we refer to the ‘from’ (source) column as ‘head’ and to the ‘to’ (target) column as ‘tail’.

## [1] head tail
## <0 rows> (or 0-length row.names)

# Fit data into the ERGM
# model <- ergm(G ~ edges + nodematch('dept'))
model <- ergm(G ~ edges + nodematch('dept', diff = T))

MODEL 1: model <- ergm(G ~ edges + nodematch('dept', diff = T))

The ERGM applied to this dataset consists of two terms: (1) the total number of edges (edges) and (2) the number of edges between nodes from the same department [nodematch("dept", diff = T)]. The nodematch term counts the edges whose two endpoints have the same value of the categorical attribute “dept”. With diff = T, the model estimates a separate coefficient for each department rather than a single common one, so it can differentiate between departments.
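To make the two statistics concrete, here is a minimal base R sketch that counts them on a toy directed network. The adjacency matrix and department labels are hypothetical and are not taken from the UK Faculty data; the counts are computed by hand rather than with the ergm package.

```r
# Toy directed network: adj[i, j] = 1 means node i names node j as a friend.
adj <- matrix(0, nrow = 5, ncol = 5)
edge_list <- rbind(c(1, 2), c(2, 1), c(1, 3), c(4, 5), c(3, 4))
adj[edge_list] <- 1
dept <- c(1, 1, 2, 2, 2)   # hypothetical department of each node

# 'edges' statistic: total number of directed ties.
stat_edges <- sum(adj)

# same_dept[i, j] is TRUE when nodes i and j share a department.
same_dept <- outer(dept, dept, "==")

# Uniform homophily count: ties whose endpoints share a department.
stat_match <- sum(adj * same_dept)

# Differential homophily count (diff = T): one count per department.
depts <- sort(unique(dept))
stat_match_by_dept <- sapply(depts, function(d) sum(adj[dept == d, dept == d]))
names(stat_match_by_dept) <- paste0("nodematch.dept.", depts)

stat_edges
stat_match
stat_match_by_dept
```

In this toy network, four of the five ties stay inside a department, two in each department.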

The estimated coefficients for each term can be interpreted as follows. The (conditional) log-odds of forming an edge between nodes from different departments (edges) is negative and significant. The department-specific coefficients show how the log-odds change when both nodes belong to the same department. Those for departments 1 and 2 are close to zero and insignificant, while the coefficient for department 3 is negative and significant. The estimate for department 4 is -Inf, which indicates that no edges within department 4 were observed; its reported p-value is a by-product of the infinite estimate and should not be read as an ordinary significance test.

edges: -1.89721: The baseline log-odds of an edge between two faculty members from different departments is -1.89721, which corresponds to a probability of about 0.13. This is significant at 1%. The negative coefficient reflects the sparsity of the network: most possible ties are absent.
nodematch.dept.1: -0.07493: For a pair of nodes that are both in department 1, the log-odds of an edge are 0.07493 lower than for a pair from different departments. This is not significant (p = 0.47).
nodematch.dept.2: 0.01799: For a pair of nodes that are both in department 2, the log-odds of an edge are 0.01799 higher than for a pair from different departments. This is not significant (p = 0.88).
nodematch.dept.3: -0.68679: For a pair of nodes that are both in department 3, the log-odds of an edge are 0.68679 lower than for a pair from different departments. This is significant at 1% (p = 0.0015).
nodematch.dept.4: -Inf: No edges were observed between members of department 4, so the estimated log-odds of such an edge is negative infinity. The reported significance is a by-product of the infinite estimate.
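The reading of these coefficients as log-odds, and the origin of an infinite estimate, can be illustrated with a small base R sketch. It assumes a model with only an edges term and department-specific nodematch terms, for which the fitted tie probabilities equal the observed proportions in each class of dyads. The counts are hypothetical and are not the UK Faculty counts.

```r
# Hypothetical dyad counts: between-department pairs and pairs inside
# each of three departments.
dyads <- data.frame(
  class   = c("between", "within.1", "within.2", "within.3"),
  n_dyads = c(1000, 200, 150, 20),
  n_edges = c(130, 24, 20, 0)
)

# Observed tie probability in each class.
dyads$p <- dyads$n_edges / dyads$n_dyads

# 'edges' is the between-department log-odds; each nodematch coefficient
# is the difference in log-odds from that baseline.
coef_edges <- qlogis(dyads$p[dyads$class == "between"])
coef_match <- qlogis(dyads$p[-1]) - coef_edges
names(coef_match) <- dyads$class[-1]

coef_edges
coef_match   # within.3 is -Inf because that class has no observed ties
```

A class with no observed ties has an observed probability of zero, and the log-odds of zero is negative infinity, which is the situation reported for department 4.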

summary(model)

## Call:
## ergm(formula = G ~ edges + nodematch("dept", diff = T))
## 
## Maximum Likelihood Results:
## 
##                  Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges            -1.89721    0.04488      0 -42.276   <1e-04 ***
## nodematch.dept.1 -0.07493    0.10413      0  -0.720   0.4718    
## nodematch.dept.2  0.01799    0.12003      0   0.150   0.8809    
## nodematch.dept.3 -0.68679    0.21631      0  -3.175   0.0015 ** 
## nodematch.dept.4     -Inf    0.00000      0    -Inf   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 8980  on 6480  degrees of freedom
##  Residual Deviance: 4897  on 6475  degrees of freedom
##  
## AIC: 4905  BIC: 4932  (Smaller is better. MC Std. Err. = 0)
## 
##  Warning: The following terms have infinite coefficient estimates:
##   nodematch.dept.4

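# GOF object for Model 1, built from the four statistics described in the text
mod.fit <- gof(model, GOF = ~ distance + espartners + idegree + odegree)
par(mfrow=c(2,2))
plot(mod.fit, main="GOF - Model 1")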

In the first plot, which shows the model statistics against the proportion of each statistic, ‘nodematch.dept.4’ is a clear outlier.

In the GOF plot where the x-axis represents the out-degree (i.e., the number of edges that originate from a node) and the y-axis represents the proportion of nodes, the line underestimates the points, so the fit is poor.

In the GOF plot where the x-axis represents the in-degree (i.e., the number of edges that point to a node) and the y-axis represents the proportion of nodes, the line underestimates some points and overestimates others, so the fit is somewhat poor.

In the GOF plot where the x-axis represents the number of edge-wise shared partners (i.e., the number of other nodes tied to both endpoints of an edge) and the y-axis represents the proportion of edges, the line overestimates the points, so the fit is poor.

In the GOF plot where the x-axis represents the minimum geodesic distance (i.e., the length of the shortest path between two nodes) and the y-axis represents the proportion of dyads, one outlier weakens an otherwise reasonable fit.

In Model 2, we add MCMLE.maxit and MCMC.burnin parameters.

MODEL 2: model1 <- ergm(G ~ edges + nodematch('dept'), control = control.ergm(MCMC.burnin = 1024*75, MCMLE.maxit = 40))

summary(model1)

## Call:
## ergm(formula = G ~ edges + nodematch("dept"), control = control.ergm(MCMC.burnin = 1024 * 
##     75, MCMLE.maxit = 40))
## 
## Maximum Likelihood Results:
## 
##                Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges          -1.89721    0.04488      0  -42.28   <1e-04 ***
## nodematch.dept -0.12364    0.08135      0   -1.52    0.129    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 8983  on 6480  degrees of freedom
##  Residual Deviance: 4908  on 6478  degrees of freedom
##  
## AIC: 4912  BIC: 4925  (Smaller is better. MC Std. Err. = 0)

The estimated coefficients for each term can be interpreted as follows:

The (conditional) log-odds of forming an edge between nodes from different departments (edges) is negative and significant. The coefficient for “Same Department” (nodematch.dept, -0.12364) is also negative, which would mean that coming from the same department slightly lowers the odds of a tie, but it is not significant (p = 0.129). Model 2 therefore gives no support for H1.

The control argument of the ergm function specifies a set of control parameters for the model fitting process. In this case, the control.ergm function is used to set two control parameters: MCMC.burnin and MCMLE.maxit.

MCMC.burnin: This parameter specifies the number of burn-in iterations used in the Markov Chain Monte Carlo (MCMC) algorithm. Burn-in iterations are initial proposals that are discarded before the algorithm starts recording sampled networks. Their purpose is to let the chain “warm up” and reach the distribution of networks implied by the current parameter values before samples are drawn from it.

MCMLE.maxit: This parameter specifies the maximum number of iterations of the Monte Carlo maximum likelihood estimation (MCMLE) algorithm. Maximum likelihood estimation is a common method for estimating the parameters of a statistical model from data. In the context of an ERGM, it is used to find the parameter values that best fit the observed network. Both models here contain only dyad-independent terms, which is why the summaries report an MCMC % of 0: the estimates do not rely on MCMC sampling, so these two settings are unlikely to have affected the results.

‘mod.fit’ stores the goodness-of-fit (GOF) results for Model 1, which compare the observed network with networks simulated from the fitted model. The GOF results are based on four statistics: distance, espartners, idegree, and odegree.

The gof function in the ergm package computes goodness-of-fit measures for an ERGM. Its GOF argument takes a one-sided formula, introduced by the ~ operator, that lists the statistics to compare. In this case, the formula consists of four terms: distance (minimum geodesic distance), espartners (edge-wise shared partners), idegree (in-degree), and odegree (out-degree).

par(mfrow=c(2,2))
plot(mod.fit1, main="GOF - Model 2")

par(mfrow=c(1,1))

In the first plot, which shows the model statistics against the proportion of each statistic, the points fall within the plotted range.

The ergm1mle object stores the log-likelihood of the fitted ERGM, evaluated at the maximum likelihood estimate (MLE).

# Log-likelihood
ergm1mle <- model$mle.lik

ergm1mle

## 'log Lik.' -2448.557 (df=4)

‘log Lik.’ -2448.557: This is the log-likelihood of the model evaluated at the maximum likelihood estimate, that is, at the parameter values that make the observed network most probable. df=4: The degrees of freedom count the parameters estimated in the model. Here there are 4, presumably because the infinite coefficient for nodematch.dept.4 is not counted as an estimated parameter. A log-likelihood of -2453.889 with 2 degrees of freedom appears to belong to Model 2, which is consistent with its residual deviance of 4908. The log-likelihood and the degrees of freedom can be used to assess the fit of the model and to compare models: a higher (less negative) log-likelihood indicates a better fit, while additional parameters are penalized by criteria such as AIC and BIC.
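The link between the log-likelihood, the degrees of freedom, and the AIC and BIC in the summaries can be worked through by hand. This sketch assumes that -2453.889 with 2 parameters is the log-likelihood of Model 2, and that the sample size used for BIC is the number of directed dyads.

```r
# Log-likelihoods and parameter counts as quoted in the text.
# Assumption: -2453.889 (2 parameters) belongs to Model 2.
loglik <- c(model1 = -2448.557, model2 = -2453.889)
n_par  <- c(model1 = 4, model2 = 2)
n_dyads <- 81 * 80   # directed dyads among 81 faculty members

# AIC penalizes each parameter by 2, BIC by log(sample size).
aic <- -2 * loglik + 2 * n_par
bic <- -2 * loglik + log(n_dyads) * n_par

round(rbind(AIC = aic, BIC = bic))
```

Under these assumptions the rounded values agree with the summaries: Model 1 has the lower AIC, while Model 2 has the lower BIC because BIC penalizes the two extra parameters more heavily.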

## Untransformed
model$coefficients

##            edges nodematch.dept.1 nodematch.dept.2 nodematch.dept.3 
##      -1.89720754      -0.07493361       0.01798877      -0.68678961 
## nodematch.dept.4 
##             -Inf

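The expression exp(model$coefficients) / (1 + exp(model$coefficients)) applies the inverse logit (logistic) transformation to the model coefficients, which maps each log-odds value to the probability scale and makes the magnitudes easier to compare. Only the edges value can be read directly as a probability, namely that of an edge between members of different departments. A nodematch coefficient is a change in log-odds, so it has to be added to the edges coefficient before transforming to obtain a within-department probability.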

## Transformed
exp(model$coefficients) / (1 + exp(model$coefficients))

##            edges nodematch.dept.1 nodematch.dept.2 nodematch.dept.3 
##        0.1304249        0.4812754        0.5044971        0.3347476 
## nodematch.dept.4 
##        0.0000000
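To turn the estimates into tie probabilities, the department-specific coefficients can be added to the edges coefficient before applying the inverse logit. The sketch below uses the rounded estimates from the Model 1 summary and base R's plogis; the infinite coefficient for department 4 is left out.

```r
# Rounded Model 1 estimates, copied from the summary output.
theta <- c(edges = -1.89721,
           nodematch.dept.1 = -0.07493,
           nodematch.dept.2 = 0.01799,
           nodematch.dept.3 = -0.68679)

# Probability of a tie between members of different departments.
p_between <- plogis(theta[["edges"]])

# Probability of a tie inside each department: baseline plus the
# department-specific shift, then the inverse logit.
p_within <- plogis(theta[["edges"]] + theta[-1])

round(p_between, 4)
round(p_within, 4)
```

Read this way, ties inside department 3 are less likely than ties across departments, while departments 1 and 2 are close to the baseline.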

netnetwork <- migraph::as_network(netigraph)

The simulate function generates synthetic networks from a fitted ERGM. It takes the fitted model as input and returns a network drawn from the distribution that the model defines.

In this case, the model.sim object stores one network simulated from the fitted ERGM stored in model.

The simulate function can be useful for generating synthetic network data for a range of purposes, such as testing the robustness of a model to different data scenarios, comparing the fit of different models to synthetic data, or generating data for use in machine learning algorithms.
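For a model with only edges and nodematch terms, ties are independent given the departments, so the idea behind simulation can be sketched in base R without the ergm package. Everything below is hypothetical: the network size, the department labels, the coefficient values, and the seed. It illustrates the principle and is not how simulate is implemented.

```r
set.seed(1)                                # arbitrary seed for reproducibility
n <- 30                                    # hypothetical number of nodes
dept <- sample(1:3, n, replace = TRUE)     # hypothetical departments
theta_edges <- -1.9                        # hypothetical coefficients
theta_match <- -0.1

# Tie probability for every ordered pair: baseline log-odds plus the
# homophily shift when both nodes share a department.
same_dept <- outer(dept, dept, "==")
p_tie <- plogis(theta_edges + theta_match * same_dept)

# Draw each directed tie independently, then remove self-ties.
sim <- matrix(rbinom(n * n, size = 1, prob = p_tie), nrow = n)
diag(sim) <- 0

sum(sim) / (n * (n - 1))                   # density of the simulated network
```

Comparing such simulated networks with the observed one is the basis of the goodness-of-fit plots used in this report.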

model.sim <- simulate(model)
par(mfrow = c(1,2)) # This command enables side-by-side comparison.
plot(netnetwork, main = "Observed")
plot(model.sim, main = "Simulated")

par(mfrow = c(1,1))

## goodness of fit of transformed model

modelt.gof <- gof(model,  GOF = ~ distance + espartners + idegree + odegree)
par(mfrow = c(2,3))
plot(modelt.gof, plotlogodds = T)
par(mfrow = c(1,1))

Goodness of Fit of Transformed Model

Goodness-of-fit interpretation for the transformed model: in the plot of model statistics against the log-odds for a statistic, the line overestimates nodematch.dept.4.

In the GOF plot where the x-axis represents the out-degree and the y-axis represents the log-odds for a node, the line underestimates the points, so the fit is poor.

In the GOF plot where the x-axis represents the in-degree and the y-axis represents the log-odds for a node, the line underestimates some points and overestimates others, so the fit is poor.

In the GOF plot where the x-axis represents the number of edge-wise shared partners and the y-axis represents the log-odds for an edge, the line overestimates the edge-wise shared partners.

In the GOF plot where the x-axis represents the minimum geodesic distance and the y-axis represents the log-odds for a dyad, the line overestimates the minimum geodesic distance.

# mcmc.diagnostics reports summary statistics of the sampled model statistics, a test of whether they differ from the observed values, their cross-correlations and autocorrelations, and the Geweke burn-in diagnostic. Multi-chain measures such as the Gelman-Rubin statistic do not apply here because a single chain was run.

mcmc.diagnostics(model_ergm3)

## Sample statistics summary:
## 
## Iterations = 188416:3706880
## Thinning interval = 4096 
## Number of chains = 1 
## Sample size per chain = 860 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                   Mean    SD Naive SE Time-series SE
## edges           1.0849 29.05   0.9906         1.6580
## nodematch.dept -0.4488 15.04   0.5129         0.8184
## gwesp.fixed.0   0.1709 32.08   1.0940         1.6999
## 
## 2. Quantiles for each variable:
## 
##                  2.5%    25% 50%   75% 97.5%
## edges          -54.00 -19.00   0 20.00 58.00
## nodematch.dept -29.00 -11.00   0  9.00 30.00
## gwesp.fixed.0  -62.57 -22.25  -1 21.25 65.52
## 
## 
## Are sample statistics significantly different from observed?
##                edges nodematch.dept gwesp.fixed.0 Overall (Chi^2)
## diff.      1.0848837     -0.4488372     0.1709302              NA
## test stat. 0.6543331     -0.5484302     0.1005519    2.891476e+01
## P-val.     0.5128973      0.5833966     0.9199062    3.544936e-06
## 
## Sample statistics cross-correlations:
##                   edges nodematch.dept gwesp.fixed.0
## edges          1.000000      0.5565660     0.9836890
## nodematch.dept 0.556566      1.0000000     0.5444739
## gwesp.fixed.0  0.983689      0.5444739     1.0000000
## 
## Sample statistics auto-correlation:
## Chain 1 
##                 edges nodematch.dept gwesp.fixed.0
## Lag 0      1.00000000    1.000000000   1.000000000
## Lag 4096   0.43466117    0.330656757   0.413773269
## Lag 8192   0.22807005    0.174599387   0.207123167
## Lag 12288  0.10667554    0.120724292   0.088909622
## Lag 16384  0.01436788    0.064995717  -0.002961505
## Lag 20480 -0.02050709   -0.005900323  -0.032744188
## 
## Sample statistics burn-in diagnostic (Geweke):
## Chain 1 
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##          edges nodematch.dept  gwesp.fixed.0 
##         0.7231        -0.1286         0.6994 
## 
## Individual P-values (lower = worse):
##          edges nodematch.dept  gwesp.fixed.0 
##      0.4696430      0.8976794      0.4843046 
## Joint P-value (lower = worse):  0.8145242 .

## 
## MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).

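These diagnostics are for a model (model_ergm3) that also includes a gwesp term. They suggest that the MCMC simulation converged reasonably well: the mean of each sampled statistic is close to zero relative to its standard deviation, none of the individual tests against the observed values is significant, the autocorrelations decay to near zero by lag 16384, and the Geweke joint p-value is 0.81. Two points call for caution: the overall chi-square test has a very small p-value (3.5e-06), and the edges and gwesp statistics are almost perfectly correlated (0.98).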
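Some of the figures in the diagnostics output can be reproduced by hand, which helps in reading it. The sketch below copies the numbers from the output above and assumes that each test statistic is the mean difference divided by its time-series standard error, with a two-sided normal p-value.

```r
# Mean differences and time-series standard errors from the output above.
diff_stat <- c(edges = 1.0848837, nodematch.dept = -0.4488372,
               gwesp.fixed.0 = 0.1709302)
ts_se <- c(edges = 1.6580, nodematch.dept = 0.8184, gwesp.fixed.0 = 1.6999)

# Assumed form of the test: z statistic with a two-sided normal p-value.
z <- diff_stat / ts_se
p <- 2 * pnorm(-abs(z))
round(rbind(z = z, p = p), 4)

# Sample size implied by the iteration range and the thinning interval.
n_samples <- (3706880 - 188416) / 4096 + 1
n_samples
```

Under that assumption the results agree with the printed test statistics and p-values up to rounding, and the iteration range gives the reported 860 samples per chain.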

##Centrality exercises on the nodes dataset
nodes <- nodes %>%
  add_node_attribute("degree", node_is_max(node_degree(nodes))) %>%
  add_node_attribute("betweenness", node_is_max(node_betweenness(nodes))) %>%
  add_node_attribute("closeness", node_is_max(node_closeness(nodes))) %>%
  add_node_attribute("eigenvector", node_is_max(node_eigenvector(nodes)))

gd <- autographr(nodes, node_color = "degree") + 
  ggtitle("1A. Degree Centrality", subtitle = round(network_degree(nodes), 2)) 
gb <- autographr(nodes, node_color = "betweenness") + 
  ggtitle("1B. Betweenness Centrality", subtitle = round(network_betweenness(nodes), 2)) 
gc <- autographr(nodes, node_color = "closeness") + 
  ggtitle("1C. Closeness Centrality", subtitle = round(network_closeness(nodes), 2)) 
ge <- autographr(nodes, node_color = "eigenvector") + 
  ggtitle("1D. Eigenvector Centrality", subtitle = round(network_eigenvector(nodes), 2))
(gd | gb) / (gc | ge)

Stephanie Rose Flores

2022-12-15