Complex Network Theory CS60078: Department of Computer Science & Engineering, IIT Kharagpur
) = V(G), and an edge exists between 2 nodes $v_i$, $v_j$ in G
\[ \sum_{i \neq j} d_{ij} \]
Figure 2.6: Regular lattice - each node is linked to its four immediate neighbours
Figure 2.7: The graph $K_4$ (extreme left) and its planar embeddings
If the graph is disconnected, it makes sense to consider the harmonic mean of the distances,
that is, the reciprocal of the mean of the reciprocal distances; this is because the distance
between two nodes belonging to separate components is infinite, the reciprocal being 0.
\[ l = \langle d_{ij}^{-1} \rangle^{-1} = \left( \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij}^{-1} \right)^{-1} \]
2.12 Articulation Points
An articulation point or cut point is a vertex whose removal increases the number of
components in the graph. Such points are called brokers in social networks. Removal of
brokers creates communities that are totally isolated from each other.
2.13 Bridges
An edge is called a bridge if its removal increases the number of components in the graph.
Figure 2.8: Nodes G and H are articulation points
Figure 2.9: A graph with 6 bridges (highlighted in red)
2.14 Connection Density
The connection density in a graph is a metric used to estimate the number of connections
in the graph. It is defined as the ratio of the number of edges actually present in
the graph to the maximum number of edges possible.
\[ cd = \frac{|E|}{\binom{N}{2}} = \frac{2|E|}{N(N-1)} \]
An alternate interpretation of this metric can be the probability that an edge exists
between a randomly chosen pair of vertices.
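As a minimal sketch of the definition above, the following computes the connection density of a simple undirected graph. The function name `connection_density` and the edge-list representation are assumptions made for illustration only.

```python
def connection_density(num_nodes, edges):
    """Fraction of the possible undirected edges that are actually present."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes")
    # Treat each edge as an unordered pair; ignore duplicates and self-loops.
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    max_edges = num_nodes * (num_nodes - 1) // 2
    return len(unique) / max_edges


# A 4-cycle has 4 of the 6 possible edges.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(connection_density(4, square))
```

A complete graph gives a density of 1, and a graph with no edges gives 0.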
2.15 Chromatic Number
A proper colouring of a graph is an assignment of labels to each vertex of the graph such
that no two adjacent vertices receive the same label. The chromatic number of a graph
is the minimum number of colours required to achieve a proper colouring. This concept
finds applications in problems such as job scheduling and register allocation, among others.
2.16 Chordal Graphs
A graph is chordal if each of its cycles of four or more nodes has a chord, which is an
edge joining two nodes that are not adjacent in the cycle.
Figure 2.10: A 3-coloring for the Petersen graph
Figure 2.11: A cycle (black) with two chords (green)
Chapter 3
Basic Metrics for Network Analysis
In this section, we describe some metrics that are useful for statistical analysis of complex
networks.
3.1 Degree Distribution
We shall describe the concept of degree distribution with the help of an example. Consider
the case of citation networks. Scientific papers refer to works done earlier on related
topics via citations. In a citation network each node represents a scientific paper and
a directed edge from node A to B indicates that A has cited B. An important thing to
note is that citation networks are acyclic in nature.
Alfred Lotka analysed such networks in 1926. Lotka's Law describes the frequency
of publication by authors in any given field. It states that the number of authors making
n contributions to that field is approximately proportional to $n^{-a}$, where $a$ is close to 2.
Figure 3.1: An example power-law graph, being used to demonstrate ranking of popularity.
To the right is the long tail, and to the left are the few that dominate (also known as the
80-20 rule)
\[ P_k = \sum_{k'=k}^{\infty} p_{k'} \]
for discrete distributions, while for continuous distributions we have
\[ P_k = \int_{k}^{\infty} p_{k'} \, dk' \]
So, $P_k$ can also be interpreted as the probability that the degree of a node selected
uniformly at random is greater than or equal to k.
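The discrete case can be sketched as follows, assuming the degrees of all nodes are available as a plain list. Here p_k is the fraction of nodes with degree exactly k and P_k the fraction with degree at least k; the function names are illustrative assumptions.

```python
from collections import Counter


def degree_distribution(degrees):
    """Return {k: p_k}, the fraction of nodes with degree exactly k."""
    n = len(degrees)
    counts = Counter(degrees)
    return {k: c / n for k, c in sorted(counts.items())}


def cumulative_distribution(degrees):
    """Return {k: P_k}, the fraction of nodes with degree >= k."""
    p = degree_distribution(degrees)
    return {k: sum(pk for kk, pk in p.items() if kk >= k) for k in p}


degrees = [1, 1, 1, 2, 2, 4]
print(degree_distribution(degrees))
print(cumulative_distribution(degrees))
```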
3.1.3 Scale-Free Functions
A scale-free function f(x) is one whose functional form is unaffected when the independent
variable x is rescaled. Mathematically,
\[ f(ax) = b f(x) \]
Figure 3.2: Random versus Scale-free network
Power Laws are scale-free functions, that is, at any scale, they still show power law
behaviour. Other examples where such behaviour is manifested include fractals.
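As a short worked check of the scale-free condition f(ax) = b f(x), consider a power law written with a constant C and an exponent gamma (both symbols are notation assumed here for illustration):

```latex
\[
f(x) = C x^{-\gamma}
\quad\Longrightarrow\quad
f(ax) = C (ax)^{-\gamma} = a^{-\gamma} \, C x^{-\gamma} = a^{-\gamma} f(x),
\]
so that $b = a^{-\gamma}$ depends only on the rescaling factor $a$ and not on $x$.
By contrast, for the exponential $f(x) = e^{-x}$,
\[
\frac{f(ax)}{f(x)} = e^{-(a-1)x},
\]
which still depends on $x$, so no constant $b$ exists and the function is not scale-free.
```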
Figure 3.3: The Mandelbrot set is a famous example of a fractal
3.2 Transitivity
A transitive network is one in which for any 3 nodes a, b and c, if there exists an edge
between a and b, and between b and c, then there exists an edge between a and c as well.
3.2.1 Measuring Transitivity: Clustering Coefficient
The clustering coefficient for a vertex v in a network is defined as the ratio of
the total number of connections among the neighbours of v to the total number of possible
connections between the neighbours. Mathematically,
\[ C_v = \frac{L}{\binom{n}{2}} \]
where L = the number of actual links between the neighbours of v, and n = the number
of neighbours of v.
The clustering index of the whole network is the average of the clustering coefficients
of all the vertices. That is,
\[ C = \frac{1}{N} \sum_{v} C_v \]
Note that the higher the clustering index, the larger the number of triangles in the network.
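The two definitions can be sketched in a few lines, assuming an undirected graph stored as a dictionary from each vertex to the set of its neighbours. Assigning a coefficient of 0 to vertices with fewer than two neighbours is a convention assumed here, since the ratio is undefined in that case.

```python
from itertools import combinations


def local_clustering(adj, v):
    """C_v = L / C(n, 2), with L links among the n neighbours of v."""
    neighbours = adj[v]
    n = len(neighbours)
    if n < 2:
        return 0.0  # assumed convention: no pair of neighbours exists
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return links / (n * (n - 1) / 2)


def clustering_index(adj):
    """Average of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(adj, 2), clustering_index(adj))
```

In this example vertex 2 has three neighbours of which only one pair (0, 1) is linked, so its coefficient is 1/3.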
Figure 3.4: Local clustering coefficient values
Figure 3.5: The most important vertices according to degree-centrality (red)
3.3 Centrality
Centrality is a measure indicating the importance of a node in the network. Commonly, it
measures the 4 Ps - prestige, prominence, (im)portance and power. We will get a better
idea of what is meant by importance as the section progresses.
3.3.1 Degree Centrality
Degree centrality is defined as the ratio of the number of neighbours of a vertex to
the total number of neighbours possible. Mathematically,
\[ \text{Degree Centrality} = \frac{k}{N-1} \]
where k is the degree of the vertex, and N is the total number of nodes in the network.
The variance of the distribution of degree centrality in a network gives us the centralization
of the network. One can see that a star network is an ideal centralized network,
whereas a line network is less centralized.
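To illustrate the comparison between a star and a line, here is a small sketch that computes degree centrality k / (N - 1) for every vertex and takes the population variance of those values as the centralization. The adjacency-dictionary format and the function names are assumptions for illustration.

```python
from statistics import pvariance


def degree_centrality(adj):
    """Map each vertex to k / (N - 1)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def centralization(adj):
    """Variance of the degree centrality values over all vertices."""
    return pvariance(list(degree_centrality(adj).values()))


# Five-node star (centre 0) and five-node line 0-1-2-3-4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
line = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(centralization(star), centralization(line))
```

The star yields the larger variance, in line with the remark that it is the more centralized of the two.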
Figure 3.6: Star Network and Line Network
3.3.2 Betweenness Centrality
The degree of a node is not the only measure of the importance of a node in the network,
and this centrality measure addresses this fact. This concept was introduced by
Linton Freeman. In his conception, vertices that have a high probability of occurring on
a randomly chosen shortest path between two nodes are said to have high betweenness
centrality.
Formally, the betweenness centrality of a vertex v is defined as the sum, over all pairs of
nodes s and t, of the number of geodesic paths between s and t that pass through v, expressed
as a fraction of the total number of geodesic paths between s and t. Mathematically,
\[ g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} \]
where $\sigma_{st}$ is the total number of geodesic paths between s and t, and $\sigma_{st}(v)$ is the
number of those paths that pass through v.
If v is an articulation point, then we can further simplify this as follows. Let the two
components that the removal of v divides the graph into be $C_1$ and $C_2$, with $N_1$ and $N_2$
nodes respectively. Every geodesic path between a node in $C_1$ and a node in $C_2$ must pass
through v, so each such pair contributes 1. Then,
\[ g(v) = 2 \sum_{s \in C_1,\, t \in C_2} \frac{\sigma_{st}(v)}{\sigma_{st}} = 2 \sum_{s \in C_1,\, t \in C_2} 1 = 2 N_1 N_2 \]
Removal of a node with high betweenness centrality can lead to increase in the geodesic
path lengths, and in the extreme case, the network might even get disconnected as ex-
hibited in the case above. In real world networks, this can be important; for example, to
prevent the spread of a disease in an epidemic network.
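The definition of betweenness can be checked on small graphs with the brute-force sketch below. It assumes an undirected, unweighted graph given as an adjacency dictionary, runs a breadth-first search from every node to count geodesic paths, and uses the fact that a geodesic from s to t passes through v exactly when d(s, v) + d(v, t) = d(s, t). This is an illustrative approach, not an efficient algorithm for large networks.

```python
from collections import deque


def bfs_counts(adj, s):
    """Distances from s and the number of geodesic paths from s to each node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adj, v):
    """Sum over ordered pairs (s, t), s != v != t, of sigma_st(v) / sigma_st."""
    info = {s: bfs_counts(adj, s) for s in adj}
    dist_v, sigma_v = info[v]
    total = 0.0
    for s in adj:
        dist_s, sigma_s = info[s]
        if s == v or v not in dist_s:
            continue
        for t in adj:
            if t == s or t == v or t not in dist_s or t not in dist_v:
                continue
            if dist_s[v] + dist_v[t] == dist_s[t]:
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total


# Line 0-1-2-3-4: vertex 2 separates {0, 1} from {3, 4}, so g = 2 * 2 * 2 = 8.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(betweenness(line, 2))
```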
Figure 3.7: Hue (from red=0 to blue=max) shows the node betweenness.
3.3.3 Flow Betweenness
Suppose two nodes are connected by a reluctant broker (cut vertex), that is, the shortest
path between them is blocked. Then, the nodes should use another pathway which is
connecting them, rather than simply using the geodesic path.
The flow betweenness measure thus expands the notion of betweenness centrality. It
assumes that any two nodes would use all the paths connecting them, instead of only
using the shortest path. However, it is to be noted that calculating flow betweenness is
computationally intractable.
3.3.4 Eigenvector Centrality
This metric assigns relative scores to all nodes in the network based on the concept that
connections to high-scoring nodes contribute more to the score of the node in question
than equal connections to low-scoring nodes. For example, consider the context of HIV
transmission. A person x with one sexual partner is seemingly less prone to the disease
than a person y with multiple partners. However, we must also take into account the
number of partners that the sexual partner of x has. That is, it is not enough to merely
gauge the popularity of a particular node on the basis of its degree; we must also take into
account the popularity of its neighbours as well. This is the basic idea of eigenvector
centrality.
We now proceed to define the centrality value of a vertex as a sum of centralities of
its neighbours. To begin with, we initially guess that a vertex i has centrality $x_i(0)$. We
gradually improve this estimate by employing a Markov model, and continue in this manner
until no more improvement is observed. The improvement made at step t is defined
as,
\[ x_i(t) = \sum_j A_{ij} x_j(t-1) \]
\[ x(t) = A x(t-1) = A^t x(0) \]
This is known as the Power Iteration method proposed by Hotelling.
Now, express x(0) as a linear combination of eigenvectors $v_i$ of the adjacency matrix A
\[ x(0) = \sum_i c_i v_i \]
\[ x(t) = A^t \sum_i c_i v_i \]
We know from our knowledge of eigenvectors that $A^t v_i = \lambda_i^t v_i$ holds, where $\lambda_i$ is the
eigenvalue corresponding to $v_i$. Using this with the equation above, we have
\[ x(t) = \sum_i \lambda_i^t c_i v_i = \lambda_1^t \sum_i \left( \frac{\lambda_i}{\lambda_1} \right)^t c_i v_i \]
\[ \frac{x(t)}{\lambda_1^t} = \sum_i \left( \frac{\lambda_i}{\lambda_1} \right)^t c_i v_i \]
where $\lambda_1$ is the eigenvalue of largest magnitude. In the limit $t \to \infty$, $\left( \frac{\lambda_i}{\lambda_1} \right)^t$ remains
only for i = 1. Thus,
\[ \lim_{t \to \infty} \frac{x(t)}{\lambda_1^t} = c_1 v_1 \]
Thus, we get that the limiting centrality is proportional to the principal eigenvector $v_1$.
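The power iteration described above can be sketched as follows. Rescaling the vector by its largest entry at every step is an addition assumed here to keep the numbers bounded; it does not change the direction of the vector. The example graph is chosen to be non-bipartite, since on a bipartite graph the plain iteration may oscillate rather than converge.

```python
def eigenvector_centrality(A, iterations=100):
    """Repeat x(t) = A x(t-1), rescaling so that the largest entry is 1."""
    n = len(A)
    x = [1.0] * n  # initial guess x(0)
    for _ in range(iterations):
        x_new = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(x_new)
        if norm == 0:
            return x_new  # every centrality vanished (e.g. an acyclic digraph)
        x = [value / norm for value in x_new]
    return x


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
x = eigenvector_centrality(A)
print(x)
```

Vertex 2 receives the highest score, and the pendant vertex 3 scores lower than vertices 0 and 1 even though it is attached to the best-connected vertex.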
Note that directed acyclic networks suffer from the problem of zero centrality. If there
exists a node A with no incoming edges, then this node has zero centrality (the assumption
seems reasonable for a web page). Consider another node B that has one incoming
edge from A. Then the eigenvector centrality of B is 0 because the centrality of A is 0.
Hence, in a similar fashion, all the centralities in an acyclic network become 0. We will
see how this problem is remedied by the Katz Centrality metric.
3.3.5 Katz Centrality
As discussed previously, the eigenvector centrality metric suffers from the problem of zero
centrality. Katz Centrality resolves this issue by assigning to each node a priori some
positive centrality value. This is done according to the following equation
\[ x_i = \alpha \sum_j A_{ij} x_j + \beta \]
Note that $\alpha, \beta > 0$. In matrix terms, the above equation is equivalent to
\[ x = \alpha A x + \beta \mathbf{1} \]
where $\mathbf{1} = (1, 1, \ldots, 1)^T$. On simplifying, we obtain
\[ x = \beta (I - \alpha A)^{-1} \mathbf{1} \]
Instead of inverting the matrix as above, we can alternatively iterate over the following
equation until convergence
\[ x(t) = \alpha A x(t-1) + \beta \mathbf{1} \]
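The iterative form can be sketched as below, writing alpha for the attenuation factor and beta for the a priori centrality. The example is a directed acyclic chain, where eigenvector centrality would be zero everywhere. The convention that A[i][j] = 1 means node i receives an edge from node j is an assumption chosen to match the sum over j in the update, and the tolerance and iteration cap are arbitrary illustrative values. The iteration is only expected to converge when alpha is smaller than the reciprocal of the largest eigenvalue of A.

```python
def katz_centrality(A, alpha, beta=1.0, tol=1e-12, max_iter=10000):
    """Iterate x(t) = alpha * A x(t-1) + beta * 1 until the change is below tol."""
    n = len(A)
    x = [0.0] * n
    for _ in range(max_iter):
        x_new = [
            alpha * sum(A[i][j] * x[j] for j in range(n)) + beta
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("no convergence; alpha may be too large")


# Chain 0 -> 1 -> 2, stored so that A[i][j] = 1 when there is an edge j -> i.
chain = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
]
print(katz_centrality(chain, alpha=0.5))
```

Node 0 has no incoming edges but still receives the baseline value beta, and the nodes further down the chain accumulate progressively more.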
3.3.6 PageRank
PageRank is a link analysis algorithm, named after Larry Page and used by the Google
Internet search engine, that assigns a numerical weighting to each element of a hyperlinked
set of documents, such as the World Wide Web. The PageRank of a page is defined recursively
and depends on the number and PageRank metric of all pages that link to it.
A page that is linked to by many pages with high PageRank receives a high rank itself.
If there are no links to a web page there is no support for that page.
Simply put, the algorithm can be described as follows. PageRank can be thought of
as a probability distribution representing the likelihood that a person randomly clicking
on links will arrive at any particular page. The PageRank computations require several
passes through the collection to adjust approximate PageRank values to more closely reflect
the theoretical true value.
Essentially, PageRank is nothing but a variant of Katz Centrality. It can be mathematically
expressed as follows.
\[ x_i = \alpha \sum_j A_{ij} \frac{x_j}{k_j^{out}} + \beta \]
where $k_j^{out}$ is the out-degree of node j. This normalization is done to obtain a stochastic
matrix (a matrix where either all the rows or all the columns sum to one). Note that
the above definition does not take into account the possibility of $k_j^{out} = 0$. To solve
this problem, set $k_j^{out} = 1$ in the above calculation, since a vertex with zero out-degree
contributes zero to centralities of other vertices. In matrix terms, we have
\[ x = \alpha A D^{-1} x + \beta \mathbf{1} \]
\[ x = \beta (I - \alpha A D^{-1})^{-1} \mathbf{1} \]
where D is a diagonal matrix such that
\[ D_{ii} = \max \{ k_i^{out}, 1 \} \]
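A minimal sketch of this computation by fixed-point iteration is given below, writing alpha for the weight on incoming links and beta for the constant term. It assumes the convention that A[i][j] = 1 when there is a link from page j to page i, so the out-degree of j is the j-th column sum. The default alpha = 0.85 is a commonly quoted choice and not a value taken from the text; the tolerance is likewise arbitrary.

```python
def pagerank(A, alpha=0.85, beta=1.0, tol=1e-12, max_iter=10000):
    """Iterate x = alpha * A D^-1 x + beta * 1 with D_jj = max(k_out_j, 1)."""
    n = len(A)
    # Out-degree of j is the sum of column j under the assumed convention.
    d = [max(sum(A[i][j] for i in range(n)), 1) for j in range(n)]
    x = [beta] * n
    for _ in range(max_iter):
        x_new = [
            alpha * sum(A[i][j] * x[j] / d[j] for j in range(n)) + beta
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("no convergence")


# Directed 3-cycle 0 -> 1 -> 2 -> 0: by symmetry all pages get the same score.
cycle = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
print(pagerank(cycle))
```

For the symmetric cycle each score satisfies x = alpha * x + beta, that is, x = beta / (1 - alpha).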
3.3.6.1 Random Walks
A random walk is a mathematical formalisation of a trajectory that consists of taking
successive random steps. It was introduced by Karl Pearson in 1905.
Random walks are useful to analyze web surfing and to calculate PageRank values. Consider
web surfing: initially, a page is chosen uniformly at random. With probability
$\alpha$, the surfer performs a random walk by randomly choosing one of the hyperlinks in that page,
and with probability $1 - \alpha$, the surfer stops the random walk. We already know that
the steady state probability that a web page is visited during web surfing represents its
PageRank.
The transition matrix for web surfing is obtained from the adjacency matrix representing
the underlying graph structure. The transition matrix is a stochastic matrix in which all rows
sum to 1, and is thus obtained by dividing each number in each row by the sum of the
elements in that row in the adjacency matrix. Essentially, an entry in the transition matrix
represents the probability with which that link is chosen.
As an example, consider the following graph and its equivalent adjacency matrix
\[ \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \]
For the above graph, the transition matrix is given as,
\[ \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1/2 & 1/2 & 0 \end{pmatrix} \]
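The row normalisation can be reproduced with the short sketch below, which uses exact fractions so that the result can be compared directly with the transition matrix of the three-node example. The helper `step`, which advances a probability distribution over pages by one move of the walk, is an illustrative addition.

```python
from fractions import Fraction


def transition_matrix(A):
    """Divide every row of the adjacency matrix by its row sum."""
    T = []
    for row in A:
        total = sum(row)
        if total == 0:
            raise ValueError("a row with no outgoing links cannot be normalised")
        T.append([Fraction(a, total) for a in row])
    return T


def step(p, T):
    """One step of the walk: p_next[j] = sum_i p[i] * T[i][j]."""
    n = len(T)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]


A = [[0, 1, 0], [0, 0, 1], [1, 1, 0]]
T = transition_matrix(A)
start = [Fraction(1, 3)] * 3  # a page chosen uniformly at random
print(T)
print(step(start, T))
```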
Here, we pictorially show a random walk on this network.
Figure 3.8: Random walk on the graph
3.3.7 Hubs and Authorities
Hyperlink-Induced Topic Search (HITS) (also known as hubs and authorities)
is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. It was a
precursor to PageRank. The idea behind Hubs and Authorities stemmed from a particular
insight into the creation of web pages when the Internet was originally forming; that
is, certain web pages, known as hubs, served as large directories that were not actually
authoritative in the information that they held, but were used as compilations of a broad
catalog of information that led users directly to other authoritative pages. In other words,
a good hub represented a page that pointed to many other pages, and a good authority
represented a page that was linked by many different hubs.
The scheme therefore assigns two scores for each page: its authority, which estimates
the value of the content of the page, and its hub value, which estimates the value of its
links to other pages. Mathematically, these two centrality values are expressed as follows.
The authority centrality of a node ($x_i$) is proportional to the sum of hub centralities
of nodes ($y_j$) pointing to it, and is defined as
\[ x_i = \sum_j A_{ji} y_j \]
The hub centrality of a node is proportional to the sum of authority centralities of the nodes
that it points to, and is defined as
\[ y_i = \sum_j A_{ij} x_j \]
In matrix terms, $x = A^T y$, and $y = Ax$. Solving these two equations gives us, up to a
constant of proportionality,
\[ x = A^T A x \]
\[ y = A A^T y \]
where x converges to the principal eigenvector of $A^T A$, and y converges to the principal
eigenvector of $A A^T$.
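The two coupled updates can be sketched as an alternating iteration. The sketch assumes A[i][j] = 1 when page i links to page j, which is the convention implied by the index order in the equations above. Rescaling each vector to sum to 1 after every round is an assumed normalisation that fixes the constant of proportionality.

```python
def hits(A, iterations=100):
    """Alternate x = A^T y (authorities) and y = A x (hubs), normalising each."""
    n = len(A)
    x = [1.0] * n  # authority scores
    y = [1.0] * n  # hub scores
    for _ in range(iterations):
        x = [sum(A[j][i] * y[j] for j in range(n)) for i in range(n)]
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        sx, sy = sum(x), sum(y)
        if sx == 0 or sy == 0:
            break  # graph without edges: nothing to rank
        x = [v / sx for v in x]
        y = [v / sy for v in y]
    return x, y


# Pages 0 and 1 act as hubs: 0 -> 2, 0 -> 3, 1 -> 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(A)
print(authority, hub)
```

Page 2 is linked by both hubs and ends up as the strongest authority, while page 0, which links to both authorities, is the strongest hub.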
3.3.8 Co-citation Index & Bibliographic Coupling
Bibliographic coupling occurs when two works reference a common third work
in their bibliographies, while co-citation occurs when two works are both cited by a common
third work. Either is an indication that the two works treat related subject matter.
For example, consider works A & B that both cite works C & D. A & B have never
cited each other, nor have C & D. However, intuitively, there seems to be a latent relationship
between A & B, as well as between C & D. The relationship between A & B is
captured by bibliographic coupling and is given by the corresponding entry of the matrix $A A^T$,
where A now denotes the adjacency matrix of the citation network. Similarly, the relationship
between C & D is captured by the co-citation index and is given by the corresponding entry of $A^T A$.
3.3.9 Closeness Centrality
The Closeness Centrality measure uses not only the neighbors of a node to determine
its centrality, but also the neighbors of the neighbors. Therefore, nodes that are not
directly connected to the given node are also considered. Nodes that are not directly connected
with the given node receive a lower weight because the intensity of their relation
or their influence is lower.
Formally, closeness centrality is a measure of the mean distance from the given node
i to all other nodes. Let $d_{ij}$ be the length of the geodesic path from node i to node j.
Then, the mean geodesic distance from vertex i to the other nodes can be expressed as
\[ l_i = \frac{1}{N} \sum_j d_{ij} \]
where N = total number of nodes. Note that when j = i, $d_{ij} = d_{ii} = 0$; so, it is better to
use
\[ l_i = \frac{1}{N-1} \sum_{j \neq i} d_{ij} \]
The mean geodesic distance gives low values for more central vertices. Therefore, we
consider the reciprocal as the value of the centrality, and
\[ C_i = l_i^{-1} = \frac{N}{\sum_j d_{ij}} \]
However, this expression suffers from sparsely placed values, and does not account for
disconnected network components. The measure therefore can be further refined by considering
the harmonic mean. The centrality value then becomes
\[ C_i = \frac{1}{N-1} \sum_{j \neq i} d_{ij}^{-1} \]
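The harmonic form can be sketched as follows for an unweighted, undirected graph stored as an adjacency dictionary (an assumed representation). Nodes that cannot be reached are simply absent from the breadth-first search result, so they contribute a reciprocal distance of 0, which is how this variant copes with disconnected components.

```python
from collections import deque


def bfs_distances(adj, source):
    """Geodesic distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def harmonic_closeness(adj, i):
    """C_i = (1 / (N - 1)) * sum over j != i of 1 / d_ij."""
    dist = bfs_distances(adj, i)
    n = len(adj)
    return sum(1.0 / d for node, d in dist.items() if node != i) / (n - 1)


# Line 0-1-2 plus an isolated node 3.
adj = {0: [1], 1: [0, 2], 2: [1], 3: []}
print([harmonic_closeness(adj, v) for v in adj])
```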
3.4 Reciprocity
The concept of reciprocity can be described as follows. If there is a directed edge from
node i to node j in a directed network and there is also an edge from node j to i, then
the edge from i to j is said to be reciprocated. Pairs of reciprocated edges are called
co-links.
Formally, the reciprocity r is defined as the fraction of edges that are reciprocated.
Thus, it can be expressed as
\[ r = \frac{1}{m} \sum_{ij} A_{ij} A_{ji} \]
where m is the total number of edges.
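A direct sketch of this formula, assuming a binary adjacency matrix without self-loops (a self-loop would otherwise count as reciprocated):

```python
def reciprocity(A):
    """r = (1 / m) * sum over i, j of A_ij * A_ji for a directed graph."""
    n = len(A)
    m = sum(A[i][j] for i in range(n) for j in range(n))
    if m == 0:
        raise ValueError("reciprocity is undefined for a graph with no edges")
    mutual = sum(A[i][j] * A[j][i] for i in range(n) for j in range(n))
    return mutual / m


# Edges 0 -> 1, 1 -> 0 and 1 -> 2: two of the three edges are reciprocated.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(reciprocity(A))
```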
3.5 Rich-Club Coefficient
In a network, when influential people (nodes) come together to collaborate on something,
they form what is called a Rich club. As an example, hubs in a network are generally
densely connected, and form a rich-club.
Formally, the rich-club of degree k of a network G = (V, E) is the set of vertices with
degree greater than k. This can be mathematically expressed as,
\[ R(k) = \{ v \in V \mid k_v > k \} \]
The rich-club coefficient of degree k is given by,
\[ \frac{\#\mathrm{edge}(i, j)}{|R(k)| \, (|R(k)| - 1)}, \quad \text{where } i, j \in R(k) \]
3.6 Entropy of Degree Distribution
The entropy of the degree distribution of a network provides an average measure of
its heterogeneity. Mathematically, it can be expressed as,
\[ H = -\sum_k p_k \log(p_k) \]
Intuitively, we can see that the entropy of the degree distribution of a regular graph is 0,
and is maximum for a graph having a degree distribution that is distributed uniformly.
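Both extremes can be verified with a small sketch. It takes the list of node degrees as input and uses the natural logarithm, which is an assumption since the base of the logarithm is not fixed above.

```python
from collections import Counter
from math import log


def degree_entropy(degrees):
    """H = -sum_k p_k log(p_k), with p_k the fraction of nodes of degree k."""
    n = len(degrees)
    return sum(-(c / n) * log(c / n) for c in Counter(degrees).values())


regular = [3, 3, 3, 3]   # every node has the same degree
spread = [1, 2, 3, 4]    # four degrees, each equally likely
print(degree_entropy(regular), degree_entropy(spread))
```

For the regular graph the only term is p = 1, giving H = 0, while four equally likely degrees give H = log 4.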
3.7 Matching Index
A matching index can be assigned to each edge in a network in order to quantify the
similarity between the connectivity pattern of the two vertices incident to that edge. A
low value of matching index would indicate dissimilar regions of the network, with the
edge serving as a shortcut between distant regions in the network.
Formally, the matching index of edge (i, j) is defined as
\[ \mu_{ij} = \frac{\sum_{k \neq i,j} A_{ik} A_{kj}}{\sum_{k \neq j} A_{ik} + \sum_{k \neq i} A_{jk}} \]
Chapter 4
Social Networks
In this section, we study properties and metrics that are useful to describe and analyze
social networks.
4.1 Assortativity
Assortativity (also known as homophily) can be described as the preference for a network's
nodes to attach to others that are similar or different in some way. Nodes that attach to
similar nodes are termed assortative, and nodes that attach to different nodes are termed
disassortative. Though the specific measure of similarity may vary, network theorists often
examine assortativity in terms of a node's degree.
4.1.1 Measuring Assortativity
One means of capturing the degree correlation is by examining the properties of $k_{nn}$, or
the average degree of neighbors of a node with degree k. This term is formally defined as
\[ k_{nn} = \sum_{k'} k' P(k' \mid k) \]
where $P(k' \mid k)$ is the conditional probability that an edge of a node of degree k points to a
node with degree $k'$.
\[ \sum_{ij} e_{ij} = 1, \qquad \sum_j e_{ij} = a_i, \qquad \sum_i e_{ij} = b_j \]
where $a_i$ and $b_i$ are the fractions of each type of an edge's end that is attached to nodes of
type i (see Figure). Then, an assortativity coefficient, a measure of the strength of similarity
or dissimilarity between two nodes on a set of discrete characteristics, can be defined as
\[ r = \frac{\sum_i e_{ii} - \sum_i a_i b_i}{1 - \sum_i a_i b_i} \]
This formula yields r = 0 when there is no assortative mixing, and r = 1 when the network
is perfectly assortative. If the network is perfectly disassortative, the formula yields
\[ r_{min} = -\frac{\sum_i a_i b_i}{1 - \sum_i a_i b_i} \]
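The three regimes can be checked numerically with the sketch below. It assumes that e is a mixing matrix whose entry e[i][j] is the fraction of edges joining a node of type i to a node of type j; that reading is an assumption, since the surviving text does not define e explicitly. The row sums give a_i and the column sums give b_j.

```python
def assortativity(e):
    """r = (sum_i e_ii - sum_i a_i b_i) / (1 - sum_i a_i b_i)."""
    n = len(e)
    a = [sum(e[i][j] for j in range(n)) for i in range(n)]  # row sums
    b = [sum(e[i][j] for i in range(n)) for j in range(n)]  # column sums
    trace = sum(e[i][i] for i in range(n))
    ab = sum(a[i] * b[i] for i in range(n))
    # Note: the ratio is undefined when sum_i a_i b_i = 1 (a single type).
    return (trace - ab) / (1 - ab)


perfect = [[0.5, 0.0], [0.0, 0.5]]      # edges only within a type
no_mixing = [[0.25, 0.25], [0.25, 0.25]]  # e_ij = a_i * b_j
opposite = [[0.0, 0.5], [0.5, 0.0]]     # edges only between types
print(assortativity(perfect), assortativity(no_mixing), assortativity(opposite))
```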
Figure 4.1: Mixing Patterns
4.2 Signed Graphs
A signed graph is a graph in which each edge has a positive or negative sign. Such graphs
have been used to model social situations, with positive edges representing friendships and
negative edges representing enmities between nodes, which represent people. The sign of
a cycle in the graph is defined to be the product of the signs of its edges; in other words,
a cycle is positive if it contains an even number of negative edges and negative if it
contains an odd number of negative edges. A signed graph, or a subgraph or edge set, is
called balanced if every cycle in it is positive.
Figure 4.2: Triads (a) & (b) are stable configurations, while (c) & (d) are unstable
4.2.1 Stability of Cycles
Positive cycles are supposed to be stable social situations, whereas negative cycles are
supposed to be unstable. For example, consider the case of triads, or possible 3-cycles.
Then, a stable triad is either three mutual friends, or two friends with a common enemy;
while an unstable 3-cycle is either three mutual enemies, or two enemies who share a
mutual friend. According to the theory, in the case of three mutual enemies, this is
because sharing a common enemy is likely to cause two of the enemies to become friends.
In the case of two enemies sharing a friend, the shared friend is likely to choose one over
the other and turn one of his or her friendships into an enmity.
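The rule for triads can be sketched by multiplying the three edge signs of every triangle in a signed graph. The representation below, a dictionary from unordered node pairs to +1 (friendship) or -1 (enmity), and the function name are assumptions for illustration.

```python
from itertools import combinations


def unstable_triads(sign):
    """Return all 3-cycles whose product of edge signs is negative."""
    nodes = sorted({v for edge in sign for v in edge})
    unstable = []
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(edge in sign for edge in edges):
            product = sign[edges[0]] * sign[edges[1]] * sign[edges[2]]
            if product < 0:
                unstable.append((a, b, c))
    return unstable


sign = {
    frozenset(("A", "B")): +1,
    frozenset(("B", "C")): +1,
    frozenset(("A", "C")): -1,  # A and C are enemies who share the friend B
    frozenset(("A", "D")): -1,
    frozenset(("C", "D")): -1,  # A, C and D are three mutual enemies
}
print(unstable_triads(sign))
```

Both triangles in this example are unstable: one is a pair of enemies with a common friend, the other is three mutual enemies.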
4.3 Structural Holes
In a social network, structural holes are gaps that separate non-redundant sources of
information, that is, sources that are additive rather than overlapping.
Contacts that are strongly connected to each other are likely to have similar information
and therefore provide redundant information benefits. Similarly, contacts
that link a manager to the same third parties have the same sources of information and
therefore provide redundant information benefits.
4.4 Social Cohesiveness
Social cohesiveness refers to the closeness of the members in the social network. In
graph theoretic terms, it refers to the cliquishness of a graph. However, a complete
clique is too strict to be practical and is rarely observed in social networks. In most real
world groups, there are bound to exist at least a few members who are not connected to
each other. We define some other relaxed and practical measures of social cohesiveness.
A k-clique is a maximal set S of nodes in which the geodesic path between every pair
of nodes $\{u, v\} \subseteq S$ is less than or equal to k. As an example, consider the network in
Figure 4.3.
{a, b, c, f, e} forms a 2-clique, as the node d causes the distance between the nodes c
and e to be 2, even though it is not a part of the 2-clique. Thus, k-cliques might not be
as cohesive as they look. To resolve this issue, we consider k-clans.
Figure 4.3: {a, b, c, f, e} forms a 2-clique
A k-clan is a k-clique in which the subgraph induced by S has diameter less than or
equal to k. In the previous figure, {b, c, d, e, f} forms a 2-clan. Note that {b, e, f} also
induces a subgraph that has diameter 2, but it does not form a 2-clan, as it is not a maximal
set. If we relax the maximality condition on k-clans, we get a k-club. {a, b, f, e}
forms a k-club in the network. It is easy to see that any k-clan is both a k-clique and a
k-club.
A k-plex is a maximal subset S of nodes such that every member of the set is connected
to at least n - k other members, where n is the size of S. In Figure 4.4, we see
that {a, b, e, d} forms a 2-plex.
Figure 4.4: {a, b, e, d} forms a 2-plex
A k-core of a graph is a maximal subgraph such that each node in the subgraph
has degree at least k, as shown in Figure 4.5.
Figure 4.5: k-cores
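One common way to obtain a k-core is to repeatedly delete vertices of degree less than k until none remain; the sketch below follows that pruning idea. The graph format (a dictionary of neighbour sets for an undirected graph) and the function name are assumptions for illustration.

```python
def k_core(adj, k):
    """Vertex set of the k-core, found by pruning vertices of degree < k."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    changed = True
    while changed:
        changed = False
        low = [v for v, nbrs in remaining.items() if len(nbrs) < k]
        for v in low:
            # Remove v and erase it from the neighbour sets of its neighbours.
            for w in remaining.pop(v):
                remaining[w].discard(v)
            changed = True
    return set(remaining)


# Triangle 0-1-2 with a tail 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(k_core(adj, 2))
```

Removing vertex 4 lowers the degree of vertex 3 below 2, so the tail is peeled off one vertex at a time and only the triangle survives in the 2-core.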
4.5 Equivalence
In social networks, the different roles, or positions, or social categories are defined by the
relations among the actors, represented as nodes. Two actors have the same position or
role to the extent that their pattern of relationships with the other actors is the same.
But how does one define such a similarity? To do this, we need to define some measure
of equivalence between actors.
There are many ways in which actors could be defined as equivalent based on their relations
with others. Three particular definitions of equivalence have been particularly
useful in applying graph theory to the understanding of social roles and structural positions,
namely, structural equivalence, automorphic equivalence, and regular equivalence.
4.5.1 Structural Equivalence
Two nodes are said to be exactly structurally equivalent if they have the same rela-
tionships with all other nodes, that is, one should be perfectly substitutable by the other.
Simply put, the two must be connected to exactly the same set of neighbors. However,
exact structural equivalence is likely to be rare, particularly in large networks. Therefore,
there is a need to examine the degree of structural equivalence, rather than the sim-
ple presence or absence of exact equivalence.
The degree of equivalence between two nodes i and j can be measured by examining
the number of common neighbors between the two nodes. Formally, it can be expressed
as,
\[ n_{ij} = \sum_k A_{ik} A_{jk} \]
which, incidentally, is nothing but the ij-th element of the matrix $A^2$ for undirected graphs.
Note that this is closely related to the cocitation measure in directed networks. Moreover,
since we are measuring the extent of similarity, the above quantity must be appropriately
normalized. Therefore, the measure can be refined by some alternate considerations
which we enumerate below.
Figure 4.6: Different structural equivalence classes
4.5.1.1 Cosine Similarity
This similarity measure is defined as the inner product of two vectors, normalized by their
lengths. That is,
\[ \text{similarity}(x, y) = \cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|} \]
Consider the i-th and j-th rows of the adjacency matrix A as vectors. Then the cosine
similarity between vertices i and j is
\[ \sigma_{ij} = \frac{\sum_k A_{ik} A_{jk}}{\sqrt{\sum_k A_{ik}^2} \sqrt{\sum_k A_{jk}^2}} = \frac{n_{ij}}{\sqrt{k_i k_j}} \]
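For a binary adjacency matrix the squared entries equal the entries themselves, so the norms reduce to square roots of the degrees. The sketch below assumes such a matrix and, as a convention chosen here, returns 0 when either vertex has no neighbours.

```python
from math import sqrt


def cosine_similarity(A, i, j):
    """Common neighbours n_ij divided by sqrt(k_i * k_j), for binary A."""
    n_ij = sum(a * b for a, b in zip(A[i], A[j]))
    k_i, k_j = sum(A[i]), sum(A[j])
    if k_i == 0 or k_j == 0:
        return 0.0  # assumed convention for isolated vertices
    return n_ij / sqrt(k_i * k_j)


# Undirected edges 0-2, 0-3 and 1-2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print(cosine_similarity(A, 0, 1))
```

Vertices 0 and 1 share one neighbour (vertex 2) and have degrees 2 and 1, giving a similarity of 1 / sqrt(2).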
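4.5.1.2 Pearson Correlation Coefficient
The correlation coefficient between rows i and j is defined as
\[ r_{ij} = \frac{\mathrm{Cov}(X_i, X_j)}{\sqrt{\mathrm{Var}(X_i)\,\mathrm{Var}(X_j)}} = \frac{\sum_k A_{ik} A_{jk} - \frac{k_i k_j}{n}}{\sqrt{k_i - \frac{k_i^2}{n}} \sqrt{k_j - \frac{k_j^2}{n}}} \]
where n is the number of vertices in the network.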
4.5.1.3 Euclidean Distance
The Euclidean distance is defined as
\[ d_{ij} = \sum_k (A_{ik} - A_{jk})^2 \]
For a binary graph, this is nothing but the Hamming distance. Moreover, to get the
required similarity value, we need to normalize $d_{ij}$ by the maximum possible distance
between the nodes. This maximum is attained when none of i's neighbors (of which there are $k_i$)
match j's neighbors (of which there are $k_j$), giving a distance of $k_i + k_j$. Therefore, the
normalized distance is
\[ \frac{d_{ij}}{k_i + k_j} \]
and the corresponding similarity value is
\[ \text{similarity} = 1 - \frac{d_{ij}}{k_i + k_j} \]
4.5.2 Automorphic Equivalence
Automorphic equivalence is not as demanding a definition of similarity as structural equivalence,
but is more demanding than regular equivalence. There is a hierarchy of the three
equivalence concepts: any set of structural equivalences are also automorphic and regular
equivalences. Any set of automorphic equivalences are also regular equivalences. Not all
regular equivalences are necessarily automorphic or structural; and not all automorphic
equivalences are necessarily structural.
Formally, two vertices u and v of a labeled graph G are automorphically equiva-
lent if all the vertices can be re-labeled to form an isomorphic graph with the labels of
u and v interchanged. Two automorphically equivalent vertices share exactly the same
label-independent properties.
More intuitively, actors are automorphically equivalent if we can permute the graph in
such a way that exchanging the two actors has no effect on the distances among all actors
in the graph. If we want to assess whether two actors are automorphically equivalent,
we first imagine exchanging their positions in the network. Then, we look and see if, by
changing some other actors as well, we can create a graph in which all of the actors are
the same distance that they were from one another in the original graph.
Figure 4.7: Different automorphic equivalence classes (color-coded)
4.5.3 Regular Equivalence
Regular equivalence is the least restrictive of the three most commonly used definitions
of equivalence. It is, however, probably the most important for the sociologist. This is
because the concept of regular equivalence, and the methods used to identify and describe
regular equivalence sets correspond quite closely to the sociological concept of a role.
Formally, two actors are regularly equivalent if they are equally related to equivalent
others. That is, regular equivalence sets are composed of actors who have similar
relations to members of other regular equivalence sets. The concept does not refer to ties
to specific other actors, or to presence in similar sub-graphs; actors are regularly equivalent
if they have similar ties to any members of other sets.
The concept is actually easier to grasp intuitively than formally. Susan is the daughter
of Inga. Deborah is the daughter of Sally. Susan and Deborah form a regular equivalence
set because each has a tie to a member of the other set. Inga and Sally form a set because
each has a tie to a member of the other set. In regular equivalence, we don't care
which daughter goes with which mother; what is identified by regular equivalence is the
presence of two sets (which we might label "mothers" and "daughters"), each defined by
its relation to the other set. Mothers are mothers because they have daughters; daughters
are daughters because they have mothers.
Figure 4.8: Different regular equivalence classes
4.5.3.1 Computing Regular Equivalence
To compute a measure of regular equivalence, we capture the following notion - vertices
i and j are similar if i has a neighbor k that is itself similar to j. Mathematically,
\[ \sigma_{ij} = \alpha \sum_k A_{ik} \sigma_{kj} + \delta_{ij} \]
This can be represented in matrix form as
\[ \sigma = \alpha A \sigma + I \]
\[ \sigma = (I - \alpha A)^{-1} \]
4.6 Ego-centric Networks
Social scientists often talk about people's egocentric networks - the cloud of friends and
acquaintances that one has, which if diagrammed would have one at the center with edges
connecting him/her to other people in his/her life. Such analyses, however, are most commonly
used in the fields of psychology or social psychology, ethnographic kinship analysis
or other genealogical studies of relationships between individuals.
To better understand the idea, we need some definitions.
The Ego is an individual focal node. A network has as many egos as it has nodes.
Egos can be persons, groups, organizations, or whole societies.
The nodes to whom the ego is connected are called the Alters.
The Neighborhood is the collection of the ego and all nodes to whom the ego has
a connection at some path length. In social network analysis, the neighborhood is almost
always one-step; that is, it includes only the ego and its alters. The neighborhood also
includes all of the ties among all of the actors to whom ego has a direct connection. The
boundaries of ego networks are defined in terms of neighborhoods.
The N-step neighborhood expands the definition of the size of the ego's neighborhood
by including all nodes to whom ego has a connection at a path length of N, and all
the connections among all of these actors. Neighborhoods of greater path length than 1
(i.e. ego's adjacent nodes) are rarely used in social network analysis. When we use the
term neighborhood here, we mean the one-step neighborhood.
Chapter 5
Community Structures
A network is said to have community structures if the nodes of the network can be
easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is
densely connected internally, while interconnections between these sets are sparse. In this
chapter we study different techniques for the identification of community structures, also
known as clustering. Identification of such structures finds many practical applications,
such as the construction of recommender systems, polarity detection, etc. An important
point to note is that the definition of a community is subjective and largely depends on
the application for which clustering is being performed.
Based on the approach employed, clustering techniques can be broadly categorized into
three computational methods: agglomerative, divisive and spectral.
Agglomerative techniques make use of a bottom-up approach for clustering. Starting
with an empty graph G with N nodes and no edges, edges are iteratively added to
the graph, while maximizing some quantity in the original network.
Divisive techniques make use of a top-down approach, removing certain edges from the
original network so that separate community structures are obtained.
Spectral techniques split the graph into community structures based on eigenvalues /
eigenvectors of the Graph Laplacian.
5.1 Similarity Measures
A crucial step in any algorithm to identify community structures is to select suitable
metrics to measure similarity or dissimilarity between nodes. The goal remains to
group similar data together, which would constitute a community. However, there is no
single method that works equally well in all applications; it depends on what we want
to find or emphasize in the data. Therefore, the correct choice of a similarity measure is
often more important than the clustering algorithm. As discussed in previous chapters,
similarity can be measured using Cosine Similarity, Jaccard's Coefficient, etc.
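As a reminder of how such measures behave on a graph, here is a minimal sketch that compares two nodes through their neighbour sets. This neighbour-set formulation is an assumption made for the example; the definitions given in the earlier chapters may differ in detail. The toy adjacency dictionary is hypothetical.

```python
import math


def jaccard(adj, u, v):
    """Jaccard's coefficient of the neighbour sets of u and v."""
    a, b = adj[u], adj[v]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(adj, u, v):
    """Cosine similarity of the 0/1 adjacency rows of u and v."""
    a, b = adj[u], adj[v]
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical toy graph: node -> set of neighbours.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
j12 = jaccard(adj, 1, 2)   # one shared neighbour out of three distinct ones
c12 = cosine(adj, 1, 2)    # one shared neighbour, both nodes of degree 2
```

The two measures can rank pairs differently, which is one reason why the choice of measure matters as much as the clustering algorithm.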
5.2 Agglomerative Methods
In this section, we describe some agglomerative approaches towards clustering.
5.2.1 Hierarchical Clustering
The approach is as follows:
1. Start with every data point in a separate cluster
2. Merge the most similar pairs of data points / clusters together, until only a single
cluster remains
The output of the above method is a binary tree, called a dendrogram. The root of
this tree is the final cluster, and each original data item is a leaf. Initially, the tree has
no internal nodes and contains only the original data items as leaves. Whenever data items / clusters
are merged together, a node is added to the tree (representing this new cluster) with edges
between this new node and its constituent clusters.
As already mentioned, we could have used any of the previously defined measures of
similarity to estimate the distance between data items. However, we need to define a
linkage method that can estimate the distance between clusters. Since a data item can
be thought of as a cluster with a single node, this linkage method will suffice for data
items as well. Here we enumerate the different types of linkages that might be followed
while merging any two clusters:
Single Linkage: The minimum of all pairwise distances between points in the two
clusters
Complete Linkage: The maximum of all pairwise distances between points in the
two clusters
Average Linkage: The average of all pairwise distances between points in the two
clusters
Despite its simplicity, this approach does not scale to large graphs, owing to its $O(n^3)$
time complexity in the worst case. Also, the method is not flexible; steps once taken
cannot be undone. Another problem this approach suffers from is that arbitrary cut-offs
need to be set to arrive at a community structure.
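The procedure and the three linkage rules can be sketched as follows. This is a deliberately naive illustration, not an efficient implementation: it recomputes all cluster-to-cluster distances at every step. The function name `agglomerate`, the `LINKAGES` table and the one-dimensional toy data are assumptions made for the example.

```python
from itertools import combinations

# Each linkage reduces the list of pairwise distances to a single number.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerate(points, dist, linkage="single"):
    """Return the merge history [(cluster_a, cluster_b, distance), ...]."""
    combine = LINKAGES[linkage]
    clusters = [frozenset([p]) for p in points]   # every item starts alone
    merges = []
    while len(clusters) > 1:
        def cluster_dist(pair):
            a, b = pair
            return combine([dist(x, y) for x in a for y in b])

        # Merge the closest pair of clusters under the chosen linkage.
        a, b = min(combinations(clusters, 2), key=cluster_dist)
        merges.append((a, b, cluster_dist((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical 1-D data with absolute difference as the distance.
points = [0, 1, 5, 6, 20]
distance = lambda x, y: abs(x - y)
single_heights = [h for _, _, h in agglomerate(points, distance, "single")]
complete_heights = [h for _, _, h in agglomerate(points, distance, "complete")]
```

The merge history is exactly the information needed to draw the dendrogram: each entry corresponds to one internal node, and the recorded distance is the height at which a cut-off would separate the two merged clusters.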
5.2.2 Local Algorithm based on Agglomeration
This algorithm, due to James P. Bagrow, agglomerates nodes one at a time, and maintains
two groups - a community C and a border B consisting of the set of nodes adjacent to
the community, i.e. each node in B has at least one neighbour in C. At each step, a node
from B is chosen and agglomerated into C, then B is updated to include any newly discovered
nodes. This continues until an appropriate stopping criterion has been satisfied.
Initially, a node is chosen as the source s, so that C = {s} and B contains the neighbours
of s: B = {n(s)}.
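Define the outwardness $\Omega_v(C)$ of a node $v \in B$ from community $C$ as
$$\Omega_v(C) = \frac{(\text{\# of neighbours of } v \text{ outside } C) - (\text{\# of neighbours of } v \text{ inside } C)}{k_v}$$
where $k_v$ is the degree of $v$.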
Now, the algorithm moves the node with the minimum outwardness from B to C,
breaking ties at random. B is now updated, and the procedure is repeated until the
stopping criterion has been satisfied.
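A minimal sketch of this procedure is given below. Two simplifications are assumptions of the example and not part of the algorithm as described: the stopping criterion is taken to be a fixed target size `max_size`, and ties are broken by node label rather than at random so that the result is reproducible. The toy graph (two triangles joined by one edge) is hypothetical.

```python
def outwardness(adj, v, community):
    """(# neighbours of v outside C - # neighbours of v inside C) / k_v."""
    inside = len(adj[v] & community)
    outside = len(adj[v]) - inside
    return (outside - inside) / len(adj[v])


def grow_community(adj, source, max_size):
    """Agglomerate nodes around `source` until `max_size` is reached."""
    community = {source}
    border = set(adj[source])
    while border and len(community) < max_size:
        # Pick the border node with minimum outwardness
        # (ties broken by label here, not randomly).
        v = min(border, key=lambda u: (outwardness(adj, u, community), u))
        border.remove(v)
        community.add(v)
        border |= adj[v] - community   # newly discovered nodes join B
    return community


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
found = grow_community(adj, source=0, max_size=3)
```

On this graph the community grown from node 0 absorbs node 1 before node 2, because node 2 has a neighbour outside the triangle and is therefore more outward.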
5.3 Divisive Methods
5.3.1 Girvan-Newman Algorithm
We studied the betweenness of a vertex previously, as a measure of centrality and influence
of nodes in networks. The Girvan-Newman algorithm extends this definition to
the case of edges, defining the edge-betweenness of an edge as the number of shortest
paths between pairs of nodes that run along it. If there is more than one shortest path
between a pair of nodes, each path is assigned equal weight, such that the total weight
of all the paths is equal to unity. If a network contains communities or groups that are
only loosely connected by a few intergroup edges, then all shortest paths between different
communities must go along one of these few edges. Thus, at least one of the edges
connecting communities will have high edge-betweenness. By removing these edges,
the groups are separated from one another and so the underlying community structure of
the network is revealed.
The algorithm's steps for community detection are summarized below:
1. The betweenness of all existing edges in the network is calculated first.
2. The edge with the highest betweenness is removed.
3. The betweenness of all edges affected by the removal is recalculated.
4. Steps 2 and 3 are repeated until no edges remain.
The end result of the Girvan-Newman algorithm is a dendrogram. As the Girvan-Newman
algorithm runs, the dendrogram is produced from the top down.
The crux of this method lies in the computation of the shortest paths. If we use simple
BFS traversal for this computation, then this can be done in O(m) time for each source
node, totalling O(mn) time for all the nodes, where m is the number of edges in the
graph. In the worst case, O(m) edges are removed; therefore, the total complexity of the
algorithm is $O(m^2 n)$, which is equivalent to $O(n^3)$ for sparse graphs, and $O(n^5)$ for dense
graphs.
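The following sketch puts the steps together for a simple undirected, unweighted graph without self-loops. It computes edge-betweenness with one BFS per source node and, for simplicity, recomputes the betweenness of all edges after every removal instead of only the affected ones. The function names and the toy graph are assumptions made for illustration.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge-betweenness from one BFS per source node."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[u] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Walk back from the farthest nodes, splitting path weight equally.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((u, w))] += share
                delta[u] += share
    # Every unordered pair of nodes was counted from both of its endpoints.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for w in adj[u] - comp:
                comp.add(w)
                stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman(adj):
    """Yield the list of components each time a removal splits the graph."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    n_comps = len(components(adj))
    while any(adj.values()):
        bet = edge_betweenness(adj)               # recomputed from scratch
        u, v = max(bet, key=bet.get)              # edge with highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > n_comps:
            n_comps = len(comps)
            yield comps


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
first_split = next(girvan_newman(adj))
```

Each yielded partition corresponds to one level of the dendrogram, read from the top down. In the toy graph the edge 2-3 carries all nine shortest paths between the two triangles, so it is removed first.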
5.3.2 Radicchi's Algorithm
This algorithm is a divisive algorithm that is based on the notion that the number of
triangles formed within communities is much higher than the number of triangles across
communities. The algorithm tries to find the edge clustering coefficient of each edge;
we remove the edge with the smallest value of the coefficient from the network. This
coefficient is a measure of the number of triangles a particular edge ij is a part of, and is
defined as:
$$C_{ij} = \frac{Z_{ij} + 1}{\min(k_i - 1,\ k_j - 1)}$$
where $Z_{ij}$ is the number of triangles ij is a part of. Note that the denominator of the
expression denotes the maximum number of triangles of which ij could possibly be a part;
1 is added to the numerator to eliminate the possibility that $C_{ij} = 0$.
This algorithm runs in time $O(m^2)$, as each iteration of the algorithm requires O(m)
computations, and there can be O(m) such iterations.
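A minimal sketch of the coefficient and of the selection of the edge to remove is given below. The handling of edges with an endpoint of degree 1, where the denominator is zero, is an assumption of this example: such edges are given an infinite coefficient so that they are never chosen for removal. The function names and the toy graph are hypothetical.

```python
def edge_clustering(adj, i, j):
    """Edge clustering coefficient (Z_ij + 1) / min(k_i - 1, k_j - 1)."""
    triangles = len(adj[i] & adj[j])   # Z_ij: common neighbours of i and j
    possible = min(len(adj[i]) - 1, len(adj[j]) - 1)
    if possible == 0:
        return float("inf")            # assumption: pendant edges are kept
    return (triangles + 1) / possible


def weakest_edge(adj):
    """Edge with the smallest coefficient, i.e. the next one to remove."""
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    return min(edges, key=lambda e: edge_clustering(adj, *e))


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge = weakest_edge(adj)
```

The edge 2-3 lies in no triangle and obtains the coefficient 1/2, whereas every edge inside a triangle obtains 2, so the inter-community edge is the first to go.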
5.4 Modularity Optimization
Modularity is a metric that measures the strength of division of a network into communities.
Networks with high modularity have dense connections between the nodes within
communities, but sparse connections between nodes in different communities. Modularity
is often used in optimization methods for the detection of community structures.
Intuitively, it can be measured as the total number of intra-community edges minus the
expected number of such edges in the absence of a community structure. Formally, it is defined
as
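$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
where m is the total number of edges, $A_{ij}$ is the corresponding entry of the adjacency
matrix, $k_i$ is the degree of node i, $c_i$ is the community to which i is assigned, and
$\delta(c_i, c_j)$ is 1 if $c_i = c_j$, and 0 otherwise.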
Thus, modularity can be used as a stopping criterion in iterative clustering algorithms.
The iterations are performed until modularity reaches a maximum; clustering stops at
the point where modularity starts to degrade.
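The formula for Q can be evaluated directly, as in the following minimal sketch for an undirected, unweighted graph. The double loop mirrors the double sum and is meant for clarity rather than speed; the function name and the toy graph are assumptions made for illustration.

```python
def modularity(adj, community):
    """Q for an undirected, unweighted graph.

    `adj` maps node -> set of neighbours; `community` maps node -> label.
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())   # 2m = sum of degrees
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue                               # delta(c_i, c_j) = 0
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
by_triangle = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
q_split = modularity(adj, by_triangle)
q_single = modularity(adj, {v: "all" for v in adj})
```

Here m = 7. Splitting the graph into its two triangles gives Q = 5/14, while placing every node in one community gives Q = 0, since the observed and expected numbers of internal edges then coincide.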
5.4.1 Newman's Modularity Optimization Algorithm
The above discussion highlights the use of modularity for the evaluation of computed
communities. However, it can itself be used for the purpose of community identification,
as depicted by the following agglomerative algorithm.
1. Initially, all vertices are kept in separate clusters.
2. Join a pair of clusters, such that this results in the greatest increase or smallest
decrease in the modularity, Q (optimizing Q).
3. Repeat step 2 until a single cluster remains; the partition with the highest Q
encountered is the result.
Note that this agglomerative algorithm is much faster than the Girvan-Newman edge-
betweenness algorithm; there are O(m) computations per step, and O(n) steps, so the
net complexity is O(mn).
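A naive sketch of this agglomerative scheme is shown below. For clarity it recomputes Q for every candidate pair of clusters, so it does not achieve the O(mn) bound quoted above; it also considers all pairs of clusters rather than only connected ones. It keeps the partition with the highest Q seen along the way. The function names and the toy graph are assumptions made for illustration.

```python
from itertools import combinations


def modularity(adj, clusters):
    """Q computed cluster by cluster (equivalent to the sum over node pairs)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    q = 0.0
    for c in clusters:
        internal = sum(len(adj[v] & c) for v in c)   # sum of A_ij inside c
        degree = sum(len(adj[v]) for v in c)         # sum of k_i inside c
        q += internal / two_m - (degree / two_m) ** 2
    return q


def greedy_modularity(adj):
    clusters = [frozenset([v]) for v in adj]         # step 1: singletons
    best_q, best = modularity(adj, clusters), list(clusters)
    while len(clusters) > 1:
        def merged(pair):
            a, b = pair
            return [c for c in clusters if c not in (a, b)] + [a | b]

        # Step 2: join the pair giving the greatest increase
        # (or smallest decrease) in Q.
        pair = max(combinations(clusters, 2),
                   key=lambda p: modularity(adj, merged(p)))
        clusters = merged(pair)
        q = modularity(adj, clusters)
        if q > best_q:
            best_q, best = q, list(clusters)
    return best, best_q


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
best, best_q = greedy_modularity(adj)
```

On the toy graph, Q rises with each merge until the two triangles are formed and drops to 0 at the final merge, so the two-triangle partition is returned.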
5.4.2 Modularity Optimization : Blondel et al.
A smarter way of applying modularity optimization, due to Blondel et al., is given below.
Each pass of the algorithm comprises two phases: one where modularity is optimized by
allowing only local changes of communities; one where communities found are aggregated
in order to build a new network of communities. The passes are repeated iteratively until
no increase of modularity is possible.
Phase 1
1. All nodes are kept in separate clusters.
2. For each node i, consider all its neighbours j.
3. Check whether placing i in j's current community increases Q (the gain $\Delta Q$ should always
be positive).
4. Place i in the community for which $\Delta Q$ is maximum, breaking ties randomly.
5. Continue until $\Delta Q$ is 0, i.e. until no move increases Q.
Phase 2
1. Collapse all communities obtained from phase 1 to single nodes.
2. Multiple edges between the newly obtained collapsed communities are replaced by
a single edge of weight equal to the sum of the weights of the edges connecting them
previously.
An obvious question arises regarding the effect of the ordering of vertices on the performance
of the algorithm; however, the choice of order does not affect the algorithm much.
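The two phases of a single pass can be sketched as follows for an unweighted input graph. Several simplifications are assumptions of this example: the gain is obtained by recomputing Q rather than by an incremental update, ties are broken by the first candidate found rather than randomly, node labels are assumed to be sortable, and only one pass is performed (a further pass would need the same local moving step on the weighted graph). Edges inside a community are kept as self-loops in the collapsed graph, following the description by Blondel et al. All function names and the toy graph are hypothetical.

```python
def modularity(adj, label):
    two_m = sum(len(nbrs) for nbrs in adj.values())
    internal, degree = {}, {}
    for v, nbrs in adj.items():
        c = label[v]
        degree[c] = degree.get(c, 0) + len(nbrs)
        internal[c] = internal.get(c, 0) + sum(1 for w in nbrs if label[w] == c)
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


def local_moving(adj):
    """Phase 1: move nodes between neighbouring communities while Q grows."""
    label = {v: v for v in adj}            # step 1: separate clusters
    improved = True
    while improved:                        # step 5: stop when no gain is left
        improved = False
        for i in adj:                      # step 2
            original = label[i]
            best_label, best_q = original, modularity(adj, label)
            for c in sorted({label[j] for j in adj[i]}):   # step 3
                label[i] = c
                q = modularity(adj, label)
                if q > best_q + 1e-12:     # strictly positive gain only
                    best_label, best_q = c, q
            label[i] = best_label          # step 4
            if best_label != original:
                improved = True
    return label


def aggregate(adj, label):
    """Phase 2: collapse communities; edge weights are summed."""
    weights = {}
    for u in adj:
        for v in adj[u]:
            if u < v:                      # visit each undirected edge once
                key = tuple(sorted((label[u], label[v])))
                weights[key] = weights.get(key, 0) + 1
    return weights                         # (c, c) entries are self-loops


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
label = local_moving(adj)
collapsed = aggregate(adj, label)
```

On the toy graph the first phase settles on the two triangles, and the second phase produces two nodes joined by a single edge of weight 1, each carrying a self-loop of weight 3.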
5.5 Infomap
5.6 Spectral Bisection Methods