1. GNN Basics, Centrality, PageRank, Community Detection

Graph Definition¶

Basic Graph¶

A basic graph is a tuple $G = (V, E)$ , where:

$V$ is a set of nodes (or vertices).
$E$ is a set of edges, such that $E \subseteq V \times V$ .

Edges can be recorded as an adjacency matrix of $|V|$ rows and columns, with a flag at the intersections of respective rows and columns indicating an edge between the two vertices. If the matrix is symmetric, the graph is undirected; otherwise, it is directed.

Graph Classification¶

Attributed Graph¶

Nodes or/and edges have an assigned vector of additional information, such as age, name, address, telephone number.

Homogeneous vs. Heterogeneous Graphs¶

Homogeneous Graphs: Nodes and edges have the same types and the same set of attributes or properties.
Heterogeneous Graphs: Nodes and edges have different types and attributes. The types for nodes and edges play important roles in heterogeneous graphs and should be further considered.
- For example, nodes in a computer network may represent users, computers, printers, routers, etc.

Static vs. Dynamic Graphs¶

Static Graphs: The input features or the topology of the graph do not vary with time.
Dynamic Graphs: The input features or the topology of the graph vary with time. The time information should be carefully considered in dynamic graphs.
- For instance, analyzing molecule interactions versus the inflow of financial transactions.

Machine Learning Tasks on Graphs¶

Link Prediction¶

Link prediction is the problem of predicting the existence of a link between two vertices in a graph (two entities in a network).

Example: determining the probability that there is an interaction between two drugs.

Node Classification¶

Node classification is the problem of assigning a node to a category.

Example: Identifying the role of a computer in a network.

Graph Classification¶

Graph classification is the problem of assigning a graph to a category.

Example: determining whether a molecule is lethal for Staphylococcus aureus.

Centrality Metrics¶

Definition¶

Centrality assigns a rank or score to a vertex in a graph, indicating how important the vertex is within the graph.

Usage¶

Centrality measures are used to identify key nodes in a network, such as influential people in social networks or critical nodes in infrastructure networks.

Types of Centralities¶

Betweenness Centrality:
- Is based on the number of shortest paths passing through a node.
- Formula: $g(v) = \sum_{v,s,t} \frac{|\sigma_{s,t}(v)|}{|\sigma_{*,*}|}$ ,
  - Where $v, s, t \in V$ and $v \neq s \neq t$ , $\sigma_{s,t}(v)$ is the number of shortest paths passing through vertex $v$
  - Where $\sigma_{*,*}$ is the number of shortest paths between all nodes in the graph.
- In a connected graph, the number of shortest paths is $(N-1)(N-2)$
Degree Centrality:
- Is derived from the number of edges to a particular node.
- Formula: $C(n) = \sum_{j=1}^{N} a_{ij}$ ,
  - Where $a_{ij}$ represents the adjacency matrix entries.
Closeness Centrality:
- Is derived from the length of the shortest paths starting at a given node.
- Formula: $C(n) = \frac{1}{\sum_{v \in V} d(n,v)}$
  - Where $d(n,v)$ is the distance between nodes $n$ and $v$ .
Flow Centrality:
- Is derived from the flow passing through the node.
- Formula: $C(n) = \sum_{s,t \in V; s \neq t} \text{max flow}(s, t)$ .

PageRank¶

Definition¶

PageRank assigns a likelihood that in a random walk, the walker hits a given vertex.

Usage¶

Initially used in Google's web search algorithm to rank web pages.

Algorithm to Calculate PageRank¶

Initialize the Graph
- Represent the web as a directed graph $G$ , where each node $n$ represents a web page and each directed edge $(u, v)$ represents a hyperlink from page $u$ to page $v$ .
Initialize PageRank Values
- Assign an initial PageRank value to each node $PR(n)$ . Typically, this is set to $\frac{1}{N}$ for each node, where $N$ is the total number of nodes in the graph. This ensures that the initial probability distribution is uniform across all pages.
Calculate Outbound Links
- For each node $n$ , count the number of outbound links $L(n)$ it has. This step is crucial for distributing the PageRank values of each node to its linked nodes.
Iterate PageRank Calculation
- Update the PageRank value for each node $n$ using the formula that combines the probability of randomly visiting a vertex (damping factor) and the contribution from inbound links.
- Vertex Visiting Probability:
  - Concept: Reflects the probability of moving from one node to another in a random walk.
  - Formula:
    - Where $M(n)$ is the set of nodes that link to $n$ .
    - $PR(u)$ is the PageRank of node $u$ .
    - $L(u)$ is the number of outbound links from node $u$ .
- Damping Factor:
  - Concept: Accounts for the probability that a user will continue clicking on links.
  - Formula:
    - Where $d$ is the damping factor.
    - $\frac{1 - d}{N}$ represents the probability of randomly jumping to any page in the graph.
- Combining Both:
  - Iterative Update Formula: $PR(n, t+1) = \frac{1-d}{N} + d \sum_{u \in M(n)} \frac{PR(u, t)}{L(u)}$
  - Start with an initial probability distribution where each vertex has an equal probability: $PR(n, 0) = \frac{1}{N}$
- Check for Convergence
- Repeat the iterative update until the PageRank values converge, meaning the values change by less than a small threshold $\epsilon$ between iterations. This usually indicates that the algorithm has reached a steady state.

Community Detection¶

Definition¶

Community detection identifies groups of nodes within which connections are dense but between which connections are sparse.

Usage¶

Community detection is used to find clusters or communities in networks, such as social networks or biological networks.

Louvian Algorithm for community detection¶

Initialize the Graph
- Represent the network as a graph $G$ , where each node $n$ represents an entity (e.g., a person, a molecule) and each edge $(u, v)$ represents a connection between entities (e.g., a friendship, a chemical bond).
Initialize Community Assignments
- Assign each node to its own community. Initially, each node $n$ is its own community $C(n) = \{n\}$ .
Modularity Calculation
1. Modularity Concept:
  - Concept: Measures the strength of the division of a network into clusters. High modularity indicates dense connections within clusters and sparse connections between clusters.
  - Formula:
    - Where $A_{ij}$ is the adjacency matrix.
    - $k_i$ is the sum of the weights of the edges attached to node $i$ .
    - $m$ is the total weight of all the edges in the graph.
    - $c_i$ and $c_j$ are the communities of nodes $i$ and $j$ .
    - $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise.
Iterate Community Detection
- Modularization: Find communities by optimizing modularity.
- Grouping: Aggregate nodes from the same community into super-nodes and repeat the modularization process.
- Modularization:
  - Step-by-Step Process:
    1. Assign each node $i$ to its own community $c_i$ .
    2. For each node $i$ , consider removing it from its current community and placing it in the community of each neighbor $j$ , one at a time.
    3. Calculate the change in modularity $\Delta Q$ for each possible move and select the move that results in the highest increase in modularity.
    4. Repeat this process until no single move can increase modularity.
- Grouping:
  - Step-by-Step Process:
    1. Once modularization is complete, each community is treated as a single node (super-node).
    2. The graph is then reduced to a new graph of these super-nodes.
    3. The process of modularization is applied again to this new graph.
Check for Convergence
- The iterative process continues until the modularity no longer increases significantly between iterations, indicating that the community structure has stabilized.