Graph Neural Networks: a learning journey since 2008 — Deep Walk


The third part of our journey into graph neural networks. Today, we cover the theoretical background of Perozzi’s 2014 paper DeepWalk: Online Learning of Social Representations.

Image by Wladislaw Sokolowskij on Unsplash

In the very first post of this series, we learned how the Graph Neural Network (GNN) model works. We saw that a GNN returns node-based and graph-based predictions and is backed by a solid mathematical foundation. In particular, the transition and output functions satisfy Banach’s fixed-point theorem. However, despite the successful GNN applications, there are some hurdles, as explained in [1]. The main idea of the GNN model is to build state transition and output functions, f𝓌 and g𝓌, and iterate until these functions converge within a threshold. This is a strong constraint that may limit the extensibility and representational ability of the model. Secondly, GNN cannot exploit representation learning, namely representing a graph with low-dimensional feature vectors. Third, GNN is based on an iterative learning procedure in which labels and features are mixed. This mix can lead to cascading errors, as shown in [6].

To solve these problems, DeepWalk [2] emerged in 2014 as a possible enhancement of GNN. DeepWalk is the first graph model to implement an embedding method, and it is based on representation learning and language modelling, in particular the SkipGram model [3].

DeepWalk model


DeepWalk learns social representations of graph vertices by modelling a stream of short random walks. Social representations are hidden connections across vertices of a graph, which can be decoded as latent features. Thus, DeepWalk reads the input graph and outputs a latent representation of it.

How do we get to a latent representation of the graph? Well, here lies the main difference from Scarselli’s approach. Scarselli’s GNN is a supervised approach, while DeepWalk is an unsupervised method: it has to learn the graph structure from graph features alone, independently of node or graph labels. This approach has been shown to be more efficient than other social-dimension classifiers [4, 5], increasing classification performance by 5–10% in Micro-F1 score and outperforming “competitors” even when given 60% less training data. Finally, the DeepWalk framework scales easily, coping with web-scale graphs such as those of Facebook, Twitter or YouTube.

Mathematical insights I: definitions

DeepWalk returns insightful latent representations of an input graph based on nodes’ social attributes. The graph is defined as G = (V, E), made of a set of vertices V and a set of edges E. Vertices may be partially labelled with labels Y, and they are described by attributes/features X. Features live in the real domain with dimension S, while labels are real-valued vectors.

Given this input real domain, DeepWalk translates the nodes’ features X into a latent domain of dimension d. This mapping describes the social phenomena with a smaller set of features, which can help in better understanding the relationships in the underlying data and gives an immediate visual answer when fed into PCA or t-SNE algorithms.

To understand the theory and the process underneath DeepWalk, before dealing with the practical implementation, I provide here an example with the Karate Club graph. Fig. 1 shows the input graph G, which has 34 vertices and 78 edges.

Fig.1 Input Karate club graph

The input feature matrix X depicts relationships between nodes, and we want to use DeepWalk to find a latent representation of the graph that reveals the best division of the club into groups. For example, the latent representation dimension can be 10; DeepWalk will then produce 10 new features with which to subdivide the graph into groups.

Mathematical insights II: generation of random walks

To achieve the latent representation, DeepWalk starts by creating γ (e.g. 100) random walks from the input graph. A random walk Wvᵢ starts at a vertex vᵢ and moves for k (e.g. 10) steps, at each step hopping to a randomly chosen neighbour of the current vertex (for example, see fig. 2).

Fig.2: Examples of truncated random walks for vertices 22 and 31. Random walks allow us to explore multiple graph areas easily and simultaneously

The use of random walks allows the algorithm to extract information from a network while guaranteeing, on the one hand, easy parallelisation of the computation and, on the other, a dynamic way of exploring the graph, which can encapsulate new information once the graph has changed. This latter aspect allows an iterative update of the model, training on a single graph subset without having to recompute over the whole graph.
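The walk-generation step can be sketched in a few lines of plain Python. This is a minimal sketch, not the paper’s implementation: the tiny 5-node adjacency list, the function names and the parameter values are all illustrative choices of mine.

```python
import random

# Toy undirected graph as an adjacency list (a stand-in for the
# 34-node Karate Club graph used in the article).
graph = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2, 4],
    4: [3],
}

def random_walk(graph, start, k, rng):
    """Truncated random walk of k steps starting at `start`:
    at every step, hop to a random neighbour of the current vertex."""
    walk = [start]
    for _ in range(k):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

def build_corpus(graph, gamma, k, seed=42):
    """Generate gamma walks per vertex, visiting vertices in random order."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(gamma):
        vertices = list(graph)
        rng.shuffle(vertices)  # each pass starts one walk from every vertex
        for v in vertices:
            corpus.append(random_walk(graph, v, k, rng))
    return corpus

corpus = build_corpus(graph, gamma=3, k=5)
print(len(corpus))     # 3 walks per vertex -> 15 walks
print(len(corpus[0]))  # start vertex + 5 steps -> 6 vertices
```

Because each walk is independent of the others, the outer loop parallelises trivially, which is exactly the property highlighted above.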

Mathematical Insights III: Language modelling

Once γ (e.g. 100) “truncated” random walks have been collected, it is possible to exploit language modelling to extract the latent description of the input social representation. In particular, the collection of random walks can be seen as an encoded collection of sentences (fig. 3 shows 10 examples of random walks) made up of k words w: Wvᵢ = (w₀, w₁, …, wₖ). From here, DeepWalk computes the word-embedding, or rather node-embedding, representation of the whole input corpus G. This is a remarkable point, because it is the first time in the graph world that someone writes about the concept of node embeddings.

Fig.3: 10 random walks starting from vertices 22, 13, 31, 16, 10, 5, 17, 16, 24 and 18.

The word embedding can be described by a function Φ whose values live in the real-number domain. Φ is a matrix whose dimensions are the number of vertices |V| and the latent dimension d (in this case 34 × 10). The goal is to retrieve the likelihood of observing a vertex vᵢ given the embeddings of all the previously visited vertices:

Eq.1: Likelihood Pr(vᵢ | Φ(v₁), Φ(v₂), …, Φ(vᵢ₋₁)) for a vertex vᵢ to be observed given all the previous vertex embeddings

However, this calculation depends on the graph size and the length of the random walks: as they grow, eq. 1 becomes computationally inefficient. A solution is proposed in [7, 8], where the problem is reversed. Rather than predicting the occurrence of a missing word in a sentence, we compute the likelihood of a word appearing as a neighbour of a given one, as described in fig. 4.

Fig.4: Intuitive example of the SkipGram approach. Given a sentence, we want to predict the likelihood of a word being a neighbour of a given one (e.g. “walk”).

Such an approach is called SkipGram [7–11], and it aims to solve the following optimization problem:

Eq.2: Optimization problem for SkipGram. The aim is to find the mapping Φ that minimizes the negative log-likelihood −log Pr({v_{i−w}, …, v_{i+w}} \ vᵢ | Φ(vᵢ)) of the context vertices given the embedding of vᵢ

From equation 2 it is possible to describe vertices in the d-dimensional latent representation so that similar vertices have similar representations.
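The context set {v_{i−w}, …, v_{i+w}} in eq. 2 is obtained by sliding a window of size w over each walk and pairing the central vertex with its neighbours. A minimal sketch (the function name and the example walk are my own):

```python
def skipgram_pairs(walk, w):
    """(target, context) training pairs: for each position i in the walk,
    pair walk[i] with every vertex at most w steps away."""
    pairs = []
    for i, target in enumerate(walk):
        lo, hi = max(0, i - w), min(len(walk), i + w + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, walk[j]))
    return pairs

walk = [22, 20, 2, 8, 30, 33]  # one truncated random walk over vertex ids
pairs = skipgram_pairs(walk, w=2)
print(pairs[:4])  # [(22, 20), (22, 2), (20, 22), (20, 2)]
```

These pairs are exactly what the 2-layer network listed below consumes: the target vertex is the input and the context vertex is the label to predict.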

Mathematical Insights IV: Hierarchical Softmax

Looking closely at the SkipGram algorithm we can enlist the following “ingredients”:

  • a window of size w to slide across the input sequences
  • the word embedding latent dimension d
  • a 2-layer neural network, which receives the input corpus
  • an activation layer to compute the probability distribution of words

For each vertex vᵢ in a given random walk, we have a representation vector Φ(vᵢ). Given this vertex representation, we would like to maximize the probability of its neighbours in the random walk (eq. 2). Such a posterior distribution can be learned through several algorithms. Computing it naively requires an effort proportional to the number of vertices in the graph (up to billions in some cases). DeepWalk therefore uses the Hierarchical Softmax approach [12, 13], which alleviates the computational cost and speeds up training by approximating the probability distribution.

Hierarchical Softmax is based on a binary tree, built with Huffman coding, which balances the distribution of points belonging to each side of any tree node. Each graph vertex is assigned to a leaf of the tree. The posterior probability of a vertex is then computed as the product of the probabilities of the binary decisions taken along the path from the root to that leaf, reducing the cost of evaluating eq. 2 from O(|V|) to O(log |V|).
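To see why the tree yields a valid probability distribution over vertices, here is a minimal sketch. The tree, the binary codes and the internal-node scores are invented for illustration; in the actual model, each internal node’s score would be the dot product of Φ(vᵢ) with that node’s learned parameter vector.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probability(code, scores):
    """Probability of reaching a leaf: the product of the sigmoid
    probabilities of the left/right decisions along the root-to-leaf path."""
    p = 1.0
    for bit, s in zip(code, scores):
        p *= sigmoid(s) if bit == 0 else 1.0 - sigmoid(s)
    return p

# Balanced binary tree over 4 vertices; codes are left(0)/right(1) turns.
codes = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
# Illustrative scores for the 3 internal nodes (root, its left/right child).
node_scores = {"root": 0.3, "L": -1.2, "R": 0.7}

def path_scores(code):
    # Internal nodes visited before each turn: root first, then L or R.
    return [node_scores["root"], node_scores["L" if code[0] == 0 else "R"]]

probs = {v: leaf_probability(c, path_scores(c)) for v, c in codes.items()}
total = sum(probs.values())
print(round(total, 10))  # 1.0 -- the leaf probabilities form a distribution
```

Only the ⌈log₂|V|⌉ nodes on one path need to be evaluated per prediction, instead of all |V| output scores of a flat softmax.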

Final remarks: visualise the embeddings

What will we have at the end? Once DeepWalk has run, the SkipGram/Word2Vec model returns the latent-space description of our input graph, namely a graph-embedding file:

Fig.5: example of final embeddings from the Karate Club graph. The numbers 34 and 10 are the number of nodes of the graph and the latent representation dimension. For each vertex (first number on each line, e.g. 34, 1, 33, 3, 2) we can find the associated latent-representation coordinates

For each vertex, the file reports the model’s embeddings, namely the position of the vertex in the latent representation space. How can we use this representation? Well, we can apply at least two dimensionality-reduction algorithms to obtain a “human” view of such a space: the first is the t-SNE approach [14] and the second is principal component analysis (PCA). Fig. 6 shows the PCA decomposition of the output embeddings for the Karate Club. As you can see, DeepWalk retrieves at least 3 main groups into which the Karate Club can be subdivided, based on the input features/relationships across nodes. This allows people to analyse huge graphs, e.g. Facebook pages, to understand social relationships across pages without the headaches of manual representation and labelling.

Fig.6: PCA 2 dimensional representation of the Karate Club DeepWalk embeddings
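Reading such an embedding file back is straightforward: the first line gives the vertex count and the latent dimension, and each following line gives a vertex id and its coordinates (the format shown in Fig. 5). A small sketch with made-up values (the three rows below are illustrative, not real DeepWalk output):

```python
# A tiny embedding file in the text format shown in Fig.5:
# header "<n_vertices> <dim>", then one line per vertex.
raw = """3 4
34 0.12 -0.40 0.33 0.05
1 0.90 0.21 -0.11 0.47
33 -0.35 0.08 0.66 -0.29
"""

def parse_embeddings(text):
    """Map each vertex id to its list of latent coordinates."""
    lines = text.strip().splitlines()
    n, d = map(int, lines[0].split())
    emb = {}
    for line in lines[1:]:
        parts = line.split()
        emb[int(parts[0])] = [float(x) for x in parts[1:]]
    assert len(emb) == n and all(len(v) == d for v in emb.values())
    return emb

emb = parse_embeddings(raw)
print(sorted(emb))   # [1, 33, 34]
print(len(emb[34]))  # 4 coordinates per vertex
```

The resulting id-to-vector mapping is what gets handed to PCA or t-SNE to produce a plot like Fig. 6.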

Stay tuned for our next post: two practical applications of DeepWalk, to fully grasp the Python implementation 🙂

