{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab PyG\n", "\n", "### Andrea Passerini, Antonio Longa \n", "### andrea.passerini@unitn.it, antonio.longa@unitn.it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pytorch Geometric (PyG) \n", "\n", "Pytorch Geometric (PyG) is a geometric deep learning extension library for PyTorch. It consists of various methods for deep learning on graphs and other irregular structures. It implements plenty of graph neural networks from the literature and allows to easily prototype new ones.\n", "\n", "Adapted from tutorials and notebooks from https://github.com/rusty1s/pytorch_geometric" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Message Passing Networks\n", "\n", "\n", "![title](img/img1.png)\n", "\n", "\n", "Graph neural networks can be defined in terms of a *neighborhood aggregation* or *message passing* scheme.\n", "With $\\mathbf{x}^{(k-1)}_i \\in \\mathbb{R}^F$ denoting node features of node $i$ in layer $(k-1)$ and $\\mathbf{e}_{j,i} \\in \\mathbb{R}^D$ denoting (optional) edge features from node $j$ to node $i$, message passing graph neural networks can be described as\n", "\n", "\n", "\n", "$$\n", " \\mathbf{x}_i^{(k)} = \\gamma^{(k)} \\left( \\mathbf{x}_i^{(k-1)}, \\square_{j \\in \\mathcal{N}(i)} \\, \\phi^{(k)}\\left(\\mathbf{x}_i^{(k-1)}, \\mathbf{x}_j^{(k-1)},\\mathbf{e}_{j,i}\\right) \\right),\n", "$$\n", "\n", "where $\\square$ denotes a differentiable, permutation invariant function, *e.g.*, sum, mean or max, and $\\gamma$ and $\\phi$ denote differentiable functions such as MLPs (Multi Layer Perceptrons)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The \"MessagePassing\" Base Class\n", "\n", "PyTorch Geometric provides the `MessagePassing` base class, which helps in creating such kinds of message passing graph neural networks by automatically taking care of message propagation.\n", "The user only has to define the functions $\\phi$ , *i.e.* `message`, and $\\gamma$ , *i.e.* `update`, as well as the aggregation scheme to use, *i.e.* `aggr=\"add\"`, `aggr=\"mean\"` or `aggr=\"max\"`.\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "This is done with the help of the following methods:\n", "\n", "\n", "* `MessagePassing(aggr=\"add\", flow=\"source_to_target\", node_dim=-2)`: Defines the aggregation scheme to use (`\"add\"`, `\"mean\"` or `\"max\"`) and the flow direction of message passing (either `\"source_to_target\"` or `\"target_to_source\"`).\n", " Furthermore, the `node_dim` attribute indicates along which axis to propagate.\n", "* `MessagePassing.propagate(edge_index, ...)`:\n", " The initial call to start propagating messages.\n", " Takes in the edge indices and all additional data which is needed to construct messages and to update node embeddings. \n", "* `MessagePassing.message(...)`: Constructs messages to node $i$ in analogy to $\\phi$ for each edge in $(j,i) \\in \\mathcal{E}$ if `flow=\"source_to_target\"` and $(i,j) \\in \\mathcal{E}$ if `flow=\"target_to_source\"`.\n", "* `MessagePassing.update(aggr_out, ...)`: Updates node embeddings in analogy to $\\gamma$ for each node $i \\in \\mathcal{V}$.\n", " Takes in the output of aggregation as first argument and any argument which was initially passed to `MessagePassing.propagate`.\n", "\n", "Let us verify this by re-implementing two popular GNN variants, the `GCN layer from Kipf and Welling `_ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Edges are represented as a sparse list of edge indices, where each column reports the coordinates of an edge: \n", "\n", "\\begin{equation}\n", "Adj = \n", "\\begin{pmatrix}\n", "1 & 1 & 0\\\\\n", "1 & 1 & 0\\\\\n", "0 & 0 & 1\n", "\\end{pmatrix}\n", "\\end{equation}\n", "\n", "\\begin{equation}\n", "edgeIndex = \n", "\\begin{pmatrix}\n", "0 & 0 & 1 & 1 & 2\\\\\n", "0 & 1 & 0 & 1 & 2\n", "\\end{pmatrix}\n", "\\end{equation}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementing the GCN Layer\n", "\n", "The GCN layer from Kipf and Welling is a popular GNN that was the first \n", "to bring the idea of layerwise convolutions to graphs processing. The layer\n", "is mathematically defined as\n", "\n", "$$\n", " \\mathbf{x}_i^{(k)} = \\sum_{j \\in \\mathcal{N}(i) \\cup \\{ i \\}} \\frac{1}{\\sqrt{\\deg(i)} \\cdot \\sqrt{\\deg(j)}} \\cdot \\left( \\mathbf{\\Theta} \\cdot \\mathbf{x}_j^{(k-1)} \\right),\n", "$$\n", "\n", "where neighboring node features are first transformed by a weight matrix $\\mathbf{\\Theta}$, normalized by their degree, and finally summed up.\n", "This formula can be divided into the following steps:\n", "\n", "1. Add self-loops to the adjacency matrix.\n", "2. Linearly transform node feature matrix.\n", "3. Compute normalization coefficients.\n", "4. Normalize node features in $\\phi$.\n", "5. Sum up neighboring node features (`\"add\"` aggregation).\n", "\n", "Steps 1-3 are typically computed before message passing takes place.\n", "Steps 4-5 can be easily processed using the `MessagePassing` base class.\n", "The full layer implementation is shown below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install required packages.\n", "!pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html\n", "!pip install -q torch-sparse -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html\n", "!pip install -q torch-geometric\n", "\n", "import torch\n", "from torch_geometric.nn import MessagePassing\n", "from torch_geometric.utils import add_self_loops, degree\n", "\n", "class myGCNConv(MessagePassing):\n", " def __init__(self, in_channels, out_channels):\n", " super(myGCNConv, self).__init__(aggr='add') # \"Add\" aggregation (Step 5).\n", " self.lin = torch.nn.Linear(in_channels, out_channels)\n", "\n", " def forward(self, x, edge_index):\n", " # x has shape [N, in_channels]\n", " # edge_index has shape [2, E] (sparse adjacency matrix as a list of edges)\n", "\n", " # Step 1: Add self-loops to the adjacency matrix [edge list].\n", " edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))\n", "\n", " # Step 2: Linearly transform node feature matrix.\n", " x = self.lin(x)\n", "\n", " # Step 3: Compute normalization.\n", " row, col = edge_index # row and column indices of edges\n", " deg = degree(col, x.size(0), dtype=x.dtype) # just count number of repetitions per index \n", " deg_inv_sqrt = deg.pow(-0.5)\n", " deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0\n", " norm = deg_inv_sqrt[row] * deg_inv_sqrt[col] # row index deg_inv_sqrt times col index deg_inv_sqrt\n", "\n", " # Step 4-5: Start propagating messages.\n", " return self.propagate(edge_index, x=x, norm=norm)\n", "\n", " def message(self, x_j, norm):\n", " # x_j has shape [E, out_channels]\n", "\n", " # Step 4: Normalize node features.\n", " return norm.view(-1, 1) * x_j # broadcast normalization coefficient multiplication to all channels" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zF5bw3m9UrMy" }, "outputs": [], "source": [ "# Helper function for visualization.\n", "%matplotlib inline\n", "import torch\n", "import networkx as nx\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "def visualize(h, color, epoch=None, loss=None):\n", " plt.figure(figsize=(7,7))\n", " plt.xticks([])\n", " plt.yticks([])\n", "\n", " if torch.is_tensor(h):\n", " h = h.detach().cpu().numpy()\n", " plt.scatter(h[:, 0], h[:, 1], s=140, c=color, cmap=\"Set2\")\n", " if epoch is not None and loss is not None:\n", " plt.xlabel(f'Epoch: {epoch}, Loss: {loss.item():.4f}', fontsize=16)\n", " else:\n", " nx.draw_networkx(G, pos=nx.spring_layout(G, seed=42), with_labels=False,\n", " node_color=color, cmap=\"Set2\")\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "qoW2Z7P70LNQ" }, "source": [ "# Example: node classification\n", "\n", "Following [Kipf et al. (2017)](https://arxiv.org/abs/1609.02907), let's dive into the world of GNNs by looking at a simple graph-structured example, the well-known [**Zachary's karate club network**](https://en.wikipedia.org/wiki/Zachary%27s_karate_club). This graph describes a social network of 34 members of a karate club and documents links between members who interacted outside the club. Here, we are interested in detecting communities that arise from the member's interaction.\n", "\n", "PyTorch Geometric provides an easy access to this dataset via the [`torch_geometric.datasets`](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets) subpackage:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "H_VTFHd0uFz6", "outputId": "e9472a23-e3c8-4ef2-86c1-64282d80de5d" }, "outputs": [], "source": [ "from torch_geometric.datasets import KarateClub\n", "\n", "dataset = KarateClub()\n", "print(f'Dataset: {dataset}:')\n", "print('======================')\n", "print(f'Number of graphs: {len(dataset)}')\n", "print(f'Number of features: {dataset.num_features}')\n", "print(f'Number of classes: {dataset.num_classes}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(dataset.data)" ] }, { "cell_type": "markdown", "metadata": { "id": "7cjjyFVnpKB0" }, "source": [ "After initializing the [`KarateClub`](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.KarateClub) dataset, we first can inspect some of its properties.\n", "For example, we can see that this dataset holds exactly **one graph**, and that each node in this dataset is assigned a **34-dimensional feature vector** (which uniquely describes the members of the karate club).\n", "Furthermore, the graph holds exactly **4 classes**, which represent the community each node belongs to.\n", "\n", "Let's now look at the underlying graph in more detail:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gUFSrDPxuQ23", "outputId": "98c222c0-d955-4607-8b0e-cb98c012cb79" }, "outputs": [], "source": [ "data = dataset[0] # Get the first graph object.\n", "\n", "print(data)\n", "print('==============================================================')\n", "\n", "# Gather some statistics about the graph.\n", "print(f'Number of nodes: {data.num_nodes}')\n", "print(f'Number of edges: {data.num_edges}')\n", "print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')\n", "print(f'Number of training nodes: {data.train_mask.sum()}')\n", "print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')\n", "print(f'Contains isolated nodes: {data.has_isolated_nodes()}')\n", "print(f'Contains self-loops: {data.has_self_loops()}')\n", "print(f'Is undirected: {data.is_undirected()}')" ] }, { "cell_type": "markdown", "metadata": { "id": "MY4pZma9p3Ax" }, "source": [ "Each graph in PyTorch Geometric is represented by a single [`Data`](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data) object, which holds all the information to describe its graph representation.\n", "We can print the data object anytime via `print(data)` to receive a short summary about its attributes and their shapes:\n", "```\n", "Data(edge_index=[2, 156], x=[34, 34], y=[34], train_mask=[34])\n", "```\n", "We can see that this `data` object holds 4 attributes:\n", "(1) The `edge_index` property holds the information about the **graph connectivity**, *i.e.*, a tuple of source and destination node indices for each edge.\n", "PyG further refers to (2) **node features** as `x` (each of the 34 nodes is assigned a 34-dim feature vector), and to (3) **node labels** as `y` (each node is assigned to exactly one class).\n", "(4) There also exists an additional attribute called `train_mask`, which describes for which nodes we already know their community assigments.\n", "In total, we are only aware of the ground-truth labels of 4 nodes (one for each community), and the task is to infer the community assignment for the remaining nodes.\n", "\n", "\n", "We can further visualize the graph by converting it to the `networkx` library format, which implements, in addition to graph manipulation functionalities, powerful tools for visualization:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**to networkx**\n", "
\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 411 }, "id": "Y9MOs8iSwKFD", "outputId": "7bd4a2ef-08a4-458c-dfb5-09ea82fed059" }, "outputs": [], "source": [ "from torch_geometric.utils import to_networkx\n", "\n", "G = to_networkx(data,node_attrs=[\"x\"], to_undirected=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import networkx as nx\n", "\n", "G.nodes(data=True)[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G = to_networkx(data, to_undirected=True)\n", "\n", "visualize(G, color=data.y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What we did so far...**\n", "\n", "1) define our conv: myGCNConv \n", "2) download karate dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "kPbYXBn1yYIJ" }, "source": [ "## Implementing a GCN \n", "\n", "We implement a GCN as a `torch.nn.Module` class that contains a sequence of GCNConv layers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AkQAVluLuxT_", "outputId": "b049292d-b2da-4926-f974-776012a3ec60" }, "outputs": [], "source": [ "import torch\n", "from torch.nn import Linear\n", "#from torch_geometric.nn import GCNConv\n", "\n", "class GCN(torch.nn.Module):\n", " def __init__(self):\n", " super(GCN, self).__init__()\n", " torch.manual_seed(112233)\n", " self.conv1 = myGCNConv(dataset.num_features, 4)\n", " self.conv2 = myGCNConv(4, 4)\n", " self.conv3 = myGCNConv(4, 2)\n", " self.classifier = Linear(2, dataset.num_classes)\n", "\n", " def forward(self, x, edge_index):\n", " h = self.conv1(x, edge_index)\n", " h = h.tanh()\n", " h = self.conv2(h, edge_index)\n", " h = h.tanh()\n", " h = self.conv3(h, edge_index)\n", " h = h.tanh() # Final GNN embedding space.\n", " \n", " # Apply a final (linear) classifier.\n", " out = self.classifier(h)\n", "\n", " return out, h\n", "\n", "model = GCN()\n", "print(model)" ] }, { "cell_type": "markdown", "metadata": { "id": "hjsb3Fst2P8k" }, "source": [ "Here, we first initialize all of our building blocks in `__init__` and define the computation flow of our network in `forward`.\n", "We first define and stack **three graph convolution layers**, which corresponds to aggregating 3-hop neighborhood information around each node (all nodes up to 3 \"hops\" away).\n", "In addition, the `myGCNConv` layers reduce the node feature dimensionality to $2$, *i.e.*, $34 \\rightarrow 4 \\rightarrow 4 \\rightarrow 2$. Each `myGCNConv` layer is enhanced by a [tanh](https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html?highlight=tanh#torch.nn.Tanh) non-linearity.\n", "\n", "After that, we apply a single linear transformation ([`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html?highlight=linear#torch.nn.Linear)) that acts as a classifier to map our nodes to 1 out of the 4 classes/communities.\n", "\n", "We return both the output of the final classifier as well as the final node embeddings produced by our GNN.\n", "We proceed to initialize our final model via `GCN()`, and printing our model produces a summary of all its used sub-modules.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Dt8fPEmc4m5l" }, "source": [ "### Embedding the Karate Club Network\n", "\n", "Let's take a look at the node embeddings produced by our GNN.\n", "Here, we pass in the initial node features `x` and the graph connectivity information `edge_index` to the model, and visualize its 2-dimensional embedding." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 433 }, "id": "nwHtX5siwe2v", "outputId": "080bbcb1-8278-4dc5-dd4c-f163e9bb8372" }, "outputs": [], "source": [ "model = GCN()\n", "\n", "out, h = model(data.x, data.edge_index)\n", "print(f'Embedding shape: {list(h.shape)}')\n", "\n", "visualize(h, color=data.y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "out #.argmax(-1)\n", "\n", "out.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "1-W5Kfhu5I-P" }, "source": [ "Remarkably, even before training the weights of our model, the model produces an embedding of nodes that closely resembles the community-structure of the graph.\n", "Nodes of the same color (community) are already closely clustered together in the embedding space, although the weights of our model are initialized **completely at random** and we have not yet performed any training so far!\n", "This leads to the conclusion that GNNs introduce a strong inductive bias, leading to similar embeddings for nodes that are close to each other in the input graph.\n", "\n", "### Training on the Karate Club Network\n", "\n", "But can we do better? Let's look at an example on how to train our network parameters based on the knowledge of the community assignments of 4 nodes in the graph (one for each community):\n", "\n", "Since everything in our model is differentiable and parameterized, we can add some labels, train the model and observe how the embeddings react.\n", "Here, we make use of a semi-supervised or transductive learning procedure: We simply train with one labelled node per class, but are allowed to use the complete input graph data.\n", "\n", "\n", "\n", "Training our model is very similar to any other PyTorch model.\n", "In addition to defining our network architecture, we define a loss criterion (here, [`CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)) and initialize a stochastic gradient optimizer (here, [`Adam`](https://pytorch.org/docs/stable/optim.html?highlight=adam#torch.optim.Adam)).\n", "\n", "Note that our semi-supervised learning scenario is achieved by the following line:\n", "```\n", "loss = criterion(out[data.train_mask], data.y[data.train_mask])\n", "```\n", "While we compute node embeddings for all of our nodes, we **only use the training nodes for computing the loss**.\n", "Here, this is implemented by filtering the output of the classifier `out` and ground-truth labels `data.y` to only contain the nodes in the `train_mask`.\n", "\n", "Let us now start training and see how our node embeddings evolve over time (best experienced by explicitely running the code):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 430 }, "id": "etxOsz8QIbMO", "outputId": "a961884c-7520-460a-d591-94834563c4a2", "scrolled": true }, "outputs": [], "source": [ "import time\n", "#from IPython.display import Javascript # Restrict height of output cell.\n", "#display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 430})'''))\n", "\n", "model = GCN()\n", "criterion = torch.nn.CrossEntropyLoss() # Define loss criterion.\n", "optimizer = torch.optim.Adam(model.parameters(), lr=0.01) # Define optimizer.\n", "\n", "def train(data):\n", " optimizer.zero_grad() # Clear gradients.\n", " out, h = model(data.x, data.edge_index) # Perform a single forward pass.\n", " loss = criterion(out[data.train_mask], data.y[data.train_mask]) # Compute the loss solely based on the training nodes.\n", " loss.backward() # Derive gradients.\n", " optimizer.step() # Update parameters based on gradients.\n", " return loss, h\n", "\n", "for epoch in range(401):\n", " loss, h = train(data)\n", " if epoch % 10 == 0:\n", " visualize(h, color=data.y, epoch=epoch, loss=loss)\n", " time.sleep(0.3)" ] }, { "cell_type": "markdown", "metadata": { "id": "NgcpV4rjAWy-" }, "source": [ "As one can see, our 3-layer GCN model manages to linearly separating the communities and classifying most of the nodes correctly.\n", "\n", "Note that we do not need to reimplement standard GNN architectures, the library provides implementations for the most popular ones, including the GCN Layer:\n", "\n", "`torch_geometric.nn.GCNConv`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Example: graph classification \n", "\n", "Let's now have a closer look at how to apply **Graph Neural Networks (GNNs) to the task of graph classification**.\n", "Graph classification refers to the problem of classifiying entire graphs (in contrast to nodes), given a **dataset of graphs**, based on some structural graph properties.\n", "Here, we want to embed entire graphs, and we want to embed those graphs in such a way so that they are linearly separable given a task at hand." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Screen Shot 2020-08-27 at 13.13.26.png]()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most common task for graph classification is **molecular property prediction**, in which molecules are represented as graphs, and the task may be to infer whether a molecule inhibits HIV virus replication or not.\n", "\n", "The TU Dortmund University has collected a wide range of different graph classification datasets, known as the [**TUDatasets**](https://chrsmrrs.github.io/datasets/), which are also accessible via [`torch_geometric.datasets.TUDataset`](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.TUDataset) in PyTorch Geometric.\n", "Let's load and inspect one of the smaller ones, the **MUTAG dataset**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torch_geometric.datasets import TUDataset\n", "\n", "dataset = TUDataset(root='data/TUDataset', name='MUTAG')\n", "\n", "print()\n", "print(f'Dataset: {dataset}:')\n", "print('====================')\n", "print(f'Number of graphs: {len(dataset)}')\n", "print(f'Number of features: {dataset.num_features}')\n", "print(f'Number of classes: {dataset.num_classes}')\n", "\n", "data = dataset[0] # Get the first graph object.\n", "\n", "print()\n", "print(data)\n", "print('=============================================================')\n", "\n", "# Gather some statistics about the first graph.\n", "print(f'Number of nodes: {data.num_nodes}')\n", "print(f'Number of edges: {data.num_edges}')\n", "print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')\n", "print(f'Contains isolated nodes: {data.contains_isolated_nodes()}')\n", "print(f'Contains self-loops: {data.contains_self_loops()}')\n", "print(f'Is undirected: {data.is_undirected()}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset provides **188 different graphs**, and the task is to classify each graph into **one out of two classes**.\n", "\n", "By inspecting the first graph object of the dataset, we can see that it comes with **17 nodes (with 7-dimensional feature vectors)** and **38 edges** (leading to an average node degree of 2.24).\n", "It also comes with exactly **one graph label** (`y=[1]`), and, in addition to previous datasets, provides addtional **4-dimensional edge features** (`edge_attr=[38, 4]`).\n", "However, for the sake of simplicity, we will not use edge features.\n", "\n", "PyTorch Geometric provides some useful utilities for working with graph datasets, *e.g.*, we can shuffle the dataset and use the first 150 graphs as training graphs, while using the remaining ones for testing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "torch.manual_seed(12345)\n", "dataset = dataset.shuffle()\n", "\n", "train_dataset = dataset[:150]\n", "test_dataset = dataset[150:]\n", "\n", "print(f'Number of training graphs: {len(train_dataset)}')\n", "print(f'Number of test graphs: {len(test_dataset)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mini-batching of graphs\n", "\n", "Since graphs in graph classification datasets are usually small, a good idea is to **batch the graphs** before inputting them into a Graph Neural Network to guarantee full GPU utilization.\n", "In the image or language domain, this procedure is typically achieved by **rescaling** or **padding** each example into a set of equally-sized shapes, and examples are then grouped in an additional dimension.\n", "The length of this dimension is then equal to the number of examples grouped in a mini-batch and is typically referred to as the `batch_size`.\n", "\n", "**NOT A GOOD IDEA**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, for GNNs the two approaches described above are either not feasible or may result in a lot of unnecessary memory consumption.\n", "Therefore, PyTorch Geometric opts for another approach to achieve parallelization across a number of examples. Here, adjacency matrices are stacked in a diagonal fashion (creating a giant graph that holds multiple isolated subgraphs), and node and target features are simply concatenated in the node dimension:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Screen Shot 2020-08-27 at 13.11.53.png]()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This procedure has some crucial advantages over other batching procedures:\n", "\n", "1. GNN operators that rely on a message passing scheme do not need to be modified since messages are not exchanged between two nodes that belong to different graphs.\n", "\n", "2. There is no computational or memory overhead since adjacency matrices are saved in a sparse fashion holding only non-zero entries, *i.e.*, the edges.\n", "\n", "PyTorch Geometric automatically takes care of **batching multiple graphs into a single giant graph** with the help of the [`torch_geometric.data.DataLoader`](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.DataLoader) class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torch_geometric.data import DataLoader\n", "\n", "train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)\n", "test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)\n", "\n", "for step, data in enumerate(train_loader):\n", " print(f'Step {step + 1}:')\n", " print('=======')\n", " print(f'Number of graphs in the current batch: {data.num_graphs}')\n", " print(data)\n", " print()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# test batch_size = 147\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# show G" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# show adj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training a Graph Neural Network (GNN)\n", "\n", "Training a GNN for graph classification usually follows a simple recipe:\n", "\n", "1. Embed each node by performing multiple rounds of message passing\n", "2. Aggregate node embeddings into a unified graph embedding (**readout layer**)\n", "3. Train a final classifier on the graph embedding\n", "\n", "There exists multiple **readout layers** in literature, but the most common one is to simply take the average of node embeddings:\n", "\n", "$$\n", "\\mathbf{x}_{\\mathcal{G}} = \\frac{1}{|\\mathcal{V}|} \\sum_{v \\in \\mathcal{V}} \\mathcal{x}^{(L)}_v\n", "$$\n", "\n", "PyTorch Geometric provides this functionality via [`torch_geometric.nn.global_mean_pool`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.glob.global_mean_pool), which takes in the node embeddings of all nodes in the mini-batch and the assignment vector `batch` to compute a graph embedding of size `[batch_size, hidden_channels]` for each graph in the batch.\n", "\n", "The final architecture for applying GNNs to the task of graph classification then looks as follows and allows for complete end-to-end training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torch.nn import Linear\n", "import torch.nn.functional as F\n", "from torch_geometric.nn import GCNConv\n", "from torch_geometric.nn import global_mean_pool\n", "\n", "class GCN(torch.nn.Module):\n", " def __init__(self, hidden_channels):\n", " super(GCN, self).__init__()\n", " torch.manual_seed(12345678)\n", " self.conv1 = GCNConv(dataset.num_node_features, hidden_channels)\n", " self.conv2 = GCNConv(hidden_channels, hidden_channels)\n", " self.conv3 = GCNConv(hidden_channels, hidden_channels)\n", " self.lin = Linear(hidden_channels, dataset.num_classes)\n", "\n", " def forward(self, x, edge_index, batch):\n", " # 1. Obtain node embeddings \n", " x = self.conv1(x, edge_index)\n", " x = x.relu()\n", " x = self.conv2(x, edge_index)\n", " x = x.relu()\n", " x = self.conv3(x, edge_index)\n", " \n", " # 2. Readout layer\n", " x = global_mean_pool(x, batch) # [batch_size, hidden_channels]\n", " \n", " # 3. Apply a final classifier\n", " x = F.dropout(x, p=0.5, training=self.training)\n", " x = self.lin(x)\n", " \n", " return x\n", "\n", "model = GCN(hidden_channels=64)\n", "print(model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we again make use of the [`GCNConv`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv) with $\\mathrm{ReLU}(x) = \\max(x, 0)$ activation for obtaining localized node embeddings, before we apply our final classifier on top of a graph readout layer.\n", "\n", "Let's train our network for a few epochs to see how well it performs on the training as well as test set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#from IPython.display import Javascript\n", "#display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))\n", "\n", "model = GCN(hidden_channels=64)\n", "optimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n", "criterion = torch.nn.CrossEntropyLoss()\n", "\n", "def train():\n", " model.train()\n", "\n", " for data in train_loader: # Iterate in batches over the training dataset.\n", " out = model(data.x, data.edge_index, data.batch) # Perform a single forward pass.\n", " loss = criterion(out, data.y) # Compute the loss.\n", " loss.backward() # Derive gradients.\n", " optimizer.step() # Update parameters based on gradients.\n", " optimizer.zero_grad() # Clear gradients.\n", "\n", "def test(loader):\n", " model.eval()\n", "\n", " correct = 0\n", " for data in loader: # Iterate in batches over the training/test dataset.\n", " out = model(data.x, data.edge_index, data.batch) \n", " pred = out.argmax(dim=1) # Use the class with highest probability.\n", " correct += int((pred == data.y).sum()) # Check against ground-truth labels.\n", " return correct / len(loader.dataset) # Derive ratio of correct predictions.\n", "\n", "\n", "for epoch in range(1, 51):\n", " train()\n", " train_acc = test(train_loader)\n", " test_acc = test(test_loader)\n", " if epoch % 10 == 0:\n", " print(f'Epoch: {epoch:03d}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As one can see, our model reaches around **76% test accuracy**.\n", "Reasons for the fluctations in accuracy can be explained by the rather small dataset (only 38 test graphs), and usually disappear once one applies GNNs to larger datasets.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Other datasets?\n", "\n", "\n", "Doc https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html \n", "OGB https://ogb.stanford.edu/docs/dataset_overview/\n", "\n", "# Other architectures?\n", "\n", "Doc https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#convolutional-layers\n", "\n", "# Other resources?\n", "Doc https://pytorch-geometric.readthedocs.io/en/latest/notes/colabs.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# let's move higher order [super fast]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torch.nn import Linear\n", "import torch.nn.functional as F\n", "from torch_geometric.nn import GraphConv, GINConv, GINEConv, GATConv # ......\n", "from torch_geometric.nn import global_mean_pool\n", "\n", "class GCN(torch.nn.Module):\n", " def __init__(self, hidden_channels):\n", " super(GCN, self).__init__()\n", " torch.manual_seed(12345678)\n", " self.conv1 = GraphConv(dataset.num_node_features, hidden_channels)\n", " self.conv2 = GraphConv(hidden_channels, hidden_channels)\n", " self.lin = Linear(hidden_channels, dataset.num_classes)\n", "\n", " def forward(self, x, edge_index, batch):\n", " # 1. Obtain node embeddings \n", " x = self.conv1(x, edge_index)\n", " x = x.relu()\n", " x = self.conv2(x, edge_index)\n", " \n", " # 2. Readout layer\n", " x = global_mean_pool(x, batch) # [batch_size, hidden_channels]\n", " \n", " # 3. Apply a final classifier\n", " x = F.dropout(x, p=0.5, training=self.training)\n", " x = self.lin(x)\n", " \n", " return x\n", "\n", "model = GCN(hidden_channels=64)\n", "print(model)\n", "\n", "\n", "#from IPython.display import Javascript\n", "#display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))\n", "\n", "model = GCN(hidden_channels=64)\n", "optimizer = torch.optim.Adam(model.parameters(), lr=0.001)\n", "criterion = torch.nn.CrossEntropyLoss()\n", "\n", "def train():\n", " model.train()\n", "\n", " for data in train_loader: # Iterate in batches over the training dataset.\n", " out = model(data.x, data.edge_index, data.batch) # Perform a single forward pass.\n", " loss = criterion(out, data.y) # Compute the loss.\n", " loss.backward() # Derive gradients.\n", " optimizer.step() # Update parameters based on gradients.\n", " optimizer.zero_grad() # Clear gradients.\n", "\n", "def test(loader):\n", " model.eval()\n", "\n", " correct = 0\n", " for data in loader: # Iterate in batches over the training/test dataset.\n", " out = model(data.x, data.edge_index, data.batch) \n", " pred = out.argmax(dim=1) # Use the class with highest probability.\n", " correct += int((pred == data.y).sum()) # Check against ground-truth labels.\n", " return correct / len(loader.dataset) # Derive ratio of correct predictions.\n", "\n", "\n", "for epoch in range(1, 51):\n", " train()\n", " train_acc = test(train_loader)\n", " test_acc = test(test_loader)\n", " if epoch % 10 == 0:\n", " print(f'Epoch: {epoch:03d}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "1. Introduction.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 1 }