Variational Autoencoders and Bioinformatics




An introduction to variational autoencoders and their application in bioinformatics/metagenomics analysis

Autoencoders, in general, intend to learn a lower-dimensional representation of data. One of their main advantages is that they can learn far more complex lower-dimensional representations, whereas PCA-like decompositions are limited by their linear nature. Feel free to have a look at my article about autoencoders.

Photo by Hitesh Choudhary on Unsplash

As usual, I will be talking about an application in bioinformatics.

Variational Autoencoders

In contrast to conventional autoencoders (AEs), variational autoencoders (VAEs) belong to the family of generative models. This is because VAEs learn a latent distribution for the input data, so they can generate new data points by sampling from those distributions. However, the generative aspect of VAEs is not used all that often, since GANs are much better at that task.

One major strength of VAEs is their ability to learn different distributions in the data (similar to Gaussian mixtures). This can help in clustering data that exhibit different underlying distributions.

Necessary Imports in Python

In order to make the code snippets that follow work, we will need the following imports.

import torch
from torch.nn import functional as F
from torch import nn, optim
from torch.utils.data import DataLoader
from torch.utils.data.dataset import TensorDataset
import numpy as np
from tqdm import tqdm, trange
import seaborn as sns
import umap
import matplotlib.pyplot as plt

The Architecture of a VAE

Diagram by Author

Similar to an AE, we have a bottleneck stage followed by reconstruction. More formally, we have an encoder and a decoder. Note the latent representation VL and the accompanying sigma and mu variables. The reconstruction is generated from these distribution parameters. Our complete VAE class would look like the following.

Image by Author
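
As a minimal sketch (not the exact class in the figure), such a VAE could be written in PyTorch as follows. The hidden size of 512 is an illustrative choice, and the reparameterization and forward steps are added in the next section.

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim, hidden_dim=512):
        super().__init__()
        # Encoder: input -> hidden -> (mu, logvar)
        self.enc_fc = nn.Linear(input_dim, hidden_dim)
        self.enc_mu = nn.Linear(hidden_dim, latent_dim)
        self.enc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: latent -> hidden -> reconstruction
        self.dec_fc = nn.Linear(latent_dim, hidden_dim)
        self.dec_out = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.enc_fc(x))
        mu = self.enc_mu(h)
        # softplus keeps this latent parameter positive (see the activation section below)
        logvar = F.softplus(self.enc_logvar(h))
        return mu, logvar

    def decode(self, z):
        h = F.relu(self.dec_fc(z))
        # sigmoid because the input features are scaled to [0, 1]
        return torch.sigmoid(self.dec_out(h))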

Re-parameterization Trick

Note that we need to pass gradients through sigma and mu. In other words, we want the VAE to learn proper sigma and mu for a given input dataset. We achieve this using the reparameterization trick. That is;

decoder_input = mu + epsilon * e^(logvar / 2)

Note that, the moment we exponentiate the learned term, we are effectively working with a log variance. Hence the name logvar in the code.
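
Inside the VAE class sketched above, the trick and the forward pass could look like this (again a sketch; the variable names follow the diagram):

    def reparameterize(self, mu, logvar):
        # std = e^(logvar / 2); epsilon is sampled from a standard normal
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)  # this becomes the decoder input
        return self.decode(z), mu, logvar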

Activation and Loss Functions

Note that we use ReLU activation in most places. However, we chose softplus for the logvar latent variable; this keeps that output positive (the variance must be greater than zero). In the final reconstruction, we use sigmoid since our input data dimensions range between 0 and 1.

Image by Author

Our loss function consists of two parts: the reconstruction loss (RL) and the KL divergence (KLD). We use the KLD term to ensure that the learned distributions stay as close as possible to a normal distribution (or Gaussian, or some other distribution we like). You can read more here.

The reconstruction loss is the good old mean squared error. Depending on the application, this could differ; for example, black-and-white images (MNIST) could use binary cross-entropy.

Finally, we can search for weight hyper-parameters for RL and KLD that support the best clustering (or binning, etc.).
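
A sketch of such a weighted loss could look like this (the weights rl_weight and kld_weight are hyper-parameters introduced here for illustration):

def vae_loss(recon_x, x, mu, logvar, rl_weight=1.0, kld_weight=1.0):
    # Reconstruction loss: mean squared error between input and reconstruction
    rl = F.mse_loss(recon_x, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and a standard normal prior
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rl_weight * rl + kld_weight * kld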

Running Example with PyTorch

Let us consider the following metagenomics dataset from one of my recent papers. Long reads were simulated using the SimLoRD application.

LRBinner dataset (License CC)

The dataset was then vectorized using the following tool;

The data loader can be designed as follows;

Image by Author
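
A minimal sketch of such a loader, using the TensorDataset and DataLoader imports from earlier (the function name make_data_loader matches the call used in the training step below):

def make_data_loader(data, batch_size=1024, drop_last=True, shuffle=True, device="cpu"):
    # Wrap the numpy feature matrix as a float tensor and move it to the chosen device
    tensor = torch.from_numpy(data).float().to(device)
    dataset = TensorDataset(tensor)
    return DataLoader(dataset, batch_size=batch_size, drop_last=drop_last, shuffle=shuffle)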

Function to train the VAE;

Image by Author
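
A minimal training loop consistent with the initialization and training calls below could be sketched as follows; it reuses the globally created optimizer and the vae_loss sketch from the previous section.

def train(model, train_loader, epochs, device):
    model.train()
    for epoch in trange(epochs, desc="Epochs"):
        epoch_loss = 0.0
        for (batch,) in train_loader:
            # batches are already on the right device (see make_data_loader)
            optimizer.zero_grad()
            recon, mu, logvar = model(batch)
            loss = vae_loss(recon, batch, mu, logvar)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        tqdm.write(f"Epoch {epoch + 1}: loss = {epoch_loss / len(train_loader):.4f}")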

Initialization;

data = LOAD DATA
truth = LOAD GROUND TRUTH # for visualization
device = "cuda" if torch.cuda.is_available() else "cpu"
model = VAE(data.shape[1], 8).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-2)

Training;

train_loader = make_data_loader(data, batch_size=1024, drop_last=True, shuffle=True, device=device)
epochs = 50
train(model, train_loader, epochs, device)

Obtaining latent representations;

with torch.no_grad():
    model.eval()
    data = LOAD DATA
    data = torch.from_numpy(data).float().to(device)
    em, _ = model.encode(data)

Visualize;

import random

sidx = random.sample(range(len(data)), 10000) # just use 10000
em_2d = umap.UMAP().fit_transform(em.cpu().numpy()[sidx])
plt.figure(figsize=(10,10))
sns.scatterplot(x=em_2d.T[0], y=em_2d.T[1], hue=truth[sidx])
plt.legend(bbox_to_anchor=(1.05, 1))

Metagenomics Binning

Once the VAE is trained, we can obtain the latent representations. In this example, I used UMAP to project a sample of 10,000 reads into 2D for visualization. It looked like the following.

Image by Author

As we can see, there are blobs of data points here and there. One can easily use a tool like HDBSCAN to extract dense clusters from this, as sketched below.
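
For example, a dense-cluster extraction on the latent embeddings could be sketched like this (hdbscan is an extra dependency not listed in the imports above, and min_cluster_size is an illustrative value):

import hdbscan

# Cluster the latent embeddings; points labelled -1 are treated as noise by HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=100)
labels = clusterer.fit_predict(em.cpu().numpy())
print("Number of bins found:", len(set(labels)) - (1 if -1 in labels else 0))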

Concluding Remarks

The original idea of using a VAE for binning was presented by VAMB, which bins assemblies. VAMB often demands a high contig count (>10000 or so), because in deep learning you always need more data to perform better. Considering these challenges and opportunities, we developed our own tool, LRBinner, to bin metagenomic reads, where we always have millions of reads. The complete LRBinner tool is much more complicated than what I have presented in this article, but the intuition and ideas remain the same. We also used a different clustering algorithm. If you’re interested, have a look below.

Find the complete Jupyter Notebook here. Original PyTorch example code is here.

I hope you enjoyed reading this article. Have a nice day!
