Variational Autoencoders and Bioinformatics
Introduction to variational autoencoders and their application to bioinformatics and metagenomics analysis
Autoencoders, in general, intend to learn a lower-dimensional representation of data. One of the main advantages of autoencoders is that they are capable of learning much more complex lower-dimensional representations, whereas PCA-like decompositions are limited by their linear nature. Feel free to have a look at my article about autoencoders.
As usual, I will be talking about an application in bioinformatics.
In contrast to conventional autoencoders (AEs), variational autoencoders (VAEs) belong to the family of generative models. This is because VAEs learn a latent distribution for the input data, so they are capable of reconstructing new data points given new samples drawn from those distributions. However, the generative aspect of VAEs is not used all that often, since GANs are much better at that task.
One major strength of VAEs is the ability to learn different distributions in the data (similar to Gaussian mixtures). This can help in clustering data that demonstrate different underlying distributions.
Necessary Imports in Python
To make the code in the following sections work, we will need the following imports.
import torch
from torch.nn import functional as F
from torch import nn, optim
from torch.utils.data import DataLoader
from torch.utils.data.dataset import TensorDataset
import numpy as np
import random
from tqdm import tqdm, trange
import umap
import seaborn as sns
import matplotlib.pyplot as plt
The Architecture of a VAE
Similar to an AE, we have a bottleneck stage followed by reconstruction. More formally, we have an encoder and a decoder. Note the latent representation: the encoder outputs the mu and logvar variables, which parameterize a Gaussian distribution, and the reconstruction is generated from samples drawn using these distribution parameters. Our complete VAE class would look like below.
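A minimal sketch of such a VAE class is given below. The layer sizes and the `input_dim`/`latent_dim` arguments are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder: compress the input down to a shared hidden representation
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
        )
        # Two heads: mean and log variance of the latent Gaussian
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        # Decoder: reconstruct the input from a latent sample;
        # sigmoid because the input dimensions are scaled to [0, 1]
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + eps * sigma, with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
```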
Note that we need to sample from the latent distribution while still passing gradients through mu and logvar. In other words, we want the VAE to learn proper mu and logvar values for a given input dataset. We achieve this using the reparameterization trick. That is:
decoder_input = mu + epsilon * exp(0.5 * logvar)
where epsilon is sampled from a standard normal distribution. Note that the moment we choose to exponentiate the latent variable to obtain the standard deviation, we are talking about log variance, hence the logvar terminology in the code.
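In code, the trick looks like this (a standalone sketch; in practice it lives inside the VAE's forward pass):

```python
import torch

def reparameterize(mu, logvar):
    # logvar = log(sigma^2), so sigma = exp(0.5 * logvar)
    std = torch.exp(0.5 * logvar)
    # epsilon is sampled outside the computation graph,
    # so gradients flow only through mu and logvar
    eps = torch.randn_like(std)
    return mu + eps * std
```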
Activation and Loss Functions
Note that we are using ReLU activation in the hidden layers. However, we leave the logvar latent variable unconstrained: the variance itself is always greater than zero, so we learn its logarithm instead, which can take any real value. In the final reconstruction, we use sigmoid since our input data dimensions range between 0 and 1.
Our loss function consists of two parts: the reconstruction loss (RL) and the KL-divergence (KLD). We use the KLD term to ensure that the learned distributions are as close as possible to a standard normal distribution (or some other prior distribution we like). You can read more here.
The reconstruction loss is the good old mean squared error. Depending on the application, this could differ; for example, black-and-white images (MNIST) could use binary cross-entropy.
Finally, we can search for proper weight hyper-parameters for the RL and KLD terms that support the best clustering (or binning, etc.).
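Putting the two terms together, a sketch of the loss could look like the following. The `kld_weight` hyper-parameter controlling the RL/KLD trade-off is an assumed name; the KLD formula is the standard closed form for the divergence between N(mu, sigma^2) and N(0, I).

```python
import torch
from torch.nn import functional as F

def vae_loss(recon_x, x, mu, logvar, kld_weight=1.0):
    # Reconstruction loss: mean squared error between input and reconstruction
    rl = F.mse_loss(recon_x, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I)
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rl + kld_weight * kld
```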
Running Example with PyTorch
Let us consider the following Metagenomics dataset from one of my recent papers. Long reads were simulated using the SimLoRD application.
Now the dataset was vectorized using the following tool;
The data loader can be designed as follows;
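A sketch of such a loader is given below; the `make_data_loader` signature is an assumption matching how it is called later in the article.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_data_loader(data, batch_size=1024, drop_last=True,
                     shuffle=True, device="cpu"):
    # Wrap the NumPy feature matrix as a float tensor dataset.
    # Batches are moved to `device` inside the training loop, not here,
    # to avoid holding the whole dataset in GPU memory.
    tensor = torch.from_numpy(data).float()
    dataset = TensorDataset(tensor)
    return DataLoader(dataset, batch_size=batch_size,
                      drop_last=drop_last, shuffle=shuffle)
```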
Function to train the VAE;
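A sketch of the training loop is shown below. The signature is assumed from the call further down; for self-containment the optimizer is created inside the function here, and the loss follows the RL + KLD form discussed earlier. It assumes the model's forward pass returns (reconstruction, mu, logvar).

```python
import torch
from torch import optim
from torch.nn import functional as F
from tqdm import trange

def train(model, train_loader, epochs, device, lr=1e-2, kld_weight=1.0):
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    epoch_losses = []
    for _ in trange(epochs):
        total = 0.0
        for (batch,) in train_loader:
            batch = batch.to(device)
            optimizer.zero_grad()
            recon, mu, logvar = model(batch)
            # Reconstruction loss + weighted KL divergence
            rl = F.mse_loss(recon, batch, reduction="sum")
            kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
            loss = rl + kld_weight * kld
            loss.backward()
            optimizer.step()
            total += loss.item()
        epoch_losses.append(total)
    return epoch_losses
```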
data = LOAD DATA
truth = LOAD GROUND TRUTH  # for visualization

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VAE(data.shape, 8).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-2)
train_loader = make_data_loader(data, batch_size=1024, drop_last=True, shuffle=True, device=device)

epochs = 50
train(model, train_loader, epochs, device)
Obtaining latent representations;
data = LOAD DATA
data = torch.from_numpy(data).float().to(device)
em, _ = model.encode(data)

import random
sidx = random.sample(range(len(data)), 10000)  # just use 10000

em_2d = umap.UMAP().fit_transform(em.detach().cpu().numpy()[sidx])

plt.figure(figsize=(10, 10))
sns.scatterplot(x=em_2d.T[0], y=em_2d.T[1], hue=truth[sidx])
Once the VAE is trained, we can obtain the latent representations. In this example, I used UMAP to project a sample of 10,000 reads into 2D for visualization. It looked like the following.
As it looks, we can see blobs of data points here and there. One can easily use a tool like HDBSCAN to extract dense clusters from this.
The original idea of using a VAE for binning was presented by VAMB, for binning assemblies. VAMB often demands a high contig count (>10,000 or so), since you always need more data to perform better in deep learning. Considering these challenges and opportunities, we developed our own tool, LRBinner, to bin metagenomic reads; there we always have millions of reads. The complete LRBinner tool is much more complicated than what I have presented in this article, but the intuition and ideas remain the same. We also used a different clustering algorithm. If you're interested, have a look below.
I hope you enjoyed reading this article. Have a nice day!