
# Variational Autoencoders and Bioinformatics

## Introduction to variational autoencoders and their application to bioinformatics/metagenomics analysis

Autoencoders, in general, intend to learn a lower-dimensional representation of data. One of their main advantages is that they can learn much more complex lower-dimensional representations, whereas PCA-like decompositions are limited by their linear nature. Feel free to have a look at my article about autoencoders.

As usual, I will be talking about an application in bioinformatics.

# Variational Autoencoders

In contrast to conventional autoencoders (AEs), variational autoencoders (VAEs) belong to the family of generative models. This is because VAEs learn a latent distribution for the input data, so they are capable of reconstructing new data points given new samples from those distributions. However, the generative aspect of VAEs is not used very often, since GANs are much better at that task.

One major importance of VAEs is the ability to learn different distributions in data (similar to Gaussian mixtures). These can help in clustering data that demonstrate different underlying distributions.

# Necessary Imports in Python

To make the code in the following sections work, we will need the following imports.

```python
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.data.dataset import TensorDataset

import numpy as np
from tqdm import tqdm, trange

import seaborn as sns
import umap
import matplotlib.pyplot as plt
```

# The Architecture of a VAE

Similar to an AE, we have a bottleneck stage followed by reconstruction. More formally, we have an encoder and a decoder. Note the latent representation `VL` and the accompanying `sigma` and `mu` variables. The reconstruction is **generated** from these distribution parameters. Our complete VAE class would look like below.
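A minimal sketch of such a VAE class follows; the hidden size of 128 and the exact layer layout are my assumptions, but it follows the conventions discussed below (ReLU in the hidden layers, `softplus` on `logvar`, `sigmoid` on the reconstruction):

```python
import torch
from torch import nn
from torch.nn import functional as F

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # encoder: input -> hidden -> (mu, logvar)
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        # decoder: latent -> hidden -> reconstruction
        self.fc2 = nn.Linear(latent_dim, 128)
        self.fc3 = nn.Linear(128, input_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        # softplus keeps the predicted logvar positive
        return self.fc_mu(h), F.softplus(self.fc_logvar(h))

    def reparameterize(self, mu, logvar):
        # sample z = mu + eps * sigma, with sigma = exp(0.5 * logvar)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.fc2(z))
        # sigmoid: input features are scaled to [0, 1]
        return torch.sigmoid(self.fc3(h))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```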

# Re-parameterization Trick

Note that we have `sigma` and `mu` to pass gradients through. In other words, we shall set the VAE to learn proper `sigma` and `mu` given an input dataset. We achieve this using the reparameterization trick, that is:

`decoder_input = mu + epsilon * exp(0.5 * logvar)`

Since we exponentiate the learned variable, we are effectively learning the log of the variance, hence the terminology `logvar` in code.
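A minimal, self-contained sketch of this trick in PyTorch (the function name is mine):

```python
import torch

def reparameterize(mu, logvar):
    # sigma = exp(0.5 * logvar); sampling an external eps ~ N(0, 1)
    # keeps the path from the loss back to mu and logvar differentiable
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std
```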

# Activation and Loss Functions

Note that we use ReLU activation in many places. However, we choose `softplus` for the `logvar` latent variable, since softplus keeps its output strictly positive. In the final reconstruction, we use `sigmoid` since our input data dimensions range between 0 and 1.

Our loss function consists of two parts. The reconstruction loss (RL) and the KL-Divergence (KLD). We particularly use KLD in order to ensure that the learned distributions are as close as possible to a normal distribution (or Gaussian, or some distribution we like). You can read more here.

Reconstruction loss is the good old Mean Squared Error. Depending on the application this could differ; for example, black-and-white images (MNIST) could use Binary Cross-Entropy.

Finally, we can search for proper weight hyper-parameters for the RL and KLD that support the best clustering (or binning, etc.).
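A sketch of the combined loss; the weight arguments `w_rl` and `w_kld` stand in for the hyper-parameters mentioned above:

```python
import torch
from torch.nn import functional as F

def vae_loss(recon_x, x, mu, logvar, w_rl=1.0, w_kld=1.0):
    # reconstruction loss: MSE between input and reconstruction
    rl = F.mse_loss(recon_x, x, reduction="sum")
    # KL divergence between N(mu, sigma) and the standard normal N(0, 1)
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return w_rl * rl + w_kld * kld
```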

# Running Example with PyTorch

Let us consider the following Metagenomics dataset from one of my recent papers. Long reads were simulated using the **SimLoRD** application.

The dataset was then vectorized using the following tool:

The data loader can be designed as follows:
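A sketch of a `make_data_loader` helper matching the call used in the training step below (the name and signature mirror that call; the body is my assumption):

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.dataset import TensorDataset

def make_data_loader(data, batch_size=1024, drop_last=True, shuffle=True, device="cpu"):
    # wrap the numpy feature matrix as a float tensor on the target device
    tensor = torch.from_numpy(data).float().to(device)
    dataset = TensorDataset(tensor)
    return DataLoader(dataset, batch_size=batch_size,
                      drop_last=drop_last, shuffle=shuffle)
```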

Function to train the VAE:
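A sketch of such a training function, matching the `train(model, train_loader, epochs, device)` call used later. For self-containment it builds its own Adam optimizer and inlines the MSE + KLD loss; the learning rate is an assumption:

```python
import torch
from torch import optim
from torch.nn import functional as F
from tqdm import trange

def train(model, train_loader, epochs, device):
    optimizer = optim.Adam(model.parameters(), lr=1e-2)
    model.train()
    losses = []
    for _ in trange(epochs):
        epoch_loss = 0.0
        for (batch,) in train_loader:
            batch = batch.to(device)
            optimizer.zero_grad()
            recon, mu, logvar = model(batch)
            # reconstruction loss + KL divergence, as described above
            rl = F.mse_loss(recon, batch, reduction="sum")
            kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
            loss = rl + kld
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        losses.append(epoch_loss)
    return losses
```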

Initialization:

```python
data = LOAD DATA
truth = LOAD GROUND TRUTH  # for visualization

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VAE(data.shape[1], 8).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-2)
```

Training:

```python
train_loader = make_data_loader(data, batch_size=1024, drop_last=True,
                                shuffle=True, device=device)
epochs = 50
train(model, train_loader, epochs, device)
```

Obtaining latent representations:

```python
with torch.no_grad():
    model.eval()
    data = LOAD DATA
    data = torch.from_numpy(data).float().to(device)
    em, _ = model.encode(data)
```

Visualize:

```python
import random

sidx = random.sample(range(len(data)), 10000)  # just use 10000

em_2d = umap.UMAP().fit_transform(em.cpu().numpy()[sidx])
plt.figure(figsize=(10, 10))
sns.scatterplot(x=em_2d.T[0], y=em_2d.T[1], hue=truth[sidx])
plt.legend(bbox_to_anchor=(1.05, 1))
```

# Metagenomics Binning

Once the VAE is trained, we can obtain the latent representations. In this example, I used UMAP to project a sample of 10000 reads into 2D for visualization. It looked like the following.

As you can see, there are blobs of data points here and there. One can easily use a tool like **HDBSCAN** to extract dense clusters from this.

# Conclusive Remarks

The original idea of using a VAE for binning was presented by VAMB, which bins assemblies. VAMB often demands a high contig count (>10000 or so), because you always need more data to perform better in deep learning. Considering all these challenges and opportunities, we developed our own tool, LRBinner, to bin metagenomics reads, where we always have millions of reads. The complete LRBinner tool is much more complicated than what I have presented in this article, but the intuition and ideas remain the same. We have also used a different clustering algorithm. If you're interested, have a look below.

Find the complete Jupyter Notebook here. Original PyTorch example code is here.

I hope you enjoyed reading this article. Have a nice day!
