Conditional Generative Models to Create New Molecules


Drug discovery using deep learning has attracted a lot of attention of late because it offers clear advantages such as higher efficiency, less manual guesswork and shorter processing time. In this paper, we present a novel neural network for generating small molecules similar to the ones in the training set. Our network consists of an encoder made up of bi-GRU layers that converts the input samples to a latent space, a predictor made up of 1D-CNN layers that enhances the capability of the encoder, and a decoder made up of uni-GRU layers that reconstructs the samples from the latent space representation. A condition vector in the latent space is used for generating molecules with the desired properties. We present the loss functions used for training our network, experimental details and property prediction metrics. Our network outperforms previous methods using molecular weight, LogP and Quantitative Estimation of Drug-likeness as the evaluation metrics.


Deep learning has achieved tremendous success in many areas, tackling datasets that range from images to text. One of those areas is Cheminformatics. Neural networks have been used to solve a variety of problems such as molecule property prediction, drug design and chemical reaction prediction. In this work, we propose a novel network for de novo molecular design using deep learning. Neural networks offer a better approach because they speed up the process while also increasing efficiency compared to traditional methods.

The number of potential organic molecules in chemical space is very large (Virshup et al., 2013). Of these, only some have been discovered. Traditional methods discovered new molecules by exploring the chemical space around molecules that were already known (Kim et al., 2016). Many newly discovered molecules came out of trial-and-error methods.

The main challenge is to generate new molecules with desired properties (molecular weight, toxicity, solubility etc.). Deep learning reduces the computational overhead of searching for new molecules. This approach is not only faster but also cheaper and more efficient. In addition, the whole chemical space can be used to search for potential small molecules. The process of de novo molecular design can be separated into the following parts: 1. molecular generation; 2. an approach to rank molecules; 3. a function to optimize the molecular space in search of new molecules.

Important Points

* We present a novel network for generating molecules that are new but similar to the ones it is trained on.

* Our network can be divided into three parts: an encoder, which converts the training data into a latent space; a predictor, which enhances the function of the encoder using another latent space conversion; and a decoder, which is responsible for generating new molecules from the latent space.

* We present the loss functions, experimental details and evaluation metrics for property prediction.

* Our network outperforms previous methods on most of the metrics.


A sample of 100,000 SMILES strings of drug-like molecules was drawn at random from the ZINC database. We use 90,000 molecules for training and 10,000 molecules for testing the property prediction performance. A special end-of-sequence symbol is appended to every sequence. To evaluate the performance of our network, we used three properties: molecular weight (MolWt), Wildman-Crippen partition coefficient (LogP) and quantitative estimation of drug-likeness (QED).
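The data preparation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `sample_and_split`, the newline used as the end-of-sequence symbol, and the toy list standing in for the database are all assumptions, since the source does not specify them.

```python
import random

# Hypothetical end-of-sequence symbol; the source does not say which one is used.
# A newline is chosen here only because it never occurs inside a SMILES string.
EOS = "\n"

def sample_and_split(smiles, n_train, n_test, seed=0):
    """Randomly sample n_train + n_test strings, append EOS, and split them."""
    if n_train + n_test > len(smiles):
        raise ValueError("not enough molecules to sample from")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sampled = rng.sample(smiles, n_train + n_test)  # sampling without replacement
    marked = [s + EOS for s in sampled]
    return marked[:n_train], marked[n_train:]

# Toy stand-in for the database; the text above uses 90,000 / 10,000 molecules.
toy_db = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCC",
          "C=C", "C#N", "CO", "CCCl", "OCCO"]
train, test = sample_and_split(toy_db, n_train=9, n_test=1)
```

With the real data, the same call would use `n_train=90000` and `n_test=10000`.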

Molecular Representation

Many molecular representations have been used in the literature. The most common among them are SMILES (Weininger, 1988) and the graph representation. The more detailed the representation is, the greater the computational burden it imposes. The Simplified Molecular Input Line Entry System (SMILES) is a one-dimensional representation of a two-dimensional chemical drawing. It contains atom and bond symbols with a simple vocabulary. Since SMILES strings are easy to understand and parse, Natural Language Processing (NLP) techniques can be applied to them. More than one SMILES representation of a molecule is possible; however, only one canonical form is used per molecule. The molecular latent space visualized in two dimensions using principal component analysis is shown in Figure 1:

Figure 1: 2D visualization of around 8000 molecules encoded into the latent space.


The following benchmarks were used to determine the performance of our network for generating molecules:

  1. Validity: It assesses whether the generated molecules are realistic or not. Examples of invalid molecules are ones with a wrong valency configuration or wrong SMILES syntax.
  2. Uniqueness: It assesses whether the molecules generated are different from one another or not.
  3. Novelty: It assesses whether the molecules generated are different from the ones in the training set or not.
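The three benchmarks can be computed along the following lines. This is a sketch under stated assumptions: uniqueness is taken over the valid molecules and novelty over the unique valid ones (definitions vary between papers, and the source does not give the denominators), and `balanced_brackets` is a hypothetical stand-in for a real validity check such as parsing the string with a cheminformatics toolkit.

```python
def generation_metrics(generated, training_set, is_valid):
    """Return (validity, uniqueness, novelty) as fractions in [0, 1]."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

def balanced_brackets(smiles):
    """Hypothetical validity check: non-empty string with balanced parentheses.

    A real check would also verify valences, ring closures and the rest of
    the SMILES syntax; this toy version exists only to make the sketch run.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smiles) > 0

generated = ["CCO", "CCO", "CC(=O)O", "CC(C", "CCN"]
training = ["CCO", "c1ccccc1"]
validity, uniqueness, novelty = generation_metrics(
    generated, training, balanced_brackets
)
```

In practice the strings should be converted to canonical SMILES before comparison, since one molecule can be written in several ways.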

Network Architecture

Our network was trained on SMILES from the ChEMBL database. Since the context is small-molecule generation, only SMILES strings with fewer than 120 characters were used. The data was divided into 80% training data and 20% testing data. Bayesian optimization was used to tune hyper-parameters such as the number of hidden layers, the activation functions and the learning rate.

Our model is comprised of encoder, predictor and decoder networks. The encoder has three bi-GRU layers, a flatten layer and a dense layer. The predictor is comprised of a dense layer and three 1D convolutional layers. The latent space dimension was set to 292. The encoder was fed data from the SMILES database after one-hot encoding. The encoded data is sampled in the latent space using mean and standard deviation vectors. The latent vector produces new samples after passing through the decoder. The decoder has three uni-GRU layers followed by a dense layer and a flatten layer.

The input variable x is generated from a generative distribution p_θ(x | y, z), which is conditioned on the output variable y and the latent variable z. The prior distributions are denoted by p(y) = N(y | μ_y, Σ_y) and p(z) = N(z | 0, I). In our case, x denotes molecules while y denotes properties. Standard deviation is used on both the y and z terms before they are passed through the decoder. Our network is shown in Figure 2:

Figure 2: Illustration of our network architecture
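To make the data flow concrete, the sketch below shows one-hot encoding of a SMILES string, sampling a latent vector from mean and standard deviation vectors, and building a conditioned decoder input. It is an illustration only: the tiny vocabulary, the space used for padding, and the choice to concatenate z with y are assumptions, since the source does not describe how the condition vector is combined with the latent vector. The GRU and CNN layers themselves are not reproduced here.

```python
import random

def one_hot_encode(smiles, vocab, max_len, pad):
    """Encode a string as max_len rows, each a one-hot list over vocab."""
    index = {ch: i for i, ch in enumerate(vocab)}
    padded = smiles.ljust(max_len, pad)[:max_len]  # pad or truncate to max_len
    rows = []
    for ch in padded:
        row = [0] * len(vocab)
        row[index[ch]] = 1
        rows.append(row)
    return rows

def sample_latent(mean, std, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1) in each dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

def decoder_input(z, y):
    """Assumed conditioning scheme: concatenate latent z with properties y."""
    return list(z) + list(y)

# Toy vocabulary and sizes; the text above uses a 292-dimensional latent space.
vocab = [" ", "C", "O", "N", "(", ")", "="]
x = one_hot_encode("CC(=O)O", vocab, max_len=10, pad=" ")
rng = random.Random(0)
z = sample_latent([0.0] * 4, [1.0] * 4, rng)
h = decoder_input(z, [0.5, 0.7])
```

Writing the sample as mean plus scaled noise keeps the sampling step differentiable with respect to the mean and standard deviation, which is the usual reason variational autoencoders use this form.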

Experimental Details

An open-source package named RDKit was used for testing the validity of the generated SMILES strings and calculating the properties of the molecules. Samples of the dataset are drawn from the ZINC database (Irwin et al., 2012). The learning rate in our experiments was set to 0.0001 with an exponential decay of 0.99. The model was trained until it converged. A generated molecule is counted as successful if its target property is within a 10% range of the given target value. The molecules are encoded to the latent representation and Gaussian noise is added to it. The standard deviation of this noise was important to tune: a lower value generated molecules similar to the ones in the training set, whereas a larger value generated molecules very different from the ones the network was trained on. The optimal value of the standard deviation was found to be 0.05. During training, we normalize each output variable to have a mean of 0 and a standard deviation of 1. A batch size of 50 was used along with Adam as the optimizer. The property prediction performance is evaluated using mean absolute error (MAE) on the test set. The encoder, predictor and decoder networks consist of three hidden layers, each having 50 gated recurrent units (GRU). The target values for MolWt, LogP, and QED are set as (250, 350, 450), (1.5, 3.0, 4.5), and (0.5, 0.7, 0.9), respectively.
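Several of the quantities in this section are simple enough to write down directly. The helpers below are a hedged sketch, not the authors' implementation: the function names are hypothetical, the 10% criterion is read as plus or minus 10% of the target, and `decay_steps` is an assumed parameter because the source does not say whether the decay is applied per step or per epoch.

```python
import math
import random

def standardize(values):
    """Scale values to mean 0 and standard deviation 1 (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("cannot standardize constant values")
    return [(v - mean) / std for v in values], mean, std

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def is_success(value, target, tolerance=0.10):
    """True if value lies within tolerance * |target| of target (assumed +/-)."""
    return abs(value - target) <= tolerance * abs(target)

def decayed_learning_rate(step, base_lr=1e-4, decay=0.99, decay_steps=1):
    """Exponential decay: base_lr * decay ** (step / decay_steps)."""
    return base_lr * decay ** (step / decay_steps)

def add_gaussian_noise(z, sigma=0.05, rng=None):
    """Perturb a latent vector with zero-mean Gaussian noise of std sigma."""
    rng = rng or random.Random()
    return [v + rng.gauss(0.0, sigma) for v in z]

# The MolWt targets from the text, standardized as a toy example.
scaled, mean, std = standardize([250.0, 350.0, 450.0])
```

For example, `is_success(260.0, 250.0)` holds because 260 is within 25 of the target 250, while a generated molecule with MolWt 280 would not count as a success for that target.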

Property Prediction

The fraction of invalid molecules generated by our network was less than 1%. The fraction of unique molecules generated is 90.2%. The average and standard deviation of the MAE are reported in Table 1 for varying fractions of labeled molecules. Our network outperforms the others in most cases. Property distributions are another important tool for evaluating the performance of the network. The following three properties are used:

  1. Molecular weight (MW): It is the sum of atomic weights in a molecule. To figure out whether the generated samples are biased towards lighter or heavier molecules, histograms of molecular weight for the generated and test sets are plotted.
  2. LogP: It is the logarithm of the ratio of a chemical’s concentration in the octanol phase to its concentration in the aqueous phase.
  3. Quantitative Estimation of Drug-likeness (QED): It is a measure of how likely a molecule is to be a viable candidate for a drug. Its value lies between 0 and 1, both included.


The property prediction performance with varying fractions of labeled molecules, compared against ECFP (Rogers and Hahn, 2010), GraphConv (Kearnes et al., 2016), VAE (Gómez-Bombarelli et al., 2018) and SSVAE (Kang and Cho, 2018), is shown in Table 1:

A sample of molecules generated using a standard deviation of 0.05 is presented in Figure 3:

Figure 3: A few randomly selected generated molecules, sampled using different signature and a standard deviation of 0.05.

We demonstrate that our network can generate molecules with the target properties of Aspirin and Tamiflu. The condition vector was made up of custom values for the target properties. The molecules generated by our network for Aspirin are shown in Figure 4:

Figure 4: Molecules generated by our network with the condition vector made of the five target properties of Aspirin

The generated molecules are considerably different from the original molecules since the latent vectors were chosen randomly. The molecules generated by our network for Tamiflu are shown in Figure 5:

Figure 5: Molecules generated by our network with the condition vector made of the five target properties of Tamiflu


Generative models like VAE and CVAE are able to learn a smooth latent representation of the input data and perform interpolations on it. To carry out an interpolation, a start point and an end point are needed. Both of these points should represent a molecule in chemical space. The interpolation samples between Aspirin and Paracetamol are shown in Figure 6:

Figure 6: Generated samples using interpolation.
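Interpolation between two encoded molecules can be sketched as below. The source does not say which interpolation scheme is used, so straight-line (linear) interpolation is assumed here, and the two latent vectors are made-up placeholders rather than real encodings of Aspirin and Paracetamol.

```python
def interpolate(z_start, z_end, n_steps):
    """Return n_steps latent vectors evenly spaced from z_start to z_end."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    path = []
    for k in range(n_steps):
        t = k / (n_steps - 1)  # t runs from 0 (start) to 1 (end)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

# Hypothetical 3-dimensional latent vectors for the two end-point molecules.
z_a = [0.0, 1.0, -1.0]
z_b = [1.0, 0.0, 1.0]
path = interpolate(z_a, z_b, 5)
```

Each vector in `path` would then be passed through the decoder to obtain one intermediate molecule.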


Drug discovery, or new molecule generation, has garnered a lot of attention from the deep learning community. Since it is a very important but challenging problem in Cheminformatics, a lot of work has been done using a variety of neural networks. In this paper, we propose a novel network for generating molecules that are new but similar to the ones it has been trained on. The network is made up of three parts: encoder, predictor and decoder networks. We present the loss functions, molecular representation and experimental details. Using molecular weight, LogP and Quantitative Estimation of Drug-likeness as the property prediction metrics, our network performs better than previous state-of-the-art approaches.


  • N. De Cao and T. Kipf (2018) MolGAN: an implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.
  • G. B. Goh, C. Siegel, A. Vishnu, N. O. Hodas, and N. Baker (2017) Chemception: a deep neural network with minimal chemistry knowledge matches the performance of expert-developed QSAR/QSPR models. arXiv preprint arXiv:1706.06689.
  • R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik (2018) Automatic chemical design using a data-driven continuous representation of molecules.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, et al. (2016) PubChem substance and compound databases.
  • K. Preuer, G. Klambauer, F. Rippmann, S. Hochreiter, and T. Unterthiner (2019) Interpretable deep learning in drug discovery. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 331–345.
  • M. H. Segler, T. Kogej, C. Tyrchan, and M. P. Waller (2018) Generating focused molecule libraries for drug discovery with recurrent neural networks.
  • M. Simonovsky and N. Komodakis (2018) GraphVAE: towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–422.
  • K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
