Using AI to Generate Art


A Voice-Enabled Art Generation Tool

Image by Author

I’ve read and applied some of the latest AI research so you don’t have to. In this article I talk through the most advanced text-to-image generation models, and I apply them to make a voice-enabled art generation tool.


Several groundbreaking papers have appeared in the field of text-to-image generation over the past few years. The field keeps growing in popularity as new ways to monetize these artworks emerge online.

Art generation models build heavily on some of my previous articles, particularly generative methods like Variational Auto Encoders (VAEs) and Generative Adversarial Networks (GANs).

As a bit of a spoiler, here are some of the images these amazing models can make:

A cool building in a green landscape (Image by Author)
a hand sketch (Image by Author)
sunset, boat and lake (Image by Author)

As you can see, the results are amazing. But before I get to the app, first a bit of theory. I'll recap VAEs, then talk about how they have been adapted in VQ-VAEs and VQ-GANs. Finally, I'll cover OpenAI's CLIP paper and how it has changed the game when it comes to text-to-art.

VAEs Recap

Variational Auto Encoders are an integral part of how these models work. An autoencoder trains two neural networks: the first is an encoder, the second is a decoder.

Image by Author

The encoder is tasked with shrinking the dimensionality from an input image (the training data) to a lower-dimensional latent vector. The decoder’s job is to reconstruct the image as best as possible from the latent vector.

The difference between standard autoencoders and VAEs is that instead of mapping each input to a single latent vector, VAEs parametrize a latent distribution (most commonly an isotropic Gaussian) by predicting its mean and standard deviation. Because the latent space is fitted to a Gaussian, we can sample from it very cheaply and feed the samples to the decoder. The decoder can then generate new images that look like the training data.

Key takeaway: Autoencoders learn the mapping between latent vectors and image space. VAEs learn the mapping between a latent space and an image space.
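The sampling step above can be sketched in a few lines. This is a toy numpy illustration of the reparameterization trick; the shapes and names are my own, not taken from any particular library:

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    The encoder predicts mu and log_var; sampling this way keeps the path
    from the encoder outputs to z differentiable during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Toy encoder outputs for a batch of 4 images with an 8-dimensional latent
mu = np.zeros((4, 8))
log_var = np.zeros((4, 8))  # log_var = 0 means sigma = 1
z = sample_latent(mu, log_var, np.random.default_rng(0))
print(z.shape)  # (4, 8)
```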

ELBO (Image by Author)

VAEs try to maximize the ELBO (Evidence Lower Bound). If you want to see where this comes from, check out this article. The first term is the reconstruction error; the second is the KL divergence between our posterior q(z|x) and the prior p(z).

  • q(z|x) is the posterior: the probability of the latent vector z given our input data x
  • p(z) is the prior distribution, as mentioned earlier usually a Gaussian

Minimizing the second term essentially regularizes the model, always pulling the posterior as close as possible to the prior (a Gaussian, for example).
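For a diagonal-Gaussian posterior and a standard-normal prior, both ELBO terms have simple closed forms. A minimal numpy sketch, with my own notation and flattened inputs assumed:

```python
import numpy as np

def elbo_terms(x, x_recon, mu, log_var):
    """Return the two (negative) ELBO pieces per batch element:
    reconstruction error, and the closed-form KL( N(mu, sigma^2) || N(0, I) )."""
    reconstruction = np.sum((x - x_recon) ** 2, axis=-1)
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)
    return reconstruction, kl

# When the posterior already matches the prior, the KL term vanishes
_, kl = elbo_terms(np.ones((2, 3)), np.ones((2, 3)), np.zeros((2, 4)), np.zeros((2, 4)))
print(kl)  # [0. 0.]
```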

VQ-VAEs (Vector Quantised VAEs)

Based on: Neural Discrete Representation Learning, by Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu. Nov 2017. [1]

A lot of today's state-of-the-art AI models, such as DALL-E and VQ-GANs, are based on VQ-VAEs, so understanding this paper properly is fundamental. VQ-VAE is a model developed by DeepMind that improves the VAE's ability to generate images.

There are two key differences between VAEs and VQ-VAEs: VQ-VAEs learn their prior, and the output of the encoder is discrete (instead of a learned latent distribution).

VQ-VAEs are still learning the mapping between a latent space and the image space, however, they discretize the latent space.

The reasoning behind a discrete representation of the image is that language is inherently discrete: it can be expressed as a sequence of discrete symbols. Images can be described through language, so the decoder should be able to reconstruct images from a discrete space.

Image from the Neural Discrete Representation Learning Paper [1]

VQ-VAEs first encode the image to a series of vectors using the encoder. The encoded vectors are then mapped to an embedded space using a “codebook”. The codebook is learnable, and it is the series of vectors shown in the diagram as the embedding space. One can think of the codebook as the dictionary and the discretized latent space as our language. Once the output of the model is mapped using the codebook, we essentially have a description of the input image. It is up to the decoder to take this encoded description, and reconstruct the original image.

So each vector from our encoded image is mapped to an embedded vector from the codebook. The way this is done is by choosing the closest vector in the codebook to each encoded vector (distance being the L2 norm).

Once these vectors are mapped to their discretized latent space, they are fed to the decoder, which hopefully reconstructs the input image accurately.
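The nearest-neighbour lookup into the codebook is simple to sketch in numpy (toy sizes, my own names):

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry (L2 norm).
    z_e: (n, d) encoder outputs; codebook: (K, d) learnable embeddings.
    Returns the chosen indices and the quantized vectors fed to the decoder."""
    dists = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)  # (n, K)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # K=3 entries, d=2
z_e = np.array([[0.9, 1.1], [4.8, 5.2]])
indices, z_q = quantize(z_e, codebook)
print(indices)  # [1 2]
```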

VQ-VAEs Loss [1]

Above is the loss term of VQ-VAEs. Comparing it to the VAE ELBO loss, you can see we have a similar reconstruction error. We no longer have a KL term (it disappears because of the uniform prior); instead, we have a second and third term. In the second term, a stop gradient freezes the encoded vectors (z) and helps learn the codebook vectors (e). The third term is the opposite: it freezes the codebook vectors and helps learn the encoded vectors.

The decoder uses the first term only to be optimized, the encoder uses the first and last term, and the embedding (learning the codebook vectors) uses the middle term only.
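Putting the three terms together, here is a toy numpy version of the loss. Since numpy has no autograd, the stop-gradient is written as an identity placeholder; all names are my own:

```python
import numpy as np

def sg(x):
    """Stand-in for the stop-gradient operator: identity in the forward pass.
    In an autograd framework like PyTorch this would be x.detach()."""
    return x

def vq_vae_loss(x, x_recon, z_e, z_q, beta=0.25):
    """The three VQ-VAE loss terms from [1]:
    reconstruction   -> optimizes the decoder (and, via straight-through, the encoder)
    codebook_loss    -> moves codebook vectors e toward frozen encoder outputs
    commitment_loss  -> moves encoder outputs toward frozen codebook vectors"""
    reconstruction = np.mean((x - x_recon) ** 2)
    codebook_loss = np.mean((sg(z_e) - z_q) ** 2)
    commitment_loss = beta * np.mean((z_e - sg(z_q)) ** 2)
    return reconstruction + codebook_loss + commitment_loss

# Perfect reconstruction, encoder outputs sitting exactly on codebook vectors: loss is 0
print(vq_vae_loss(np.ones((2, 2)), np.ones((2, 2)), np.zeros((2, 2)), np.zeros((2, 2))))  # 0.0
```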

Throughout training the prior is kept uniform; after training, the authors fit a new prior over the discrete codes using PixelCNN.

VQ-GANs (Vector Quantised GANs)

Based on: Taming Transformers for High-Resolution Image Synthesis, by Patrick Esser, Robin Rombach, Björn Ommer. Dec 2020. [2]

High-resolution synthetic image by Taming Transformers for High-Resolution Image Synthesis [2]

VQ-GANs are an improvement on VQ-VAEs. This paper is groundbreaking, as it is the first to generate coherent high-resolution images through the use of transformers. It does so by exploiting a fundamental assumption: low-level structure in images can be described by local connectivity (convolutions), whereas the higher-level, global composition of the image cannot, and needs a more expressive model.

These models use a combination of GANs and VAEs. As a quick reminder, GANs learn in two parts. The first is the generator, which generates images; the second is the discriminator, a classifier that attempts to recognize whether the image you give it is real or fake. The two models learn together, essentially trying to trick each other. Once they are trained, you keep the generator to generate new images.

Image by Taming Transformers for High-Resolution Image Synthesis [2]

The image above shows the architecture of VQ-GANs. As you’ll probably realize, the bottom block is VQ-VAEs which has been integrated into this architecture, as it acts as the generator of the GAN.

VQ-GAN is essentially an improvement to VQ-VAE, with two major differences. First, instead of using PixelCNN as the model to fit a prior, they fit a transformer (they used GPT-2 from OpenAI). Second, they improved the loss function, this time using an adversarial loss (the GAN architecture).

You might notice in the image above that the discriminator output is a matrix rather than a single binary output. This is because they use a PatchGAN discriminator, which takes overlapping patches of the image and acts like a normal discriminator on each patch.

The transformer in the image is trained like a sequential language model: it tries to predict the next codebook index in the encoded sequence, thereby learning a prior over the discrete latent codes.

CLIP (Contrastive Language-Image Pre-training)

Based on: Learning Transferable Visual Models From Natural Language Supervision by Alec Radford et al. Feb 2021.[3]

So far we have seen ways to generate images, but how do we connect this to text? OpenAI’s CLIP is a model that can determine which caption best goes with which image. It cannot generate images, but it is the connection between image generation and text.

Image by Learning Transferable Visual Models From Natural Language Supervision [3]

Above is the architecture of the CLIP model. There are three parts to the model shown in the image.

  1. Contrastive pre-training

The inputs to the model are image and description text pairs. The text and image inputs are each passed through their own encoder. Once this is done, you can construct the contrastive space shown in the matrix above. The goal in training is to push the cosine similarities of the blue squares (the matched image-text pairs on the diagonal) to 1, and the cosine similarities of all the other squares to 0. In this way, we teach the text encoder and the image encoder to map their inputs to a shared space.
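A toy numpy sketch of that similarity matrix (the embeddings and temperature value are made up for illustration):

```python
import numpy as np

def contrastive_logits(image_emb, text_emb, temperature=0.07):
    """Cosine-similarity matrix between N image and N text embeddings.
    Training pushes the diagonal (matched pairs) up and everything else down,
    via a symmetric cross-entropy over the rows and columns of this matrix."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_emb @ text_emb.T / temperature

# Toy embeddings where each image already matches its own caption
logits = contrastive_logits(np.eye(3), np.eye(3))
print(np.argmax(logits, axis=1))  # [0 1 2]
```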

  2. Create dataset classifier from label text

The contrastive pre-training alone is not enough to determine whether a text and an image match. We need some sort of classification to take place. For example, say we have two images, one of a dog and one of a cat; these images might be very similar to each other. If we take the image of the dog but the text says cat, we want CLIP to know that this image is not related to that text.

To apply this classifier functionality to the pre-trained model, they first need to build the classifier dataset. They do this by constructing sample sentences from labeled image data. There are tons of labeled image datasets online: CIFAR-10, ImageNet, etc. These images have categories assigned to them, not sentences. To get around this, they insert the class label into a template sentence.

For example, if you had an image of a dog, you could give the text encoder a sentence such as “A photo of a dog”. We do this to construct the embedding vector of each class in that dataset.

  3. Use for zero-shot prediction

Once we have the learned embedding vectors, we can give the image encoder the corresponding image pair to the text, and we can compare the output of the encoder to our embeddings. The one with the largest cosine similarity will be the class.
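Steps 2 and 3 together amount to a nearest-embedding lookup. Here is a toy sketch with made-up stand-in embeddings; real CLIP would produce them with its frozen text and image encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Return the class whose prompt embedding ("A photo of a {class}") has
    the highest cosine similarity with the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = class_text_embs @ image_emb
    return class_names[int(np.argmax(sims))]

# Stand-in embeddings for the prompts "A photo of a dog" / "A photo of a cat"
class_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
image = np.array([0.2, 0.9])  # toy image embedding, closest to the "cat" prompt
pred = zero_shot_classify(image, class_embs, ["dog", "cat"])
print(pred)  # cat
```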

Because both encoders map into the same space, this zero-shot classification works surprisingly well, even on datasets CLIP was never explicitly trained on.


So we have VQ-GANs, which can generate images based on a discrete space, and we have CLIP, which is able to score how well an image matches to text and vice versa.

When text is input to VQGAN-CLIP, VQ-GAN is first initialized with random noise and produces an image. That image is then fed to the CLIP image encoder and scored for its similarity to the text input. The latent codes driving VQ-GAN can then be updated by gradient descent to better reflect the text input. This is done iteratively, for N iterations, until the output of the generator is a satisfactory representation of the text input.
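The loop can be caricatured with a linear "generator" in numpy. This is only a toy of the idea, gradient steps on the latent to increase similarity to the text embedding; real VQGAN-CLIP backpropagates through the full VQ-GAN decoder and CLIP image encoder:

```python
import numpy as np

def optimize_latent(z, text_emb, generator_w, steps=200, lr=0.1):
    """Toy VQ-GAN + CLIP loop: nudge the latent z so the 'image embedding'
    G(z) = W @ z moves toward the target text embedding."""
    for _ in range(steps):
        image_emb = generator_w @ z
        grad = generator_w.T @ (image_emb - text_emb)  # gradient of squared distance
        z = z - lr * grad
    return z

rng = np.random.default_rng(0)
w = np.eye(4)  # identity "generator" keeps the toy example transparent
target = np.array([1.0, 0.0, 0.0, 0.0])
z_final = optimize_latent(rng.standard_normal(4), target, w)
print(np.allclose(w @ z_final, target, atol=1e-3))  # True
```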

Convergence of algorithm (GIF by Author)

There are more advanced models out there, such as DALL-E by OpenAI and, more recently, DALL-E 2. (The name DALL-E comes from a mix of Dalí and WALL-E, very clever.) These models produce insanely sophisticated representations of the text inputs. However, they aren't public, so I'm not able to mess around with them or show their work here. But you can check some of them out through this link.

My own Art Generation App

Now that I’ve gone through all the theory and the latest models, I want to apply them myself. Previously I have made my own Smart Assistant App. Here I want to adapt it to allow it to generate art using AI. You can check out some of the code to this app on my Github.

The idea is to be able to access these models through my voice. I want to use the smart assistant as normal, but if I give the model a “trigger word”, it will use the rest of the sentence as the prompt to generate a piece of art.

App Architecture (Image by Author)

To transcribe the audio to text, I used AssemblyAI's WebSocket API, which transcribes my speech live. To execute the models, I used Google Cloud's Vertex AI.

To run the models and generate art, I need to execute the models as jobs on the cloud. I cannot run these models on my local machine as they really need GPUs. To run these models and generate images, I first installed the required libraries by setting up a “ file”. I then executed the jobs on a virtual machine with access to a virtual GPU.

The last thing to do was to configure the smart assistant to enable it to generate art. I incorporated the “trigger word” by simply adding an if statement, if the sentence starts with “Generate”, then I pass the rest of the sentence to the art generator model as a job to Vertex AI. Once the job is complete I save the image onto a bucket, which I can then access and load onto the app.
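The trigger-word check can be sketched in a few lines; the function name and return convention are mine, the article only describes the if statement:

```python
def parse_command(transcript, trigger="generate"):
    """Return the art prompt if the transcript starts with the trigger word,
    otherwise None so the smart assistant handles the sentence as usual."""
    words = transcript.strip().split(maxsplit=1)
    if len(words) == 2 and words[0].lower() == trigger:
        return words[1]
    return None

print(parse_command("Generate a painting of a city skyline"))
# a painting of a city skyline
print(parse_command("What's the weather like?"))  # None
```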

Image by Author

Here, I asked the bot to “generate a painting of a city skyline”. The trigger word is “generate”, the app then takes the rest of the sentence and passes it to the job request. I found this trigger word was reliable and worked well.

And that’s it, I now have a smart assistant app that can generate art for me on request.

Image by Author


In this article, I walk through the technical aspects of the latest AI research papers on generating images from text. I recap basic generative architectures such as VAEs (variational auto-encoders) and GANs (generative adversarial networks), then talk about how these were adapted in the Vector Quantised papers to improve their ability to generate images. In summary, these methods discretize the latent space, and when combined with CLIP (a way to score how well an image matches a text), they are able to generate amazing pieces of art.

I am particularly excited to see how this field of research will impact the digital art industry in the future, such as in video games or animated films, as I believe this technology has the potential to revolutionize these industries!

Support me

Hopefully, this helped you, if you enjoyed it you can follow me!

You can also become a Medium member using my referral link, and get access to all my articles and more:

Other articles you might enjoy

Support Vector Machines

Precision vs Recall

Bonus: How to launch jobs to Vertex AI

As a little extra, I want to include some info on how to deploy these models because I ran into a lot of difficulties and it isn’t as simple as it may seem.

These models run on a pre-built image on the cloud. On GCP there are lots of them, the one I used was “–10:latest”.

To get the models to run, however, I needed some more packages that were not present in the image. To get around this I passed a set of packages along with the custom job.

To pass the set of packages these had to be compiled into a tar.gz file, I used the “setuptools” Python library for this.
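A minimal setup.py consistent with this step might look as follows; the package name and dependency list are illustrative assumptions, not the article's actual values:

```python
# Hypothetical setup.py -- name and dependencies are illustrative assumptions
from setuptools import find_packages, setup

setup(
    name="art_generator",      # assumed package name
    version="0.1",
    packages=find_packages(),
    install_requires=[
        # extra libraries missing from the pre-built image (illustrative)
        "omegaconf",
        "einops",
    ],
)
```

Running `python setup.py sdist --formats=gztar` then produces the tar.gz archive under `dist/` to upload alongside the custom job.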

I then had to wrap the Python jobs as a module. This is done by adding __init__.py files in the directory containing the script, and modifying the Python script slightly.

Here is the custom job function I used:
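The screenshot of the function did not survive here, so below is a hypothetical reconstruction using the google-cloud-aiplatform client. The project, bucket, image URI, machine type and module name are all assumptions, not the article's actual values:

```python
# Hypothetical sketch of launching the art-generation job on Vertex AI.
# All resource names below are assumptions for illustration only.
def build_worker_pool_specs(prompt, package_uri, image_uri, module_name):
    """Assemble the worker pool spec for a single-replica GPU job that runs
    our packaged Python module with the art prompt passed as a flag."""
    return [{
        "machine_spec": {
            "machine_type": "n1-standard-8",       # assumed machine type
            "accelerator_type": "NVIDIA_TESLA_T4", # assumed GPU
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "python_package_spec": {
            "executor_image_uri": image_uri,
            "package_uris": [package_uri],
            "python_module": module_name,
            "args": [f"--prompt={prompt}"],
        },
    }]

def launch_job(prompt):
    # Imported here so the spec builder above stays usable without GCP credentials
    from google.cloud import aiplatform

    aiplatform.init(project="my-project", staging_bucket="gs://my-bucket")
    job = aiplatform.CustomJob(
        display_name="art-generation",
        worker_pool_specs=build_worker_pool_specs(
            prompt,
            package_uri="gs://my-bucket/art_generator-0.1.tar.gz",  # assumed path
            image_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-10:latest",  # assumed image
            module_name="trainer.generate",  # assumed module name
        ),
    )
    job.run()
```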

The job cannot be the same python script every time. This is because the caption changes each time I request a different painting. Each time I launch a job, I somehow need to pass that information to that particular job.

The way I did this was through FLAGS. FLAGS can be passed through the command line as options when executing the Python file. With the absl-py library I was able to read these and pass them to the art generation model.
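A minimal sketch of the flag handling with absl-py; the flag name `prompt` is my assumption, since the article only says the caption is passed via FLAGS:

```python
from absl import flags

FLAGS = flags.FLAGS
flags.DEFINE_string("prompt", None, "Text prompt for the art generation model.")

def parse_prompt(argv):
    """Parse command-line flags and return the prompt string."""
    FLAGS.unparse_flags()  # allow repeated parsing (e.g. in tests)
    FLAGS(argv)            # argv[0] is the program name, as with sys.argv
    return FLAGS.prompt

print(parse_prompt(["generate.py", "--prompt=sunset over a lake"]))
# sunset over a lake
```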

Here is the file structure with the jobs I sent to VertexAI.
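The original image of the structure is not recoverable here, but a layout consistent with the packaging steps above would look roughly like this (all names hypothetical):

```
art_generator/
├── setup.py           # builds the tar.gz passed to Vertex AI
└── trainer/
    ├── __init__.py    # makes the job importable as a module
    └── generate.py    # reads --prompt and runs the art generation model
```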

Hope this helps 🙂


[1] A. van den Oord, O. Vinyals and K. Kavukcuoglu, “Neural Discrete Representation Learning” (VQ-VAE), arXiv, 2017.

[2] P. Esser, R. Rombach and B. Ommer, “Taming Transformers for High-Resolution Image Synthesis” (VQ-GAN), arXiv, 2020.

[3] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision” (CLIP), arXiv, 2021.

