Hugging Face Just Released the Diffusers Library

Making diffusion models like DALL-E 2 and Imagen more accessible than ever

Edit by author, background photo by Michael Dziedzic on Unsplash

Hugging Face, the creators of the transformers library, have just released a whole new library for building diffusion models. If you’re not sure what diffusion models are, think of them as the not-so-secret secret behind some of the most talked-about AI models this year.

Those insanely beautiful, creative, almost human art pieces you’ve seen all over the internet? They’re images from OpenAI’s DALL-E 2, Google’s Imagen, and Midjourney. All of those services generate their images using diffusion models.

A few DALL-E 2 generated images [1].

Now Hugging Face has released an open source library that focuses on diffusion models. With it, we can download a pretrained model and begin generating images in just a few lines of code.

The new diffusers library has made these wildly complex models intuitive and simple. In this article we’ll explore how the new library works, generate a few of our own images, and see how they compare to the state-of-the-art models mentioned above.

If you prefer video, you can watch the video walkthrough of the article below or via this link.

Getting Started

To begin we need to pip install diffusers and initialize a diffusion model or pipeline (typically consisting of preprocessing/encoding steps followed by a diffuser). We will use a text-to-image diffusion pipeline:
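
The setup can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code from the original post: it assumes the "CompVis/ldm-text2im-large-256" checkpoint named later in this article, and that transformers and torch are installed alongside diffusers. The `MODEL_ID` constant and `load_pipeline` helper are names made up for illustration.

```python
# Install first, from a shell (the extra packages are an assumption):
#   pip install diffusers transformers torch

# Checkpoint named later in the article.
MODEL_ID = "CompVis/ldm-text2im-large-256"


def load_pipeline(model_id: str = MODEL_ID):
    """Download (on first call) and return a pretrained diffusion pipeline."""
    # Imported inside the function so this file can be imported
    # even on a machine where diffusers is not installed yet.
    from diffusers import DiffusionPipeline

    return DiffusionPipeline.from_pretrained(model_id)


# Usage:
# ldm = load_pipeline()
```

The first call downloads the model weights, so expect it to take a while.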

Now all we need to do is create a prompt and run it through our ldm pipeline. Taking inspiration from Hugging Face’s introductory notebooks, we will try to generate a painting of a squirrel eating a banana.
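
Here is a sketch of that step, assuming `ldm` is an already initialized text-to-image pipeline. The `generate` helper is hypothetical, and the values chosen for `num_inference_steps` and `guidance_scale` are illustrative assumptions. The return type has changed between diffusers releases (as far as I know, early versions returned a dict with a "sample" key and later ones return an object with an `images` attribute), so the helper tries to handle both.

```python
def generate(pipeline, prompt: str, steps: int = 50, guidance_scale: float = 6.0):
    """Run one prompt through a text-to-image pipeline and return the first image."""
    output = pipeline(
        [prompt],
        num_inference_steps=steps,      # number of denoising steps (assumed value)
        guidance_scale=guidance_scale,  # how strongly to follow the prompt (assumed value)
    )
    # Assumption: newer releases expose `.images`, early ones returned {"sample": [...]}.
    images = output.images if hasattr(output, "images") else output["sample"]
    return images[0]


# Usage, given a loaded pipeline `ldm`:
# image = generate(ldm, "A painting of a squirrel eating a banana")
# image.save("squirrel.png")  # assuming the pipeline returns PIL images
```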

There we go, that was incredibly easy. Now, the image isn’t as impressive as what DALL-E 2 produces, but we did this with five lines of code, for free, with the very first release of a new library. If that isn’t cool, I don’t know what is.

Here’s another painting of a squirrel eating a banana:

Maybe it’s modern art?

Prompt Engineering

An interesting trend that has appeared since the release of the big three diffusion models (DALL-E 2, Imagen, and Midjourney) is the increased focus on something called “prompt engineering”.

Prompt engineering is exactly what it says on the tin: literally the “engineering” of prompts to achieve a desired result. For example, many people have found that adding “in 4K” or “rendered in Unity” can enhance the realism of images generated by the big three (despite none of them generating in 4K resolution).

What happens if we try the same with our simple diffuser model?
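
One simple way to try this is to build a few variants of the same prompt, each with a different modifier appended. The sketch below only constructs the prompt strings. The `with_modifiers` helper and the base prompt are assumptions for illustration, while the two modifiers are the ones mentioned above.

```python
def with_modifiers(base_prompt: str, modifiers: list[str]) -> list[str]:
    """Return the plain prompt followed by one variant per modifier."""
    return [base_prompt] + [f"{base_prompt}, {modifier}" for modifier in modifiers]


# Hypothetical base prompt; the modifiers come from the text above.
prompts = with_modifiers(
    "a photo of a squirrel eating a banana",
    ["in 4K", "rendered in Unity"],
)

for prompt in prompts:
    print(prompt)
```

Each of these strings can then be passed to the pipeline in turn, and the resulting images compared side by side.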

Every image is weird in some way or another, and the placement of these bananas is certainly questionable. However, you have to give the model credit: the detail on some of these squirrels is fairly good, and look at that banana’s reflection in image #1.

Watch out photographers and painters, "CompVis/ldm-text2im-large-256" is coming for you.

When in Rome

I’m currently staying in Rome, which during the height of summer is a terrible idea. Nonetheless, the pizza is second to none, and with no landmark more iconic than the Colosseum, I wondered: what would it look like if an Italian were eating pizza on top of that landmark?

Granted, we’re not sitting on top of the Colosseum here, but I appreciate the effort. The Colosseum itself looks great, despite a mismatch in sky color between the arches.

Other than weird hot-dog hands and a tiny pizza, our Italian eating pizza looks great. The choice of sunglasses gives him a 90s dad vibe.

It’s hard to judge with any confidence from this single prompt, but I thought it interesting to note that the model didn’t generate any images of women eating pizza, despite ~51% of Italians being women [2]. Nor did the model generate images of non-white males — however, I did not run the model enough to determine if the latter is statistically significant.
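
To get a rough feel for how many images would be needed before an absence like this starts to look meaningful, here is a back-of-the-envelope sketch. It rests on two strong assumptions that are mine, not measured properties of the model: that each generated image is independent, and that an unbiased model would show a woman with probability equal to the ~51% population share. The `prob_zero_of_group` helper is hypothetical, and the image counts are examples, not the number of images actually generated for this article.

```python
def prob_zero_of_group(n_images: int, group_share: float = 0.51) -> float:
    """Chance that none of n independent images shows a member of the group."""
    if n_images < 0:
        raise ValueError("n_images must be non-negative")
    # Each image misses the group with probability (1 - share); n misses in a row.
    return (1.0 - group_share) ** n_images


# Example image counts (illustrative only).
for n in (4, 8, 16):
    print(f"{n} images: {prob_zero_of_group(n):.4%}")
```

Under those assumptions, seeing no women in four images would happen by chance roughly 6% of the time, and in eight images roughly 0.3% of the time, so a handful of images is suggestive but far from conclusive.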

The effects of bias in this model and other future Hugging Face-hosted models will undoubtedly be an important focus for the present and future of the library.


Returning to the squirrels, trying to generate more abstract images such as “a giant squirrel destroying a city” leads to mixed results:

To me, it seems like the model struggles to blend two typically unrelated concepts: a (giant) squirrel and a city. This behavior is emphasized by these two images generated from the same prompt:

Here we can see either a city skyline, or a squirrel-like object in an environment more commonly associated with squirrels. After running these prompts several times, I found it switched between the two and never merged both together.

Just for fun, here’s what DALL-E 2 produces from the prompt "a dramatic shot of a giant squirrel destroying a modern city":

Generated by the author using OpenAI’s DALL-E 2.

These are all very impressive, but as alluded to already, we can’t expect comparable performance between these two options, for now.

That’s it for this first look at Hugging Face’s latest library. All things considered, I am massively excited to see this library develop. Right now, the best diffusion models are all locked behind closed doors, and I view this framework as a key that could unlock some awesome levels of AI-boosted creativity.

That’s not to say this framework is near replacing DALL-E 2, Imagen, or Midjourney. It’s not, and the world is a better place with the greater variety of choices between commercial and open source offerings.

These open sourced models allow normal people like you or me to get our hands on some of the latest advances in deep learning. And when a lot of people experiment with new tech freely, cool things tend to happen.

I’m excited to see where this goes. If you’re interested in seeing more models in action, check out my video on diffusers:

It’d be great to keep in touch too! I post regularly on YouTube, and I’m active alongside many others interested in ML on Discord here.

Thanks for reading!

