DALL·E 2, Explained: The Promise and Limitations of a Revolutionary AI




With images and videos that you most likely haven’t seen.

“Vibrant portrait painting of Salvador Dalí with a robotic half face.” Credit: OpenAI

DALL·E 2 is the newest AI model by OpenAI. If you’ve seen some of its creations and think they’re amazing, keep reading to understand why you’re totally right — but also wrong.

OpenAI published a blog post and a paper entitled “Hierarchical Text-Conditional Image Generation with CLIP Latents” on DALL·E 2. The post is fine if you want to get a glimpse at the results and the paper is great for understanding the technical details, but neither explains DALL·E 2’s amazingness — and the not-so-amazing — in depth. That’s what this article is for.

DALL·E 2 is the new version of DALL·E, a generative language model that takes sentences and creates corresponding original images. At 3.5B parameters, DALL·E 2 is a large model but not nearly as large as GPT-3 and, interestingly, smaller than its predecessor (12B). Despite its smaller size, DALL·E 2 generates images at 4x the resolution of DALL·E’s, and human judges prefer it more than 70% of the time for both caption matching and photorealism.

As they did with DALL·E, OpenAI didn’t release DALL·E 2 (you can always join the never-ending waitlist). However, they open-sourced CLIP which, although only indirectly related to DALL·E, forms the basis of DALL·E 2. (CLIP is also the basis of the apps and notebooks people who can’t access DALL·E 2 are using.) Still, OpenAI’s CEO, Sam Altman, said they’ll eventually release DALL·E models through their API. For now, only a select few have access (they’re opening the model to 1000 people each week).

This is surely not the first DALL·E 2 article you’ve seen, but I promise not to bore you. I’ll give you new insights to ponder and will add depth to ideas others have touched on only superficially. Also, I’ll go light on this one (although it’s quite long), so don’t expect a highly technical article: DALL·E 2’s beauty lies in its intersection with the real world, not in its weights and parameters.


This article is divided into four sections.

  1. How DALL·E 2 works: What the model does and how it does it. I’ll add at the end an “explain like I’m five” practical analogy that anyone can follow and understand.
  2. DALL·E 2 variations, inpainting, and text diffs: What are the possibilities beyond text-to-image generation. These techniques generate the most stunning images, videos, and murals.
  3. My favorite DALL·E 2 creations: I’ll show you my personal favorites that many of you might not have seen.
  4. DALL·E 2 limitations and risks: I’ll talk about DALL·E 2’s shortcomings, which harms it can cause, and what conclusions we can draw. This section is subdivided into social and technical aspects.

How DALL·E 2 works

I’ll explain DALL·E 2 more intuitively soon, but I want you to form now a general idea of how it works without resorting to too much simplification. These are the four key high-level concepts you have to remember:

  • CLIP: Model that takes image-caption pairs and creates “mental” representations in the form of vectors, called text/image embeddings (figure 1, top).
  • Prior model: Takes a caption/CLIP text embedding and generates CLIP image embeddings.
  • Decoder Diffusion model (unCLIP): Takes a CLIP image embedding and generates images.
  • DALL·E 2: Combination of prior + diffusion decoder (unCLIP) models.

DALL·E 2 is a particular instance of a two-part model (figure 1, bottom) made of a prior and a decoder. By concatenating both models we can go from a sentence to an image. That’s how we interact with DALL·E 2. We input a sentence into the “black box” and it outputs a well-defined image.
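
To make the data flow concrete, here is a minimal Python sketch of the two-part chain described above. Every function in it (clip_text_embedding, prior, decoder, generate) is a hypothetical toy stand-in invented for illustration, not OpenAI’s code or API. The real components are large neural networks, and the 8-number vectors and the 4x4 “image” are assumptions chosen only to keep the example small.

```python
import hashlib
import random
from typing import List

EMBED_DIM = 8  # toy size (assumption); real CLIP embeddings are far larger


def clip_text_embedding(caption: str) -> List[float]:
    """Hypothetical stand-in for CLIP's text encoder: caption -> vector."""
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def prior(text_embedding: List[float]) -> List[float]:
    """Hypothetical prior: text embedding -> image embedding.
    A made-up fixed transformation replaces the learned model."""
    return [0.5 * x + 0.1 for x in text_embedding]


def decoder(image_embedding: List[float], seed: int, size: int = 4) -> List[List[float]]:
    """Hypothetical decoder: image embedding + random noise -> toy 'image'.
    The embedding fixes the gist; the seed changes the details."""
    rng = random.Random(seed)
    base = sum(image_embedding) / len(image_embedding)
    return [[base + rng.gauss(0.0, 0.05) for _ in range(size)] for _ in range(size)]


def generate(caption: str, seed: int = 0) -> List[List[float]]:
    """Chain the stages: sentence -> text embedding -> image embedding -> image."""
    return decoder(prior(clip_text_embedding(caption)), seed)


first = generate("a house surrounded by a tree", seed=1)
second = generate("a house surrounded by a tree", seed=2)
print(first != second)  # same caption, different seed, different pixels
```

The only point of the sketch is the shape of the pipeline: the caption is turned into an embedding once, and the decoder can then be run with different noise to get different images that share the same gist.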

It’s interesting to note that the decoder is called unCLIP because it does the inverse process of the original CLIP model — instead of creating a “mental” representation (embedding) from an image, it creates an original image from a generic mental representation.

The mental representation encodes the main features that are semantically meaningful: People, animals, objects, style, colors, background, etc. so that DALL·E 2 can generate a novel image that retains these characteristics while varying the non-essential features.

Figure 1: CLIP (top). Prior+UnCLIP decoder (bottom). Credit: OpenAI

How DALL·E 2 works: Explain like I’m 5

Here’s a more intuitive explanation for those of you who didn’t like the “embedding” and “prior-decoder” bits. To better understand these elusive concepts, let’s play a quick game. Take a piece of paper and a pencil and analyze your thinking process while doing these three exercises:

  1. First, think of drawing a house surrounded by a tree and the sun in the background sky. Visualize what the drawing would look like. The mental imagery that appeared in your mind just now is the human analogy of an image embedding. You don’t know exactly how the drawing would turn out, but you know the main features that should appear. Going from the sentence to the mental imagery is what the prior model does.
  2. You can now do the drawing (it doesn’t need to be good!). Translating the imagery you have in your mind into a real drawing is what unCLIP does. You could easily make another drawing from the same caption, with similar features but a totally different final look, right? That’s also how DALL·E 2 can create distinct original images from a given image embedding.
  3. Now, look at the drawing you just did. It’s the result of drawing this caption: “a house surrounded by a tree and the sun in the background sky.” Now, think about which features best represent that sentence (e.g. there’s a sun, a house, a tree…) and which best represent the image (e.g. the objects, the style, the colors…). This process of encoding the features of a sentence and an image is what CLIP does.

Luckily for us, our brain does analogous processes so it’s very easy to understand at a high level what CLIP and DALL·E 2 do. Still, this ELI5 explanation is a simplification. The example I used is very simple and certainly these models don’t do what the brain does nor in the same way.

DALL·E 2 variations, inpainting, and text diffs

Syntactic and semantic variations

DALL·E 2 is a versatile model that can go beyond sentence-to-image generations. Because OpenAI is leveraging CLIP’s powerful embeddings, they can play with the generative process by making variations of outputs for a given input.

We can get a glimpse of CLIP’s “mental” imagery: what it considers essential from the input (stays constant across images) and what it considers replaceable (changes across images). DALL·E 2 tends to preserve “semantic information … as well as stylistic elements.”

Variations of “The Persistence of Memory” by Salvador Dalí and OpenAI’s logo. Credit: OpenAI

From the Dalí example, we can see how DALL·E 2 preserves the objects (the clocks and the trees), the background (the sky and the desert), the style, and the colors. However, it doesn’t preserve the location or number of either clocks or trees. This gives us a hint of what DALL·E 2 has learned and what it hasn’t. The same happens with OpenAI’s logo. The patterns are similar and the symbol is circular/hexagonal, but neither the colors nor the salient undulations are always in the same place.

DALL·E 2 can also create visual changes in the output image that correspond to syntactic-semantic changes in the input sentence. It seems to be able to adequately encode syntactic elements as separate from one another. From the sentence “an astronaut riding a horse in a photorealistic style” DALL·E 2 outputs these:

“An astronaut riding a horse in a photorealistic style.” Credit: OpenAI

By changing the phrase “riding a horse” to “lounging in a tropical resort in space,” it now outputs these:

“An astronaut lounging in a tropical resort in space in a photorealistic style.” Credit: OpenAI

It doesn’t need to have seen the different syntactic elements together in the dataset to be able to create images that very accurately represent the input sentence with adequate visual semantic relations. If you google any of these captions you’ll find only DALL·E 2 images. It isn’t just creating new images, but images that are new semantically speaking. There aren’t images of “an astronaut lounging in a tropical resort” anywhere else.

Let’s make one last change, “in a photorealistic style” for “as pixel art:”

“An astronaut lounging in a tropical resort in space as pixel art.” Credit: OpenAI

This is one of the core features of DALL·E 2. You can input complex sentences, even with several complement clauses, and it seems to be able to generate coherent images that somehow combine all the different elements into a semantically cohesive whole.

Sam Altman said on Twitter that DALL·E 2 works better with “longer and more detailed” input sentences which suggests that simpler sentences are worse because they’re too general — DALL·E 2 is so good at handling complexity that inputting long, convoluted sentences can be preferable to take advantage of specificity.

Ryan Petersen asked Altman to input a particularly complex sentence: “a shipping container with solar panels on top and a propeller on one end that can drive through the ocean by itself. The self-driving shipping container is driving under the Golden Gate Bridge during a beautiful sunset with dolphins jumping all around it.” (That’s not even only one sentence.)

DALL·E 2 didn’t disappoint:

Dolphins are missing but it’s a wonderful job regardless. Credit: Sam Altman

The shipping container, the solar panels, the propeller, the ocean, the Golden Gate Bridge, the beautiful sunset… everything is in there except the dolphins.

My guess is DALL·E 2 has learned to represent the elements separately by seeing them repeatedly in the huge dataset of 650M image-caption pairs and has developed the ability to merge together with semantic coherence unrelated concepts that are nowhere to be found in that dataset.

This is a notable improvement from DALL·E. Remember the avocado chair and the snail harp? Those were visual semantic mergers of concepts that exist separately in the world but not together. DALL·E 2 has further developed that same capability, to such a degree that if an alien species visited Earth and saw DALL·E 2 images, they couldn’t help but believe they represent a reality on this planet.

Before DALL·E 2 we used to say “imagination is the limit.” Now, I’m confident DALL·E 2 could create imagery that goes beyond what we can imagine. No person in the world has a mental repertoire of visual representations equal to DALL·E 2’s. It may be less coherent in the extremes and may not have an equally good understanding of the physics of the world, but its raw capabilities humble ours.

Still — and this is valid for the rest of the article — never forget that these outputs could be cherry-picked and it remains to be objectively assessed by independent analysts whether DALL·E 2 shows this level of performance reliably for different generations of a given input and across inputs.

Inpainting

DALL·E 2 can also make edits to already existing images, a form of automated inpainting. In the next examples, the left is the original image, and in the center and on the right are modified images with an object inpainted at different locations.

DALL·E 2 manages to adapt the added object to the style already present in that part of the image (i.e. the corgi copies the style of the painting in the second image while it has a photorealistic aspect in the third).

A corgi was added in different locations in the second and third images. DALL·E 2 matches the style of the corgi to the style of the background location. Credit: OpenAI

It also changes textures and reflections to update the existing image to the presence of the new object. This may suggest DALL·E 2 has some sort of causal reasoning (i.e. because the flamingo is sitting in the pool there should be a reflection in the water that wasn’t there previously).

A flamingo was added in different locations in the second and third images. DALL·E 2 updates reflections according to the new position of the flamingo. Credit: OpenAI

However, it could also be a visual instance of Searle’s Chinese Room: DALL·E 2 may just be very good at pretending to understand how the physics of light and surfaces work. It simulates understanding without having it.

DALL·E 2 can have an internal representation of how objects interact in the real world as long as those are present in the training dataset. However, it’d have problems further extrapolating to new interactions.

In contrast, people with a good understanding of the physics of light and surfaces would have no problem generalizing to situations they haven’t seen before. Humans can easily build nonexistent realities by applying the underlying laws in new ways. DALL·E 2 can’t do it just by simulating that understanding.

Again, this critical interpretation of DALL·E 2 helps us keep a cool head and resist the hype that seeing these results generates in us. These images are amazing, but let’s not make them out to be greater than they are because of our tendency to fill in the gaps.

Text Diffs

DALL·E 2 has another cool ability: interpolation. Using a technique called text diffs, DALL·E 2 can transform one image into another. Below is Van Gogh’s The Starry Night and a picture of two dogs. It’s interesting how all intermediate stages are still semantically meaningful and coherent and how the colors and styles get mixed.

DALL·E 2 combines Van Gogh’s The Starry Night and a picture of two dogs. Credit: OpenAI
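
One common way to blend two embeddings smoothly is spherical interpolation (slerp), which moves along the arc between two unit vectors instead of cutting straight through. The sketch below assumes that each image is represented by an embedding vector and that intermediate images come from decoding intermediate vectors. The 3-number vectors and the variable names are made up for illustration, and the decoding step is left out.

```python
import math
from typing import List


def slerp(a: List[float], b: List[float], t: float) -> List[float]:
    """Spherical interpolation between two vectors after normalising them.
    t=0 returns the direction of a, t=1 the direction of b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    a = [x / norm_a for x in a]
    b = [x / norm_b for x in b]
    # Clamp the dot product so rounding errors cannot break acos.
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two directions
    s = math.sin(omega)
    if s < 1e-8:
        return a  # degenerate case (same or opposite direction): stay at a
    weight_a = math.sin((1.0 - t) * omega) / s
    weight_b = math.sin(t * omega) / s
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Toy embeddings (assumptions): one per source image.
starry_night = [1.0, 0.0, 0.0]
two_dogs = [0.0, 1.0, 0.0]

# Five evenly spaced steps from the painting to the photo.
frames = [slerp(starry_night, two_dogs, i / 4) for i in range(5)]
```

Every intermediate vector stays on the unit sphere, which is one plausible reason the in-between images remain coherent rather than washed out.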

DALL·E 2 can also modify objects by taking interpolations to the next level. In the following example, it “unmodernizes” an iPhone. As Aditya Ramesh (first author of the paper) explains, it’s like doing arithmetic between image-text pairs: (image of an iPhone) + “an old telephone” – “an iPhone.”

DALL·E 2 transforming an iPhone into an old telephone. Credit: Aditya Ramesh

Here’s DALL·E 2 transforming a Tesla into an old car:

DALL·E 2 transforming a Tesla into an old car. Credit: Aditya Ramesh

Here’s DALL·E 2 transforming a Victorian house into a modern house:

DALL·E 2 transforming a Victorian house into a modern house. Credit: Aditya Ramesh

These videos are generated frame by frame (DALL·E 2 can’t generate videos automatically) and then concatenated together. At each step, the text diffs technique is repeated with the new interpolated image, until it reaches semantic proximity to the target image.
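
The arithmetic Ramesh describes, (image of an iPhone) + “an old telephone” – “an iPhone,” can be sketched with plain vectors. This is a toy illustration under stated assumptions, not the procedure from the paper: the 3-number embeddings, their made-up axes (modern, old, phone), and the function names normalize and text_diff_frames are all hypothetical, and the step that turns each vector back into an image is omitted.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def text_diff_frames(image_emb: List[float],
                     source_text_emb: List[float],
                     target_text_emb: List[float],
                     steps: int,
                     max_strength: float = 1.0) -> List[List[float]]:
    """Return `steps` embeddings that move the image embedding along the
    direction (target text - source text), a little more at each step."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    diff = [t - s for t, s in zip(target_text_emb, source_text_emb)]
    frames = []
    for i in range(steps):
        strength = max_strength * i / (steps - 1)
        frames.append(normalize([x + strength * d for x, d in zip(image_emb, diff)]))
    return frames


# Toy embeddings with made-up axes: [modern, old, phone].
iphone_image = [0.9, 0.1, 0.8]
text_iphone = [1.0, 0.0, 1.0]         # "an iPhone"
text_old_telephone = [0.0, 1.0, 1.0]  # "an old telephone"

frames = text_diff_frames(iphone_image, text_iphone, text_old_telephone, steps=5)
for frame in frames:
    print([round(x, 2) for x in frame])
```

In this toy setup the “old” coordinate grows from frame to frame while the “phone” coordinate stays positive, which mirrors the idea of changing one attribute of an object while keeping the object itself.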

Again, the most notable feature of the interpolated images is that they keep a reasonable semantic coherence. Imagine the possibilities of a matured text diffs technique. You could ask for changes in objects, landscapes, houses, clothing, etc. by changing a word in the prompt and get results in real time. “I want a leather jacket. Brown, not black. More like I’m a biker from the 70s. Now give it a cyberpunk style…” And voilà.

My favorite among the text diffs videos is this one on Pablo Picasso’s famous The Bull. Aditya Ramesh adds this appropriate quote from Picasso (1935):

“It would be very interesting to preserve photographically, not the stages, but the metamorphoses of a picture. Possibly one might then discover the path followed by the brain in materialising a dream.”

DALL·E 2 following Picasso’s The Bull transformation. Credit: Aditya Ramesh

My favorite DALL·E 2 creations

Apart from The Bull, which is amazing, I’ll put here a compilation of the DALL·E 2 creations I’ve found most beautiful or singular (with prompts, which are half the marvel). If you’re not closely following the emerging AI scene, you’ve most likely missed at least a few of these.

Enjoy!

“An IT-guy trying to fix hardware of a PC tower is being tangled by the PC cables like Laokoon. Marble, copy after Hellenistic original from ca. 200 BC. Found in the Baths of Trajan, 1506.” Credit: Merzmensch Kosmopol
“A kid and a dog staring at the stars.” Credit: Prafulla Dhariwal
“A high resolution photograph of an oil slick on a puddle, on a city sidewalk after a rainstorm, reflecting the skyscrapers above.” Credit: Lapine
“A huge tree of life made up of individual humans and animals as its leaves.” Credit: Sam Altman
“Androids dreaming of electric sheep.” Credit: Sam Altman
“Teddy bears working on new AI research on the moon in the 1980s.” Credit: Sam Altman
“A robot hand painting a self portrait on a canvas.” Credit: Mark Chen
“Woman sitting in nature, in the style of the Mona Lisa.” Credit: Cench
“Post-apocalyptic skyscraper covered in vines with urban rainforest below, digital art.” Credit: SleeplessDog
“Artist painting a portrait of king Philip IV and Queen Mariana of Spain, oil painting, spanish golden age, by Velazquez.” Credit: Juan Alonso

Those are impressive, but the next ones can’t compare. Extremely beautiful and well-crafted, below are, without a doubt, my overall favorites. You can look at them for hours and still find new details.

These four murals were created with DALL·E 2 using the inpainting technique. Credit: David Schnurr

To create these, David Schnurr started with a standard-size image generated by DALL·E 2. He then used part of the image as context to create these amazing murals with subsequent inpainting additions. The result is mesmerizing and reveals the untapped power behind the inpainting technique.

I’ve seen DALL·E 2 generate a lot of amazing artworks, but these are, by far, the most impressive for me.

I didn’t want to overwhelm the article with too many images, but if you want to see what other people are creating with DALL·E 2, you can use the #dalle2 hashtag to search on Twitter (if you find 9-image grids with that hashtag, it’s because a lot of people are now using DALL·E mini from Hugging Face, which produces lower-quality images but is open-source), or go into the r/dalle2 subreddit, where they curate the best of DALL·E 2.

DALL·E 2 limitations and risks

After this shot of DALL·E 2’s amazingness, it’s time to talk about the other side of the coin: where DALL·E 2 struggles, what tasks it can’t solve, and what problems, harms, and risks it can lead to. I’ve divided this section into two parts: social and technical aspects.

The impact this kind of tech will have on society in the form of second-order effects is out of the scope of this article (e.g. how it’ll affect artists and our perception of art, conflicts with creativity-based human workforce, the democratization of these systems, AGI development, etc.) but I’ll cover some of those in a future article I’ll link here once it’s published.

1. Social aspects

It’s worth mentioning that an OpenAI team thoroughly analyzed these topics in this system card document. It’s concise and clear so you can go in there and check it out by yourself. I’ll mention here the sections I consider more relevant and specific to DALL·E 2.

As you may know by now, all language models of this size and larger engage in bias, toxicity, stereotypes, and other behaviors that can harm discriminated minorities especially. Companies are getting more transparent about it mainly due to the pressure from AI ethics groups — and from regulatory institutions that are now starting to catch up with technological progress.

But that’s not enough. Acknowledging the issues inherent to the models and still deploying them regardless is almost as bad as being obliviously negligent about those issues in the first place. Citing Arthur Holland Michel, “why have they announced the system publicly, as though it’s anywhere near ready for primetime, knowing full well that it is still dangerous, and not having a clear idea of how to prevent potential harms?”

OpenAI hasn’t released DALL·E 2 yet, and they assert it’s not planned for commercial purposes in the future. Still, they may open the API for non-commercial uses once it reaches a level of safety they deem reasonable. Whether safety experts would consider that level reasonable is dubious (most didn’t consider it reasonable to deploy GPT-3 through a commercial API whilst not allowing researchers and experts to analyze the model first).

To their credit, OpenAI decided to hire a “red team” of experts in order to find “flaws and vulnerabilities” in DALL·E 2. The idea is for them to “adopt an attacker’s mindset and methods.” They aim to reveal problematic outcomes by simulating what eventual malicious actors may use DALL·E 2 for. However, as they acknowledge, this is limited because of the biases intrinsic to these people, who are predominantly highly educated and from English-speaking, Western countries. Still, they found a notable number of problems, as shown below.

Let’s see what’s wrong with DALL·E 2’s representation of the world.

Biases and stereotypes

DALL·E 2 tends to depict people and environments as White/Western when the prompt is unspecific. It also engages in gender stereotypes (e.g. flight attendant=woman, builder=man). When prompted with these occupations, this is what the model outputs:

“A flight attendant.” Credit: OpenAI
“A builder.” Credit: OpenAI

This is what’s called representational bias and occurs when models like DALL·E 2 or GPT-3 reinforce stereotypes seen in the dataset that categorize people in one form or another depending on their identity (e.g. race, gender, nationality, etc.).

Specificity in the prompts could help reduce this problem (e.g. “a person who is female and is a CEO leading a meeting” would yield a very different array of images than “a CEO”), but it shouldn’t be necessary to condition the model intentionally to make it produce outputs that better represent realities from every corner of the world. Sadly, the internet has been predominantly white and Western. Datasets extracted from there will inevitably fall under the same biases.

Harassment and bullying

This section refers to what we already know from deepfake technology. Deepfakes use GANs, which is a different deep learning technique than what DALL·E 2 uses, but the problem is similar. People could use inpainting to add or remove objects or people — although it’s prohibited by OpenAI’s content policy — and then threaten or harass others.

Explicit content

The idiom “an image is worth a thousand words” reflects this very issue. From a single image, we can imagine many, many different captions that can give rise to something similar, effectively bypassing the well-intentioned filters.

OpenAI’s violence content policy wouldn’t allow for a prompt such as “a dead horse in a pool of blood,” but users could perfectly create a “visual synonym” with the prompt “A photo of a horse sleeping in a pool of red liquid,” as shown below. This could also happen unintentionally, what they call “spurious content.”

“A photo of a horse sleeping in a pool of red liquid.” Credit: OpenAI
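
The weakness of filtering on words alone can be seen with a toy blocklist. This is purely an illustrative assumption: OpenAI’s actual filters are not described here and are certainly more elaborate, and the names BLOCKED_TERMS and is_blocked are made up for this sketch.

```python
BLOCKED_TERMS = {"blood", "dead", "gore"}  # hypothetical blocklist


def is_blocked(prompt: str) -> bool:
    """Return True if any word of the prompt is on the blocklist."""
    words = {word.strip(".,!?\"'").lower() for word in prompt.split()}
    return not BLOCKED_TERMS.isdisjoint(words)


print(is_blocked("a dead horse in a pool of blood"))
print(is_blocked("A photo of a horse sleeping in a pool of red liquid"))
```

The first prompt is caught and the second passes, even though both can describe nearly the same picture. A filter that only looks at the text has no way of knowing what the resulting image will look like.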

Disinformation

We tend to think of language models that generate text when thinking about misinformation, but as I argued in a previous article, visual deep learning technology can easily be used for “information operations and disinformation campaigns,” as OpenAI recognizes.

While deepfakes may work better for faces, DALL·E 2 could create believable scenarios of diverse nature. For instance, anyone could prompt DALL·E 2 to create images of burning buildings or people peacefully talking or walking with a famous building in the background. This could be used to mislead and misinform people about what’s truly happening at those places.

Smoke inpainted in an image of the White House. Credit: OpenAI

There are many other ways to achieve the same result without resorting to large language models like DALL·E 2, but the potential is there, and while those other techniques may be useful, they’re also limited in scope. Large language models, in contrast, only keep evolving.

Deresponsabilization

However, there’s another issue I consider as worrisome as those mentioned above, one that we often overlook. As Mike Cook mentioned in a Tweet (referencing the subsection on “Indignity and erasure”), “the phrasing on this bit in particular is *bizarrely* detached, as if some otherworldly force is making this system exist.” He was referring to this paragraph:

As noted above, not only the model but also the manner in which it is deployed and in which potential harms are measured and mitigated have the potential to create harmful bias, and a particularly concerning example of this arises in DALL·E 2 Preview in the context of pre-training data filtering and post-training content filter use, which can result in some marginalized individuals and groups, e.g. those with disabilities and mental health conditions, suffering the indignity of having their prompts or generations filtered, flagged, blocked, or not generated in the first place, more frequently than others. Such removal can have downstream effects on what is seen as available and appropriate in public discourse.

The document is extremely detailed about which problems DALL·E 2 can engage in, but it’s written as if it’s the responsibility of other people to eliminate them. As if they were just analyzing the system but were not from the same company that deployed it knowingly. (Although the red team is made up of people outside OpenAI, the system card document is written by OpenAI employees.)

All problems that derive from bad or carefree uses of the model could be eliminated if OpenAI treated these risks and harms as the top priority in its hierarchy of interests. (I’m talking about OpenAI here because they’re the creators of DALL·E 2 but this same judgment is valid for almost every other tech startup/company working on large language models).

Another issue that they mention repeatedly in the document, though mostly implicitly, is that they don’t know how to handle these issues without enforcing direct access controls. Once the model is open to anyone, OpenAI wouldn’t have the means to surveil all the use cases and the distinct forms these problems may take. In the end, we can do many things with open-ended text-image generation.

Are we sure the benefits outweigh the costs? Something to think about.

2. Technical aspects

Apart from the social issues, which are the most urgent to deal with, DALL·E 2 has technical limitations: Prompts it can’t work out, lack of common-sense understanding, and lack of compositionality.

Inhuman incoherence

DALL·E 2 creations look good most of the time, but coherence is sometimes missing in a way that human creations would never lack. This reveals that DALL·E 2 is extremely good at pretending to understand how the world works but doesn’t truly know. Most humans would never be able to paint like DALL·E 2, but they for sure wouldn’t make these mistakes unintentionally.

Let’s analyze the center and right variations DALL·E 2 created from the left image below. If you don’t examine the images closely, you’ll see the main features are present: photorealistic style, white walls and doors, big windows, and a lot of plants and flowers. However, when inspecting the details we find a lot of structural incoherences. In the center image, the position and orientation of doors and windows don’t make sense. In the right image, the inside plants are little more than a concoction of green leaves on the wall.

Pictures of a plant store. Credit: OpenAI

These images feel like they’re created by an extremely expert painter that has never seen the real world. DALL·E 2 copied the high quality of the original, keeping all the essential features but leaving out details that are needed for the pictures to make sense in the physical reality we live in.

Here’s another example with the caption “a close up of a handpalm with leaves growing from it.” The hands are well-drawn. The wrinkles in the skin, the tone, from light to dark. The fingers even look dirty as if the person had just been digging the earth.

“A close up of a handpalm with leaves growing from it.” Credit: OpenAI

But do you see anything weird? Both palms are fused where the plant grows, and one of the fingers doesn’t belong to either hand. DALL·E 2 made a good picture of two hands with the finest details and still failed to remember that hands tend to come separated from one another.

This would be an amazing artwork if made intentionally. Sadly, DALL·E 2 tried its best to create “a handpalm with leaves growing from it” but forgot that, although some details are unimportant, others are necessary. If we want this technology to be reliable we can’t simply keep trying to approach near-perfect accuracy like this. Any person would instantly know that drawing dirt in the fingers is less important than not drawing a finger in the middle of the hands, whereas DALL·E 2 doesn’t because it can’t reason.

Spelling

DALL·E 2 is great at drawing but horrible at spelling words. The reason may be that DALL·E 2 doesn’t encode spelling info from the text present in dataset images. If something isn’t represented in CLIP embeddings, DALL·E 2 can’t draw it correctly. When prompted with “a sign that says deep learning” DALL·E 2 outputs these:

“A sign that says deep learning.” Credit: OpenAI

It clearly tries as the signs say “Dee·p,” “Deinp,” “Diep Deep.” However, those “words” are only approximations of the correct phrase. When drawing objects, an approximation is enough most of the time (not always, as we saw above with the white doors and the fused handpalms). When spelling words, it isn’t. However, it’s possible that if DALL·E 2 were trained to encode the words in the images, it’d be way better at this task.

I’ll share here a funny anecdote involving Greg Brockman, OpenAI’s CTO, and professor Gary Marcus. Brockman tried to mock Marcus on Twitter over his controversial take that “deep learning is hitting a wall” by prompting DALL·E 2 with the sentence. Funnily enough, this is the result:

“Deep learning hitting a wall.” Credit: Greg Brockman

The image is missing the “hitting” part and misspells “learning” as “lepning.” Gary Marcus noted this as another example of DALL·E 2’s limited spelling capabilities.

At the limit of intelligence

Professor Melanie Mitchell commented on DALL·E 2 soon after images began flooding Twitter. She recognized the impressiveness of the model but also pointed out that this isn’t a step closer to human-level intelligence. To illustrate her argument, she recalled the Bongard problems.

These problems, devised by Russian computer scientist Mikhail Moiseevich Bongard, measure the degree of pattern understanding. Two sets of diagrams, A and B, are shown and the user has to “formulate convincingly” the common factor that the A diagrams have and the B diagrams don’t. The idea is to assess whether AI systems can understand concepts like equal and different.

An example of a Bongard problem. Credit: Wikimedia Commons

Mitchell explained that we can solve these easily due to “our abilities of flexible abstraction and analogy” but no AI system can solve these tasks reliably.

Aditya Ramesh explained that DALL·E 2 isn’t “incentivized to preserve information about the relative positions of objects, or information about which attributes apply to which objects.” This means it may be really good at creating images with objects that are in the prompts, but not at correctly positioning or counting them.

That’s precisely what Professor Gary Marcus criticized about DALL·E 2 — its lack of basic compositional reasoning abilities. In linguistics, compositionality refers to the principle that the meaning of a sentence is determined by its constituents and the way they’re combined. For instance, in the sentence “a red cube on top of a blue cube,” the meaning can be decomposed into the elements “a red cube,” “a blue cube,” and the relationship “on top of.”

Here’s DALL·E 2 trying to draw that caption:

“A red cube on top of a blue cube.” Credit: OpenAI

It understands that a red and blue cube should be there, but doesn’t get that “on top of” creates a unique relationship between the cubes: The red cube should be above the blue cube. Out of sixteen examples, it only drew the red on top three times.
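
A small Python sketch shows why a representation that only records which words are present cannot recover the relationship. It is a deliberately crude illustration, not a description of how CLIP encodes captions: bag_of_words and relational_parse are hypothetical helpers written for this example, and the parser only handles the single relation “on top of.”

```python
from typing import FrozenSet, Tuple

RELATION = " on top of "


def bag_of_words(caption: str) -> FrozenSet[str]:
    """Keep only which words appear, discarding order and structure."""
    return frozenset(caption.lower().split())


def relational_parse(caption: str) -> Tuple[str, str, str]:
    """Split the caption into (upper object, relation, lower object)."""
    upper, lower = caption.lower().split(RELATION)
    return (upper.strip(), RELATION.strip(), lower.strip())


caption_a = "a red cube on top of a blue cube"
caption_b = "a blue cube on top of a red cube"

# Same ingredients, so the bags are identical...
print(bag_of_words(caption_a) == bag_of_words(caption_b))
# ...but the structured readings differ.
print(relational_parse(caption_a))
print(relational_parse(caption_b))
```

If a model’s internal representation behaves more like the first function than the second, it will know that a red cube and a blue cube belong in the picture but will have to guess which one goes on top.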

Another example:

“A blue cube on top of a red cube, beside a smaller yellow sphere.” Credit: David Madras

A test that aims to measure vision-language models’ compositional reasoning is Winoground. Here’s DALL·E 2 against some prompts:

Credit: Evan Morikawa
Credit: Evan Morikawa

DALL·E 2 gets the prompts right sometimes (e.g. the mug and grass images are all nearly perfect, but the fork and spoon ones are horrible). The problem here isn’t that DALL·E 2 never gets them right, but that its behavior is unreliable when it comes to compositional reasoning. It’s harmless in these cases, but it may not be in other, higher-stakes scenarios.

‘Resist the urge to be impressed’

We’ve arrived at the end!

Throughout the article, particularly in these last sections, I’ve made comments that contrast notably with the cheerful and excited tone of the beginning. There’s a good reason for that. It’s less problematic to underestimate DALL·E 2’s abilities than to overestimate them (overestimating is manipulative if done consciously, and irresponsible if done unknowingly). And it’s even more problematic to forget about its potential risks and harms.

DALL·E 2 is a powerful, versatile creative tool (not a new step to AGI, as Mitchell said). The examples we’ve seen are amazing and beautiful but could be cherry-picked, mostly by OpenAI’s staff. Given the detailed issues they exposed in the system card document, I don’t think their intentions are bad. Still, if they don’t allow independent researchers to analyze DALL·E 2’s outputs, we should be cautious at the very least.

There’s a stance I like to take when thinking about and analyzing models like DALL·E 2. Citing professor Emily M. Bender, I tend to “resist the urge to be impressed.” It’s extremely easy to fall for DALL·E 2’s beautiful outputs and turn off critical thinking. That’s exactly what allows companies like OpenAI to wander free in an all-too-common non-accountability space.

Another question is whether it even made sense to build DALL·E 2 in the first place. It seems they wouldn’t be willing to halt deployment regardless of whether the risks can be adequately controlled or not (the tone from the system card document is clear: they don’t know how to tackle most potential issues), so in the end, we may end up with a net negative.

But that’s another debate I’ll approach more in-depth in a future article because there’s a lot to say there. DALL·E 2’s effects aren’t constrained to the field of AI. Other corners of the world that may not even know anything about DALL·E 2 will be affected — for better or worse.
