This AI Can Create Video From Text Prompt

Text-to-image and image-to-text generator tools are already gaining steam. How about text-to-video?

Photo by Elena Mozhvilo on Unsplash

It’s only been a few months since the release of the revolutionary text-to-image AI generators, DALL-E 2 and Midjourney.

How about video, though?

Apparently, there’s a brand new (and probably the first) open-source, large-scale, pre-trained text-to-video model called CogVideo.

Simply put, it’s an AI tool that can produce videos without any actual filming!

Let’s talk more about it:

  • What is CogVideo?
  • How does it work?
  • What are the current limitations?
  • What’s next?

What Is CogVideo?

Straight from their demo website, this is how CogVideo is described:

CogVideo is the largest pre-trained transformer for text-to-video generation in the general domain, with 9.4 billion parameters.

It adopts a multi-frame-rate hierarchical training technique and elegantly and effectively refines a pre-trained text-to-image generative model (CogView2) for text-to-video generation.

That’s quite a mouthful, but check out this collaged demo from their official GitHub repo.

Screenshot by Jim Clyde Monge from the CogVideo GitHub repo

Pretty awesome, right? The videos look like they’re taken straight out of a TV commercial.

How It Works

Here’s the multi-frame-rate hierarchical generation framework in CogVideo.

CogVideo methodology

The input sequence includes the frame rate, the text, and the frame tokens. A separator token, inherited from CogView2, marks the start of each frame.

Stage 1: Frames are generated sequentially, conditioned on the frame rate and the text.

Stage 2: Generated frames are re-input as bidirectional attention regions to recursively interpolate frames. The frame rate can be adjusted during both stages. Bidirectional attention regions are highlighted in blue, and unidirectional regions are highlighted in green.
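
To make the two-stage idea concrete, here is a minimal Python sketch of the hierarchical loop. The `generate_keyframes` and `interpolate_between` helpers are hypothetical placeholders for the model’s autoregressive and bidirectional-attention passes, not CogVideo’s actual code.

```python
import numpy as np

# Hypothetical stand-ins for CogVideo's real model passes (not its actual API):
# an autoregressive keyframe generator and a bidirectional-attention interpolator.
def generate_keyframes(prompt, frame_rate, num_frames):
    return [np.zeros((480, 480, 3)) for _ in range(num_frames)]  # placeholder frames

def interpolate_between(left, right, prompt):
    return (left + right) / 2  # placeholder "in-between" frame

def generate_video(prompt, low_fps=1, target_fps=8, seconds=4):
    # Stage 1: generate a sparse sequence of keyframes, conditioned on the
    # text prompt and a low frame rate.
    frames = generate_keyframes(prompt, frame_rate=low_fps,
                                num_frames=seconds * low_fps + 1)

    # Stage 2: recursively fill in frames between neighbouring ones,
    # doubling the effective frame rate until the target is reached.
    fps = low_fps
    while fps < target_fps:
        filled = []
        for left, right in zip(frames[:-1], frames[1:]):
            filled.extend([left, interpolate_between(left, right, prompt)])
        filled.append(frames[-1])
        frames, fps = filled, fps * 2
    return frames

clip = generate_video("A cat is playing chess")
print(len(clip), "frames")
```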

There’s a simple web app…

The web application you can use for testing is hosted on Hugging Face Spaces, Hugging Face’s library of machine-learning demo apps.

The user interface is rather straightforward; it consists of a “Run” button, a “Seed” slider control, and an “Input Text” field where you enter a text description.

That’s it. Here’s a screenshot from the web tool with a sample prompt of a cat playing chess.

CogVideo text-to-video web tool
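
If you’d rather drive the demo from a script than click through the UI, something along these lines could work. This is only a sketch: the Space identifier ("THUDM/CogVideo") and the argument order (prompt first, then seed) are assumptions about how the demo is wired, not documented parameters.

```python
# Sketch: calling a Hugging Face Space programmatically via gradio_client.
# The Space name and predict() arguments are assumptions, not the documented
# CogVideo demo interface -- adjust them to match the actual Space.
from gradio_client import Client

client = Client("THUDM/CogVideo")   # assumed Space identifier
result = client.predict(
    "A cat is playing chess",       # the "Input Text" field
    -1,                             # the "Seed" slider (-1 = random)
)
print(result)                       # usually a file path to the generated video
```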

What is “Seed”?

The seed gives the random number generator a starting point. The default of -1 tells the tool to pick a random seed, so even with identical settings, the output will differ on every run. Entering a specific number lets the generator reproduce earlier results.
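
Here’s a tiny, generic illustration of that behaviour (assuming a PyTorch-style random sampler; this isn’t CogVideo’s code):

```python
import torch

def sample(seed: int):
    if seed == -1:
        torch.seed()             # -1: draw and set a fresh random seed each call
    else:
        torch.manual_seed(seed)  # fixed seed: same starting point every call
    return torch.randn(3)        # stands in for the model's random sampling

print(sample(42))   # same values every run
print(sample(42))   # identical to the line above
print(sample(-1))   # different on every run
```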

Okay, if you just want to be impressed and experiment with various text prompts, CogVideo has released another demo web app, which you can access here.

Prompt: A smiling woman wearing a red dress.
CogVideo demo web app

Here’s the result in action.

A smiling woman wearing a red dress by CogVideo

Ain’t that impressive? It’s a hyperrealistic video of a smiling woman in a red dress.

Current Limitations

Even though CogVideo’s most recent progress is already very impressive, there are still many obstacles to overcome:

  • The AI model can only generate videos at a resolution of 480×480 pixels, with a duration of 4 seconds and a frame rate of 8 fps.
  • Since the model has 9.4 billion parameters, training it from scratch would be prohibitively expensive in terms of computing.
  • It’s still quite young. The model struggles to grasp complex movement semantics because text-video datasets are scarce and often weakly relevant. So far, the largest annotated text-video dataset contains only 41,250 videos.
  • The model accepts only Chinese as input, so English prompts have to be translated into Simplified Chinese before they are fed in (see the sketch after this list).
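
Because of that last limitation, an English prompt needs a translation step first. Here’s a minimal sketch using a public English-to-Chinese model from Hugging Face; the choice of model (Helsinki-NLP/opus-mt-en-zh) is mine for illustration, not something the CogVideo demo prescribes.

```python
# Sketch: translate an English prompt into Chinese before feeding it to CogVideo.
# Helsinki-NLP/opus-mt-en-zh is just one public option, used here for illustration.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

prompt_en = "A cat is playing chess"
prompt_zh = translator(prompt_en)[0]["translation_text"]
print(prompt_zh)  # paste this Chinese text into the CogVideo prompt field
```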

If you want to try it yourself, expect to wait a long time (around an hour) for the video to be generated, because the container is 63 GB and runs on an NVIDIA A100 GPU.

What’s Next?

CogVideo is still in its infancy and the videos it generates are quite short, but the potential for this technology is huge.

For one, it could create more realistic and lifelike character animations for movies and video games.

Additionally, it could be used to create educational videos or to automatically generate video content from text articles.

In a few years, this will allow people to create videos from text, with no need for filming or editing. The implications are massive — this could change the way we create and consume video content forever.

Final Thoughts

Overall, CogVideo has the potential to be a powerful tool for businesses that want to create videos without incurring high production costs. As the technology develops, it will be interesting to see how well it performs and what other applications it may be used for.

But one thing is for sure: AI video generators are here, they’re about to change the video landscape, and I can’t wait to see what’s next.
