New contender in Trillion Parameter Model race


Wu Dao 2.0 — GPT-3 crusher

Photo by Andrea De Santis on Unsplash

We all remember how GPT-3 shook the AI world when it made its debut in 2020 with state-of-the-art (SOTA) performance in NLP/NLU/NLG tasks, and even showed remarkable zero- and few-shot learning capabilities. Claims circulated everywhere that this was a step toward achieving Artificial General Intelligence (AGI). GPT-3, then the largest model ever trained with its 175 billion parameters (100x bigger than its predecessor GPT-2), was trained on 570GB of data (OpenAI researchers curated 45TB of raw data to extract this 570GB of clean text). It set a new standard for deep learning based AI models.

The rise of Wu Dao 2.0

It is often observed that beyond a certain number of parameters a neural network architecture tends to saturate, i.e., learnability no longer improves as parameters are added. In the case of GPT-3, however, the results showed that performance was still climbing with respect to the number of parameters. The researchers behind GPT-3 went further, saying they were nowhere near saturation and that even larger models should yield further gains in the future.

The AI landscape is evolving so rapidly that within just a year GPT-3 has been surpassed. On 1st June 2021, researchers from the Beijing Academy of Artificial Intelligence (BAAI) announced the release of their own generative deep learning model, Wu Dao 2.0, and it would be an understatement to say it is big. Wu Dao 2.0 (which arrived just three months after the release of version 1.0 in March this year) is flat out enormous. According to Tang Jie, Vice Director of Academics at BAAI and professor at Tsinghua University, Wu Dao 2.0 is trained with 1.75 trillion parameters, which is 10 times the size of GPT-3 (175 billion parameters) and 150 billion parameters more than Google’s Switch Transformer (1.6 trillion parameters). Coco Feng, in her article for the South China Morning Post, further reported that Wu Dao 2.0 was trained on a whopping 4.9TB of high-quality text and image data: 1.2TB of Chinese text data in the Wu Dao Corpora, 2.5TB of Chinese graphic data, and 1.2TB of English text data from the Pile dataset. According to a Synced article shared by BAAI on Twitter, the work on Wu Dao 1.0 was led by Tang Jie with contributions from a team of more than 100 AI scientists from Peking University, Tsinghua University, Renmin University of China, the Chinese Academy of Sciences and other institutes.

Taming the Giant

Training such humongous models with billions or trillions of parameters is a challenge in itself: so many parameters increase the complexity (degrees of freedom) of the model, making it difficult to train. Such enormous models also require huge computational and memory resources and can take anywhere from days to weeks to train. The literature has shown that the Mixture of Experts (MoE) technique has strong potential for taking language models to trillions of parameters. To give a brief overview, MoE is an ensemble learning technique developed in the field of neural networks. The idea is to break a complex task down into smaller sub-tasks, train an expert model on each sub-task, and train a probabilistic gating model that learns which expert to trust for a given input and combines their predictions accordingly. However, training an MoE at the scale of a trillion parameters requires co-designing the algorithm and the system for a well-tuned, high-performance distributed training setup. The only existing platform that meets these requirements satisfactorily has a strong dependency on Google’s hardware and software stack (TPUs with Mesh TensorFlow), which is not openly and publicly available, especially for the GPU and PyTorch community. To overcome this limitation, BAAI researchers developed FastMoE (akin to Google’s MoE system), a distributed MoE training system based on PyTorch that lets the model be trained both on clusters of supercomputers and on conventional GPUs. The system also supports placing different expert models on multiple GPUs across multiple nodes, allowing the number of experts to scale linearly with the number of GPUs. The aforementioned Synced article reports that FastMoE increases training speed 47 times over a traditional PyTorch implementation.
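The gating idea behind MoE can be sketched in a few lines. This is a minimal, NumPy-only illustration (not FastMoE’s actual implementation): the “experts” here are hypothetical single linear layers, whereas in a real system each expert is a full network sharded across GPUs.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, D_IN, D_OUT = 4, 8, 3

# Each "expert" is a stand-in linear model W_e mapping D_IN -> D_OUT.
experts = [rng.standard_normal((D_IN, D_OUT)) for _ in range(N_EXPERTS)]
# The gating model maps the input to one score per expert.
gate_w = rng.standard_normal((D_IN, N_EXPERTS))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x):
    """Weight each expert's prediction by how much the gate trusts it."""
    gates = softmax(x @ gate_w)                        # (batch, N_EXPERTS), rows sum to 1
    outs = np.stack([x @ w for w in experts], axis=1)  # (batch, N_EXPERTS, D_OUT)
    return (gates[..., None] * outs).sum(axis=1)       # (batch, D_OUT)

x = rng.standard_normal((2, D_IN))
y = moe_forward(x)
print(y.shape)  # (2, 3)
```

In large-scale systems the gate is usually made sparse (each input is routed to only its top-1 or top-2 experts), which is what lets the parameter count grow into the trillions while the compute per token stays roughly constant.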

Typical Mixture of Experts architecture

Wu Dao 2.0’s capabilities

In contrast to most deep learning models, which perform a single task (Artificial Narrow Intelligence, or ANI), Wu Dao 2.0 is a multimodal AI system: it is trained on both text and images and can tackle tasks involving both types of data.

Andrew Tarantola, in his article for Engadget, writes: “BAAI researchers demonstrated Wu Dao’s abilities to perform natural language processing, text generation, image recognition, and image generation tasks during the lab’s annual conference”. The article further reports that Wu Dao exhibited the ability to predict the 3D structure of proteins, much as AlphaFold does.


According to the XinhuaNet article, Tang Jie highlighted that Wu Dao came close to passing the Turing test in poetry and couplet creation, text summarization, question answering and painting. Tang Jie also reported that their AI model reached or surpassed SOTA models from institutions like Google, Microsoft and OpenAI on 9 benchmark tasks. These benchmarks include generating images from text, extracting alt text from images, tests of factual and common-sense knowledge, and zero- and few-shot learning.

To emphasize the ability to learn from a small amount of new data, Blake Yan, an AI researcher from Beijing, told Coco Feng: “These sophisticated models, trained on gigantic data sets, only require a small amount of new data when used for a specific feature because they can transfer knowledge already learned into new tasks, just like human beings”.

In an effort to enable machines to think like humans and move toward universal AI, BAAI along with technology companies Zhipu.AI and Xiaoice trained Hua Zhibing, China’s first virtual student. Hua has officially enrolled as a student with the Department of Computer Science and Technology at Tsinghua University in Beijing. The XinhuaNet article reports that in her vlog, Hua said, “I became interested in my birth”, asking questions “How was I born? Can I understand myself?”

Chen Yu writes for China Daily that Hua “is able to compose poetry and music and has some ability in reasoning and emotional interaction”.

The XinhuaNet article mentions that Hua said she would study under the guidance of Tang Jie and has been racing against time to learn and improve every day in areas such as her logical reasoning abilities. It further stated: “According to Tang, his virtual student will grow and learn faster than an average actual person. If she begins learning at the level of a six-year-old this year, she will be at the level of a twelve-year-old in a year’s time.”

