The NLP Cypher | 07.11.21

Welcome back! Hope you had a great week. We have a new leader on the SuperGLUE benchmark: a new Ernie model from Baidu comprising 10 billion parameters, trained on a 4TB corpus. FYI, the human baseline was already beaten by Microsoft’s DeBERTa model at the beginning of the year… time for a new SuperSuperGLUE benchmark???


The Codex Paper

BTW, if you are still interested in GitHub’s Copilot, I stumbled upon the Codex paper this week:


DeepMind’s Perceiver

DeepMind’s Perceiver is a transformer that can take a variety of modalities (vision, audio, text) as input and achieve competitive benchmark performance. Usually a model architecture is specialized to a specific domain; what the Perceiver attempts is to generalize to any domain using a single architecture. 😎
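The trick that makes this modality-agnostic is cross-attention from a small learned latent array onto the raw input array. A minimal NumPy sketch of that step, with random matrices standing in for learned weights (shapes and names are illustrative, not DeepMind's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, d_k=32, seed=0):
    """One Perceiver-style cross-attention step: a small latent array
    queries the raw input array, so the cost scales with
    num_latents * num_inputs rather than num_inputs ** 2."""
    rng = np.random.default_rng(seed)
    d_lat, d_in = latents.shape[1], inputs.shape[1]
    # Random matrices stand in for learned projection weights.
    Wq = rng.normal(size=(d_lat, d_k))
    Wk = rng.normal(size=(d_in, d_k))
    Wv = rng.normal(size=(d_in, d_lat))
    Q, K, V = latents @ Wq, inputs @ Wk, inputs @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))
    return attn @ V  # (num_latents, d_lat): fixed size, whatever the input length

rng = np.random.default_rng(1)
latents = rng.normal(size=(64, 128))    # small latent bottleneck
inputs = rng.normal(size=(10_000, 16))  # large raw input (pixels, audio frames, tokens)
out = cross_attend(latents, inputs)
print(out.shape)  # (64, 128)
```

Because the output always has the latent array's fixed shape, the same downstream transformer stack can sit on top regardless of whether the input was an image, a waveform, or text.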


The Long-Short Transformer

Adding to the list of efficient transformers comes the LS-Transformer, which can be used for both autoregressive and bidirectional models, and in both the language and vision domains. The model obtains SOTA results on the Long Range Arena, char-level language modeling, and ImageNet classification.
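The "short" half of the long-short attention is a standard sliding-window pattern; here is a minimal sketch of that mask (the "long" half, a low-rank dynamic projection over the full sequence, is omitted, so this is only part of the picture):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask for local (short-term) attention: position i may
    attend only to positions within `window` steps on either side,
    giving O(seq_len * window) cost instead of O(seq_len ** 2)."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(8, 2)
print(mask[4].astype(int))  # position 4 attends to positions 2..6
```

In the LS-Transformer this local attention is fused with the long-range projected attention inside a single attention layer; the sketch above only shows how the short-range sparsity is defined.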



Deep Learning Videos

170 video lectures from Sebastian Raschka’s 2021 deep learning course, using PyTorch.

Table of Contents

Python Deep Learning Notebooks

Jupyter notebooks implementing the code samples found in the book Deep Learning with Python, 2nd Edition.

Hugging Face’s Model Parallelism Intro

A conceptual intro to model parallelism touching on several techniques, highlighted below. HF also notes which of these techniques are currently implemented in their library.

  1. DataParallel (DP) — the same setup is replicated multiple times, and each replica is fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step.
  2. TensorParallel (TP) — each tensor is split into multiple chunks, so instead of the whole tensor residing on a single GPU, each shard of the tensor resides on its designated GPU. During processing each shard is processed separately and in parallel on different GPUs, and the results are synced at the end of the step. This is what one may call horizontal parallelism, as the splitting happens on a horizontal level.
  3. PipelineParallel (PP) — the model is split vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are placed on a single GPU. Each GPU processes a different stage of the pipeline in parallel, working on a small chunk of the batch.
  4. Zero Redundancy Optimizer (ZeRO) — also performs sharding of the tensors, somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, so the model doesn’t need to be modified. It also supports various offloading techniques to compensate for limited GPU memory.
  5. Sharded DDP — another name for the foundational ZeRO concept, as used by various other implementations of ZeRO.
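The first technique can be sketched as a toy NumPy simulation (illustrative names, not the PyTorch API): split the batch across "replicas", compute each replica's gradient on its shard, then synchronize by averaging, which is what the all-reduce at the end of a DP step does:

```python
import numpy as np

def dp_step(weights, batch_x, batch_y, n_replicas=4, lr=0.1):
    """Toy DataParallel step for linear regression: each replica holds a
    full copy of the weights, sees only its shard of the batch, and all
    replicas apply the same averaged gradient."""
    shards_x = np.array_split(batch_x, n_replicas)
    shards_y = np.array_split(batch_y, n_replicas)
    grads = []
    for xs, ys in zip(shards_x, shards_y):
        err = xs @ weights - ys                  # forward pass on this shard
        grads.append(2 * xs.T @ err / len(xs))   # local gradient
    grad = np.mean(grads, axis=0)                # "all-reduce": average gradients
    return weights - lr * grad                   # identical update on every replica

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(64, 2))
y = X @ w_true
w = np.zeros(2)
for _ in range(200):
    w = dp_step(w, X, y)
print(np.round(w, 2))  # converges toward w_true = [2., -1.]
```

Averaging the shard gradients is equivalent to computing the gradient over the whole batch, which is why the replicas stay in sync without ever exchanging data samples.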


Faster Inference in Haystack’s QA System

The trick here is reducing the ‘top_k_retriever’ parameter, which controls how many retrieved documents the reader model evaluates.
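A toy retriever/reader loop (hypothetical names, not the Haystack API) shows why this works: the expensive reader pass only touches the top_k_retriever candidates, so its cost scales linearly with that value, at the risk of dropping the document that holds the answer.

```python
def overlap(query, doc):
    """Cheap retriever score: word overlap between query and document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def answer(query, corpus, top_k_retriever=10):
    # Retriever: rank all documents with a cheap score, keep the top k.
    ranked = sorted(corpus, key=lambda d: overlap(query, d), reverse=True)
    candidates = ranked[:top_k_retriever]  # the reader sees only these
    # Reader: in a real QA system this is the expensive model pass,
    # so fewer candidates means faster inference.
    return max(candidates, key=lambda d: overlap(query, d))

corpus = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Eiffel Tower is in Paris",
]
print(answer("capital of France", corpus, top_k_retriever=1))
# → "Paris is the capital of France"
```

With top_k_retriever=1 the reader evaluates a single document; pushing it back up trades speed for recall.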

Common Errors in Training Data

Blog post reviewing three situations where your data goes wrong:

  1. Labeling Errors
  2. Unbalanced Training Data
  3. Bias in Labeling Process
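For the second item, a quick standard-library check can flag skew before you train (the 9:1 label counts here are made up for illustration):

```python
from collections import Counter

def class_balance(labels):
    """Report class frequencies and a simple imbalance ratio
    (majority count / minority count) to flag skewed training data."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

labels = ["pos"] * 900 + ["neg"] * 100
counts, ratio = class_balance(labels)
print(counts, ratio)  # Counter({'pos': 900, 'neg': 100}) 9.0
```

A ratio far above 1 suggests resampling, class weighting, or collecting more minority-class examples before trusting accuracy numbers.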

Software Updates

spaCy 3.1:

Adapters 2.1.0:

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
