Best of Arxiv — Readings for July 2021

Original Source Here

Best of Arxiv — Readings for July 2021

A monthly selection of recent ML papers: the comeback from Mixture of Experts, attention might not be that special and what is it like to think like a transformer?

Image by author.

Staying on top of your reading list is hard, and finding which papers should be on that list can be even harder. At Zeta Alpha we’re always keeping a close eye to the latest ML research, so we’re sharing a monthly selection of recent papers to surface what we believe will be impactful publications, mostly based on each work’s contributions and the authors’ influence. Don’t take this list as comprehensive: we have our biases like everyone else, but hey there’s only so much you can choose out of 4000+ papers.

This month we bring Volunteer Computing to the forefront, more Transformers, Mixture of Experts and much more. Enjoy!

By Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin et al.

❓Why → Cars spend almost all their lifetime parked, and similarly, a big chunk of the world’s compute is standing idle most of the time. This paper shows how volunteer computing — where different parties volunteer to provide compute resources — can be used to successfully train a large Language Model.

💡Key insights → Volunteer Computing (VC) is a paradigm where various parties collaborate by providing compute resources for a common algorithm (e.g. people letting their personal computers run stuff for someone else while they’re sleeping). Though promising, VC is no free lunch: there’s many considerations and careful design decisions needed for it to work in practice stemming from the fact that you can assume little from participants: their internet can have varying speed and latencies and hardware can range from mobile chips to high-end multi GPU nodes. You want to be able to use a heterogeneous group of volunteers setting a low bar for requirements, but you also want to avoid levelling all resources with the least common denominator.

This paper does an excellent job at presenting the existing paradigms for distributed computing while going through their main trade-offs such as node communication vs. computation. Based on this analysis, the authors propose DeLOC, where peers perform training steps on microbatches independently and asynchronously, storing the weight gradients for a fraction of the network and aggregate them with a certain frequency to update the state of the whole model. Effectively this is equivalent to training the whole model on very large batches. The way nodes communicate to synchronize on the global state of the model falls somewhere in between a “parameter server” and an “all-reduce” models (i.e. one central node does the aggregation or all nodes do the aggregation by themselves). All these hyperparameters, such as how often and which nodes communicate with each other is cleverly optimized to maximize training throughput.


The implementation is done with Hivemind, a PyTorch library specialized in Volunteer Computing (which is described more in depth below).

By Carlos Riquelme, Joan Puigcerver, Basil Mustafa, et al.

❓Why → Mixture of Experts is becoming the go-to technique for scaling models to outrageous sizes: the key advantage is the possibility of increasing the model parameters while keeping the inference computational cost constant.

💡Key insights → In a nutshell, a Mixture of Experts is a model where an input is routed to different submodels at inference time: the computational cost of inference will be determnined by the used computation path, whereas the model expressive power will be determined by the total number of parameters; sort of getting the best of both worlds.

Mixture of Experts had previously shown to be effective for Language Model transformers such as Switch Transformers¹ and systems like FastMoE², but had not yet been applied at this scale to images. The model is almost identical to the original ViT³: divides the image in patches, project linearly into patch embeddings and run through a transformer as a sequence. In this case, however, regular ViT layers are interleaved with MoE ViT layers where the MLP feed forward layer is replaced by a set of k MLP experts preceded by a router that sends each image patch through a different expert depending on the value of the input. Experts and routers are all trained by gradient descent with carefully designed loss functions to incentivize variety of experts in training to avoid collapse modes such as only having one active expert.


Perhaps the most surprising result on MoE applied to huge neural networks is that learning efficiency (i.e. the amount of compute necessary to train a model to a certain performance) is significantly improved with respect to the original ViT.


Other recent work compares CNNs and Transformers for Computer Vision: VOLO: Vision Outlooker for Visual Recognition and its popular implementation on GitHub.

By Shuangfei Zhai et al.

❓Why → Joining last month’s surge of MLP-based architectures for CV, evidence keeps accumulating that there’s not that much special about attention itself. As long as a network models the interaction of its inputs in some way, gradient descent will just find its way given sufficient parameters and data.

💡Key insights → Instead of computing attention as the conventional matrix product between query and key matrices the authors propose a simple learned pairwise bias w added to the keys matrix, which is transformed through an exponential in the form of the expression below, where t is the sequence element and all products are element-wise instead of dot-products.

The computational costs are not reduced when computing the full attention-free-attention (well only in memory space), and in my opinion, the most interesting bit is the fact that this works just fine: results are not SOTA, but they’re just high enough to raise some eyebrows. The experiments are conducted both on images (with image autoregressive modelling) and text (auto-regressive language modelling), showing the versatility of the mechanism.


You might also like… Charformer: Fast Character Transformers via Gradient-based Subword Tokenization, which in a similar spirit as ByT⁵⁶, a character-level T5 Language Model performing surprisingly well published last month.

By Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov and Artem Babenko.

❓Why → Tabular data might be one of the modalities where there’s the biggest gap in interest between academic research — where it receives little attention nowadays — and industry application, where it’s ubiquitous. Perhaps because the latest and greatest from Deep Learning doesn’t do so well here…? In any case, this paper is the best head-to-head comparison approaches for tabular data I’ve seen in a long time!

💡Key insights → Gradient Boosting Decision Trees (GBDTs) have been around since 200¹⁴ and popular implementations such as XGBoost⁷ are widely used for good reason: Deep Learning methods still don’t quite outperform it robustly and across the board.

This work presents detailed numerical results on many tabular datasets and for several algorithms and importantly, without extra optimizations and tricks for the neural networks; just however they work out-of-the-box. The most insightful bit is in my opinion in the experiments they perform using synthetic data. The authors generate synthetic tabular datasets using heuristics that accomodate for either GBDTs style decision rules or a Neural Network regression. When mixed up to different degrees, the result is a spectrum of datasets where the two techniques are expected to outperform the other. In the figure below, the leftmost side is the error on a NN friendly dataset and the rightmost for a GBDT-friendly dataset. As expected, ResNet and CatBoost show a clear trade-off between the two, but the Transformer-based classifier seems to be a jack of all trades, master of none.


See also: Tabular Data: Deep Learning is Not All You Need.

By Gail Weiss, Yoav Goldberg and Eran Yahav.

❓Why → This paper is different and fun. Having new ways to think and talk about known stuff is essential for developing new ideas, and this is an excellent example. As a bonus, while the devil’s in the details, it seems like Attention is Turing-Complete⁸ (sort of-ish?).

💡Key insights → Restricted Access Sequence Processing Language (RASP) is a programming language that naturally enables expressing the computation that a transformer performs. The gist goes as follows, RASP models a transformer as an any algorithm that manipulates sequences of length n and matrices of size n x n . An input sequence can be transformed by element-wise operations and/or by a selecting and aggregating elements whose relationship is modelled by a sort of attention matrix (aggregator). And that’s pretty much it, you can solve many tasks by only using these primitives as they show in the paper, which nicely map into Transformer computations and can be compiled into each other.


One of the most interesting insights is how restricted-attention transformers (e.g. efficient transformers) can be expressed in the RASP formalism (e.g. by setting the aggregator matrix to False in certain regions) necessarily weakening its computational expressivity. The authors showcase this in synthetic tasks such as sorting, where only full transformers succeed.

Other synthetic tasks used to showcase RASP and its use to predict and understand how a Transformer performs computations are reversing a string, making histograms, double-histograms, sorting alphabetically, returning the most frequent token, and identifying Dick-k languages.

By Gaurav Menghani.

❓Why → While not presenting any novel contribution, this is an excellent introduction from an engineer’s perspective that covers relevant techniques for efficient DL which every practitioner should know.

💡Key insights → There’re many aspects to efficiency in DL. Perhaps the most important differentiation to make is that between training, where academic research spends most resources, and inference, where large scale applications spend most resources; and this survey covers both cases.

For instance, this work introduces the reader to techniques such as data augmentation, distillation or transfer learning which have a positive impact mostly on training efficiency. In parallel, other techniques explained such as quantisation or pruning are generally used to boost inference efficiency.

Other topics covered are hyperparameter optimization, efficient architectures and infrastructure considerations about frameworks like PyTorch Mobile or TensorFlow Lite which are a key ingredient to the ecosystem. One of the most useful bits of the survey are the rules of thumb and recipes recommended at the end of each section, which help ground each technique to its use-cases.


Surveys are probably the best way to know what’s happening in a research area where you’re not an expert. Here are other recent surveys that serve as an excellent entry point to those areas: Graph Neural Networks for Natural Language Processing: A Survey, A Survey of Transformers, A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: