The Power of the Dot Product in Artificial Intelligence

These basic insights can be connected to the hottest models on the market right now: large language models.

A standard task for a large language model is the translation of a sentence between two languages, say between English and German:

“I am reading an article about the importance of the dot product in AI.”

“Ich lese einen Artikel über die Bedeutung des Skalarproduktes für die KI.”

Both sentences carry approximately the same meaning but are significantly different in their representation.

The translation task can be framed as finding a nonlinear transformation that maps both sentences to approximately the same location in a latent semantic space that captures their “meaning”. The quality of the translation can then be measured by how similar the two resulting representations are.

If “measuring similarity” doesn’t make your spidey senses tingle by now, I haven’t done a good job with this article.
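To make the idea concrete, here is a minimal sketch of measuring similarity between latent vectors with a normalized dot product (cosine similarity). The three vectors are invented toy values, not the output of any real model, and a real latent space would have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of a and b after scaling each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical 3-dimensional "meaning" vectors, invented for illustration.
english = [0.9, 0.1, 0.4]     # stand-in for the English sentence
german = [0.8, 0.2, 0.5]      # stand-in for its German translation
unrelated = [-0.3, 0.9, -0.2]  # stand-in for an unrelated sentence

print(cosine_similarity(english, german))     # close to 1
print(cosine_similarity(english, unrelated))  # much lower
```

Under these assumptions, a good translation lands near the original in the latent space and scores close to 1, while an unrelated sentence scores much lower.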

And indeed, the dot product shows up at the heart of transformer models, which have become the foundation of modern natural language processing (NLP) and many other machine-learning tasks.

The self-attention mechanism is a key component of transformers. Self-attention allows the model to weigh the importance of different input elements with respect to each other, which lets it capture long-range dependencies and complex relationships in the data. In the self-attention mechanism, the dot product is used both to calculate the attention scores and to form context-aware representations of the input elements.

The input elements (usually embeddings of the tokenized input text) are first linearly projected into three different spaces: Query (Q), Key (K), and Value (V), using separate learned weight matrices. This results in three vectors for each input element: a query vector, a key vector, and a value vector.
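The projection step can be sketched in a few lines of plain Python. The embeddings `X` and the matrices `W_q`, `W_k`, and `W_v` below are invented toy values standing in for learned parameters, and the dimensions (4 in, 2 out) are chosen only to keep the example readable. Note that a matrix-vector product is itself just one dot product per row of the matrix.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x: one dot product per row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Toy embeddings for three tokens, 4 dimensions each (values invented).
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Stand-ins for the learned weight matrices, each mapping 4 -> 2 dimensions.
W_q = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
W_k = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
W_v = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

# Each token gets its own query, key, and value vector.
Q = [matvec(W_q, x) for x in X]
K = [matvec(W_k, x) for x in X]
V = [matvec(W_v, x) for x in X]

print(Q)
print(K)
print(V)
```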

The dot product is then used to compute attention scores between each pair of query and key vectors (score_ij = q_i · k_j).

This measures the similarity between the query and key vectors, which determines how much attention the model pays to each input element with respect to all the other elements.
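As a small illustration, the score matrix score_ij = q_i · k_j can be computed with nested dot products. The query and key vectors here are hypothetical values chosen by hand.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention_scores(queries, keys):
    """Return the matrix with entries scores[i][j] = q_i . k_j."""
    return [[dot(q, k) for k in keys] for q in queries]

# Hypothetical query and key vectors for three tokens (values invented).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

scores = attention_scores(queries, keys)
for row in scores:
    print(row)
```

Row i of the result holds the raw scores of token i against every token in the sequence, including itself.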

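After computing all similarity scores, the scores are scaled (in the standard transformer, divided by the square root of the key dimension) and sent through a similarity function that turns them into attention weights. These weights are then used to compute a context vector for each element, which is again a simple weighted sum, this time over the value vectors: context_i = Σ_j (attention_ij * v_j)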
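The scaling, normalization, and weighted sum can be sketched as follows. This assumes the standard choices (scaling by the square root of the key dimension `d_k` and a row-wise softmax); the scores and value vectors are invented, and the function name `attend` is just a label for this example.

```python
import math

def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values, d_k):
    """Scale scores by sqrt(d_k), apply softmax per row, then sum the values."""
    scale = math.sqrt(d_k)
    weights = [softmax([s / scale for s in row]) for row in scores]
    dim = len(values[0])
    # context_i = sum_j attention_ij * v_j, computed one dimension at a time
    contexts = [
        [sum(w * v[d] for w, v in zip(w_row, values)) for d in range(dim)]
        for w_row in weights
    ]
    return weights, contexts

# Hypothetical scores (q_i . k_j) and value vectors for three tokens.
scores = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights, contexts = attend(scores, values, d_k=2)
for w_row, c in zip(weights, contexts):
    print(w_row, c)
```

In the third row all scores are equal, so the weights are uniform and the context vector is simply the average of the value vectors.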

The usual choice for the similarity function is the softmax. It can be thought of as a kernel function: a nonlinear transformation that enables the comparison between elements and estimates which of them might be useful for prediction. Other kernel functions are possible, depending on the problem at hand. In a more fundamental sense, transformers can be viewed as kernel machines (more precisely as Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines, as this paper discusses).
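One way to picture the kernel view is to treat the normalization step as "apply a non-negative function to each score, then divide by the total". The sketch below is an illustration under that assumption only. With the exponential it reproduces the softmax; the degree-2 polynomial kernel is a hypothetical alternative picked for this example, not one the article or the cited paper prescribes.

```python
import math

def kernel_weights(scores, kernel):
    """Apply a non-negative kernel to each score, then normalize to sum to 1."""
    k = [kernel(s) for s in scores]
    total = sum(k)
    return [x / total for x in k]

# Hypothetical raw scores of one query against three keys.
scores = [2.0, 1.0, 0.0]

# The exponential kernel gives exactly the softmax weights.
softmax_weights = kernel_weights(scores, math.exp)

# Illustrative alternative: a degree-2 polynomial kernel, (1 + s)^2.
poly_weights = kernel_weights(scores, lambda s: (1.0 + s) ** 2)

print(softmax_weights)
print(poly_weights)
```

Both choices rank the keys in the same order here, but they distribute the attention mass differently, which is the sense in which the kernel is a design decision.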

As with the other examples, the dot product, combined with learned projections and nonlinear transformations of the input and output text, defines the self-attention mechanism.

