Transformers for Tabular Data (Part 3): Piecewise Linear & Periodic Encodings

Photo by Pawel Czerwinski on Unsplash

Introduction

This is the third part in my exploration of Transformers for Tabular Data.

In Part 2, I described linear numerical embeddings and how they are used in the FT-Transformer model. This post explores more complex versions of numerical embeddings, so if you haven’t read the previous part, I highly recommend starting there and returning to this post afterwards.

FT-Transformer. Image by author.

As a reminder, above you can see the architecture of the previously explored FT-Transformer. This model first embeds both numerical and categorical features and then passes these embeddings through the Transformer layers to obtain the final CLS token representation.

Embedding numerical features is a relatively new research topic, and this post takes a deep dive into two recently proposed numerical embedding methods: Piecewise Linear Encoding and Periodic Encoding. Both were described by Gorishniy et al. (2022) in the paper On Embeddings for Numerical Features in Tabular Deep Learning. Make sure to check it out after going through this post!

If you’re interested in simply applying these methods, then head over to the practical notebook where I show how to use them with the tabtransformertf package. If you’re interested in how these methods actually work, then keep on reading!

Numerical Embeddings

Numerical embedding layers transform a single float into a dense numerical representation (embedding). This transformation is useful because these embeddings can be passed through the Transformer blocks together with the categorical ones, which gives the model more context to learn from.

Linear Embeddings

Linear embeddings. Image by author.

As a quick recap, linear embedding layers are simple fully connected layers (optionally with ReLU activation). It’s important that these layers don’t share weights with each other, so there’s one embedding layer per numerical feature. For more information, read the previous post about the FT-Transformer.
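To make the "one layer per feature" point concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions rather than the tabtransformertf implementation: the function name, the random weights, and the toy input are all made up for this example.

```python
import numpy as np

def linear_embeddings(x, weights, biases, relu=False):
    """Embed every numerical feature with its own weight and bias vectors.

    x:       (n_samples, n_features) raw feature values
    weights: (n_features, embedding_dim), one row per feature (no sharing)
    biases:  (n_features, embedding_dim)
    returns: (n_samples, n_features, embedding_dim)
    """
    out = x[:, :, None] * weights[None, :, :] + biases[None, :, :]
    return np.maximum(out, 0.0) if relu else out

# Hypothetical parameters; in a real model they are learned during training.
rng = np.random.default_rng(0)
n_features, embedding_dim = 2, 4
weights = rng.normal(size=(n_features, embedding_dim))
biases = np.zeros((n_features, embedding_dim))

x = np.array([[0.5, -1.0],
              [2.0, 0.0]])
emb = linear_embeddings(x, weights, biases)
print(emb.shape)  # (2, 2, 4)
```

Each feature ends up as its own embedding vector, which is what lets the Transformer treat numerical features as tokens alongside the categorical ones.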

Periodic Embeddings

The idea of periodic activations is quite prevalent in ML right now. For example, periodic encodings in the Transformer architecture allow the model to represent the position of words in a sentence (you can read more about it e.g. here). But how exactly can this idea be applied to tabular data? Gorishniy et al. (2022) propose the following equation to encode a feature x:

Periodic encoding equation. Source: Gorishniy et al. (2022)
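In the paper's notation, as far as I can transcribe it, the encoding with k learned coefficients can be written as:

```latex
f_i(x) = \mathrm{Periodic}(x) = \mathrm{concat}\left[\sin(v), \cos(v)\right],
\qquad
v = \left[2\pi c_1 x, \ldots, 2\pi c_k x\right]
```

Here c_1, ..., c_k are the learned parameters for the feature, and the factor 2π simply rescales them.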

Let’s try to unpack this approach. There are three main steps in the encoding process:

  1. Transformation into pre-activation values (v) using a learned vector (c)
  2. Activation of values (v) using Sine and Cosine
  3. Concatenation of Sine and Cosine values

The first step is where the learning happens. The raw values of a feature get multiplied by learned parameters c_i, where i indexes the embedding dimensions. So, if we choose the embedding dimensionality to be 3, there will be 3 parameters to learn per feature.

For an illustrative example, consider a randomly generated feature below.

Random feature distribution. Plot by author.

Using three different c parameters, we can transform it into three pre-activation values (i.e. embeddings with dimensionality of 3).

Periodic pre-activation embedding of the random feature. Plot by author.

Then, these pre-activation values get transformed into post-activation values using Sine and Cosine operations.

Periodic post-activation embeddings of the random feature. Plot by author.

As you can see, the slope of the pre-activation determines the frequency of the periodic activation. Pre-activations with a larger slope (blue line) produce post-activation values with a higher frequency, whereas pre-activations with a small slope (green and orange lines) produce low-frequency activations. Judging from the diagram above, a feature value of 1 would be encoded approximately as [-0.98, -0.90, 0.85] and -1 would be encoded as [0.97, 0.8, -0.9].
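The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the package's code: the coefficients in c are hypothetical fixed values (in a real model they are learned), and the 2π factor follows the equation from the paper.

```python
import numpy as np

def periodic_encoding(x, c):
    """Periodic encoding of one numerical feature.

    x: (n_samples,) raw feature values
    c: (k,) coefficients, one per pre-activation dimension
    returns: (n_samples, 2 * k), sine values followed by cosine values
    """
    v = 2.0 * np.pi * x[:, None] * c[None, :]              # step 1: pre-activations
    return np.concatenate([np.sin(v), np.cos(v)], axis=1)  # steps 2 and 3

c = np.array([0.05, 0.1, 0.5])  # hypothetical values, larger c means higher frequency
x = np.array([-1.0, 0.0, 1.0])
enc = periodic_encoding(x, c)
print(enc.shape)  # (3, 6)
```

Note that concatenating sine and cosine doubles the size: k coefficients produce 2k values per feature value.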

Periodic embedding. Image by author.

The authors also suggest adding an additional linear layer on top of the periodic encoding, so the final embedding diagram looks as displayed above.

Piecewise Linear Encoding (Quantile Binning)

This embedding method takes inspiration from one-hot encoding, a popular categorical encoding method, and adapts it to numerical features. The first step in this process is to split a feature into t bins. The authors suggest two splitting methods: quantile binning and target binning. This section describes the first method, and the second will be covered later.

Quantile binning is relatively straightforward: we split our feature into t bins that each contain roughly the same number of observations (equal frequency rather than equal width). For example, if we want to end up with 3 bins (i.e. t = 3), the quantiles to calculate are 0, 1/3, 2/3, and 1.

Quantile binning of the random feature. Plot by author.
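As a quick sketch of how such bin edges could be computed, the snippet below uses np.quantile on a synthetic feature. The helper name and the random data are assumptions for illustration, so the edges will not match the plot above.

```python
import numpy as np

def quantile_bin_edges(values, n_bins):
    """Return n_bins + 1 edges so that each bin holds about the same number of values."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)  # e.g. 0, 1/3, 2/3, 1 for 3 bins
    return np.quantile(values, quantiles)

rng = np.random.default_rng(42)
feature = rng.normal(size=1000)  # synthetic stand-in for the random feature

edges = quantile_bin_edges(feature, n_bins=3)
counts, _ = np.histogram(feature, bins=edges)
print(edges)   # 4 edges, from the minimum to the maximum of the feature
print(counts)  # roughly a third of the observations in each bin
```

The bins have very different widths (the middle one is narrow, the outer ones are wide) because the values are concentrated around zero.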

Each bin (B_t) is represented as a half-open interval [bin_start, bin_end), so in this case we end up with 3 bins: [-3.85, -0.41), [-0.41, 0.44), and [0.44, 3.26). The formula notation for this representation is as follows:

Bins formula notation. Source: Gorishniy et al. (2022)

Once we have obtained these bins, we can start encoding the feature. The encoding formula is presented below.

PLE formula. Source: Gorishniy et al. (2022)
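Written out, and to the best of my transcription of the paper, the encoding for a feature with T bins (T is the number of bins, called t in the text above, while the subscript t indexes a single bin with edges b_{t-1} and b_t) is:

```latex
\mathrm{PLE}(x) = [e_1, \ldots, e_T],
\qquad
e_t =
\begin{cases}
0, & x < b_{t-1} \text{ and } t > 1 \\
1, & x \ge b_t \text{ and } t < T \\
\dfrac{x - b_{t-1}}{b_t - b_{t-1}}, & \text{otherwise}
\end{cases}
```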

As you can see, for each value we’re going to end up with a t-dimensional embedding. Each component takes one of three kinds of value: 0, 1, or something in between. After applying this formula for each bin and for each value, our embeddings end up looking like this:

PLE embeddings of the random feature. Plot by author.

As you can see, smaller values have only one “active embedding” (PLE 1), while in the middle bin PLE 2 becomes active as well. Finally, in the last bin, all three embeddings get activated. This way, a value of -1 turns approximately into [0.8, 0.0, 0.0] and 1 transforms into [1.0, 1.0, 0.2].
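The two example encodings can be reproduced with a short NumPy sketch of the PLE formula. The function is my own illustration (not the tabtransformertf code), and the bin edges are the ones read off the quantile plot above.

```python
import numpy as np

def piecewise_linear_encoding(x, edges):
    """Piecewise linear encoding of one feature.

    x:     (n_samples,) raw feature values
    edges: (n_bins + 1,) sorted bin edges
    returns: (n_samples, n_bins)
    """
    x = np.asarray(x, dtype=float)[:, None]
    lower, upper = edges[:-1][None, :], edges[1:][None, :]
    enc = (x - lower) / (upper - lower)          # position of x inside each bin
    enc[:, 1:] = np.maximum(enc[:, 1:], 0.0)     # bins after the first: 0 if x is below the bin
    enc[:, :-1] = np.minimum(enc[:, :-1], 1.0)   # bins before the last: 1 if x is above the bin
    return enc

edges = np.array([-3.85, -0.41, 0.44, 3.26])  # bin edges from the example above
enc = piecewise_linear_encoding([-1.0, 1.0], edges)
print(np.round(enc, 2))
# Roughly [0.83, 0, 0] for -1 and [1, 1, 0.2] for 1, matching the values read off the plot.
```

Bins that lie entirely below a value are "filled" with 1, bins entirely above it stay at 0, and only the bin containing the value carries a fractional component.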

PLE embeddings with quantile binning. Image by author.

Target Binning Approach

Target binning uses a decision tree to help construct the bins. As we saw above, the quantile approach places the same number of observations in each bin without looking at the target, which might be suboptimal in certain cases. A decision tree can instead find the most meaningful splits with regard to the target. For example, if there were more target variance towards the larger values of the feature, the majority of the bins could move to the right.

PLE embeddings with target binning. Image by author.
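One way to sketch target binning is to fit a single-feature scikit-learn decision tree and read the bin edges from its split thresholds. This is an assumption about how such binning could be done, not a copy of the paper's or the package's code; the helper name and the synthetic data (where the target only varies for feature values above 1) are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def target_bin_edges(values, target, max_bins):
    """Derive bin edges for one feature from the splits of a decision tree."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_bins, random_state=0)
    tree.fit(values.reshape(-1, 1), target)
    is_split = tree.tree_.feature >= 0  # leaf nodes are marked with a negative value
    thresholds = np.sort(tree.tree_.threshold[is_split])
    return np.concatenate([[values.min()], thresholds, [values.max()]])

rng = np.random.default_rng(0)
feature = rng.normal(size=2000)
# Synthetic target: flat for small feature values, steep for large ones.
target = np.where(feature > 1.0, 5.0 * feature, 0.0) + rng.normal(scale=0.1, size=2000)

edges = target_bin_edges(feature, target, max_bins=4)
print(edges)  # most inner edges should sit on the right, where the target varies
```

The resulting edges can then be fed into the same piecewise linear encoding as the quantile edges.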

Reported Results

Extract from results reported by Gorishniy et al. (2022)

The paper presents an extensive comparative study of all the proposed embedding methods combined with MLP, ResNet, and Transformer architectures. In this table, L stands for Linear, Q stands for Quantile, T stands for Target, LR stands for Linear with ReLU, and P stands for Periodic.

As can be seen from the table above, there’s no single winner across the datasets (No Free Lunch Theorem in action), hence the embedding type might be treated as yet another hyperparameter to tune. Nevertheless, most of the time we see a significant improvement in performance when we compare Periodic and PLE encodings with simple linear embeddings.

Validating Results

Let’s see if we can re-create the results from this paper on a popular toy dataset, California Housing. You can see the full working notebook here, whereas below I’ll cover the main parts necessary for modelling. Like in the previous posts, I’ll be using my tabtransformertf package (please give it a star ⭐️ if you like it), which you can easily install (or update) using the command pip install -U tabtransformertf .

Data Download and Pre-processing

We can download the data from sklearn's collection of toy datasets. The pre-processing procedure is quite simple: a train/val/test split, scaling the data, and transforming it into a TF Dataset.
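The split-and-scale part might look roughly like the sketch below. The split proportions, the seed, and the helper name are my assumptions (the exact values are in the notebook), a synthetic array stands in for the real data so the snippet runs offline, and the conversion to a TF Dataset is not shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_and_scale(X, y, seed=42):
    """Train/val/test split followed by scaling fitted on the training part only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=seed)
    scaler = StandardScaler().fit(X_train)  # fit on train to avoid leakage
    return ((scaler.transform(X_train), y_train),
            (scaler.transform(X_val), y_val),
            (scaler.transform(X_test), y_test))

# With network access, the real data would come from:
#   from sklearn.datasets import fetch_california_housing
#   X, y = fetch_california_housing(return_X_y=True)
# Synthetic stand-in with the same number of features (8):
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 8)), rng.normal(size=1000)

train, val, test = split_and_scale(X, y)
print(train[0].shape, val[0].shape, test[0].shape)  # (640, 8) (160, 8) (200, 8)
```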

Periodic Embeddings

To use the periodic embeddings, all you need to do is to specify it in the numerical_embedding_type parameter. All the other parameters I’ve already covered in the previous post, so please refer to it if you have any questions.

PLE with Quantile Binning Embeddings

The same procedure applies to PLE-Quantile embeddings. The only thing you need to change is to set the numerical_embedding_type parameter to ple .

PLE with Target Binning Embeddings

If instead of quantile binning you’d prefer to use the target-based one, you’ll need to specify a few additional parameters.

  • The target (parameter y) needs to be provided for a Decision Tree to train on.
  • The Decision Tree task needs to be specified and it can be either regression or classification .
  • Additional Decision Tree parameters (ple_tree_params) can be specified as well (Optional)

FT Transformer Training

You can train the FT transformer with embeddings just like any other Keras model. Below you can see the code for training one of the models but the same steps apply for the rest of them.

Evaluation

Now that the models are trained, we can compare them to each other, to the baseline, and to the reported results in the paper. Please keep in mind that the numbers reported here are just for a single model run and train/test split, so your results are very likely going to differ (but the relative performance should stay roughly the same).

Validation loss history. Generated by author.

As you can see, for this particular dataset there’s a lot of value in using more complex numerical embeddings. FT-Transformers with PLE embeddings give the best results, followed by the Periodic embeddings.

RMSE metrics by model. Generated by author.

When we compare the results to two tree-based models, Random Forest and CatBoost, we can see that the FT-Transformers with PLE embeddings outperform the former and come close to the performance of the latter. This is quite impressive given that the dataset is small and not that deep learning friendly.

The observed performance is worse than the one reported in the paper. This is most likely due to sub-optimal hyperparameters or differences in the implementations. Also, the results reported in the paper are averaged across multiple runs, which might explain some of the variation as well.

Conclusion

In this post we explored two powerful numerical embedding methods: Periodic Encoding and Piecewise Linear Encoding. You saw how they transform numerical features and how they can be used with the FT-Transformer using the tabtransformertf package.

While these two methods apply very different logic to embedding numerical features, both of them can be hugely beneficial for the performance of your deep learning model. The main disadvantage of these approaches is that the model might take a bit longer to train, but on a GPU the difference is negligible. So try them out on your dataset and let me know how it goes!
