Hyperspherical Alternatives to Softmax
Can Hyperspherical Alternatives Compete Against Softmax?
In the context of classification problems, a softmax classifier with a cross-entropy loss is often the go-to approach. However, in situations with many classes, softmax can be slow to train as it requires an output node for every class, leading to very large output layers. For example, a network with a hidden layer size of 300 and 100,000 output classes has 30 million parameters in the output layer alone.
In applied AI settings, these types of problems occur often. An example is learning to match papers with authors, or product descriptions with actual products. While this looks like a good fit for a classification problem, the space of possible authors or products is so large that training with a softmax is unlikely to be feasible. One way to circumvent this problem is to learn or pre-define prototypes for each class, as in, e.g., prototypical networks [1]. Using these pre-defined or learned prototypes allows you to use a constant output size, regardless of the number of classes.
In this post, we investigate Hyperspherical Prototype Networks (HPNs) [2], which promise a constant output size with performance comparable to, or better than, that of softmax networks. We’ll further dive into the relation between network output size and performance for HPNs. And we’ll show that, while HPNs work well for relatively low output dimensionalities, softmax greatly outperforms them as the number of classes increases. This is, interestingly, exactly the opposite of what we expected, which is why we decided to write a blog post about it.
I’ll start with an explanation of some of the basics of both softmax networks and HPN networks before getting deeper into our surprising results.
When using neural networks in a single-label classification setting, the typical approach is to use a softmax network. That is, a network of indeterminate size and layers that has an output layer with |C| units, where C is the set of output classes, and a softmax activation function:
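That is, for output logits z₁, …, z₍|C|₎:

```latex
\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{|C|} e^{z_j}}
```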
Softmax classifiers are typically trained by minimizing the cross entropy between the predictions of a network and the targets. This can be understood as attempting to maximize the magnitude of the correct output in relation to the incorrect output units.
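For a one-hot target with correct class y, minimizing the cross-entropy amounts to maximizing the log-probability the softmax assigns to y:

```latex
\mathcal{L}_{\text{CE}}(x, y) = -\log \operatorname{softmax}(f(x))_{y}
```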
As a softmax classifier has |C| output units, the output layer of a softmax network grows by H parameters for each additional class, where H is the number of hidden units in the last layer.
Hyperspherical Prototype Networks
In contrast, Hyperspherical Prototype Networks (HPNs) [2] have an output layer with a pre-defined, fixed dimensionality instead of one unit per class. In theory, this means that the number of output units in an HPN can be much smaller than the number of classes, which could be efficient for large numbers of classes. Instead of the cross-entropy, HPNs learn by minimizing the squared cosine distance from a network output to a pre-defined prototype:
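Writing f(x) for the network output and p_y for the prototype of the correct class y, that loss is:

```latex
\mathcal{L}_{\text{HPN}}(x, y) = \bigl(1 - \cos\theta_{f(x),\,p_y}\bigr)^2,
\qquad
\cos\theta_{a,b} = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
```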
In a sense, this requires the network to learn to “rotate” an instance to the prototype of the correct class. Note that this loss function only needs the prototype of the correct class, unlike, e.g., prototypical networks, which need access to all relevant prototypes. Now, what is particular about Hyperspherical prototype networks is that the prototypes are static, i.e., not updated during training, and orthogonal. Let’s unpack why both properties are important.
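At inference time, classification then reduces to finding the most similar prototype. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def hpn_predict(z, prototypes):
    """Assign each network output to the class of its most similar prototype.

    z:          (n_samples, dim) network outputs
    prototypes: (n_classes, dim) fixed class prototypes
    """
    z_hat = z / np.linalg.norm(z, axis=1, keepdims=True)
    p_hat = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cos_sim = z_hat @ p_hat.T            # (n_samples, n_classes)
    return cos_sim.argmax(axis=1)        # class of the closest prototype
```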
Diving into Static and Orthogonal properties
If the prototypes were not static, the loss function above would collapse into a trivial solution. And as we all know, if there is a trivial solution to a problem, backpropagation will find it. To understand why this collapses into a trivial solution, consider that the loss function only causes the network to minimize the distance between prototypes and network outputs, but doesn’t specify that the prototypes representing different classes should be distinct. Hence, the network can perfectly solve any learning problem by outputting the same constant vector for every input, and by assigning all prototypes that exact same vector. If learned prototypes are desirable, e.g., for zero- or few-shot learning, this problem can be circumvented by negative sampling [3].
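A tiny sketch of the collapse, assuming learnable prototypes and the squared cosine loss: one shared prototype vector plus a constant network output drives the loss to (numerically) zero for every class and every input:

```python
import numpy as np

def squared_cosine_loss(z, p):
    """Squared cosine distance between a network output z and a prototype p."""
    cos = z @ p / (np.linalg.norm(z) * np.linalg.norm(p))
    return (1.0 - cos) ** 2

# Degenerate "solution": a single shared prototype, constant network output.
shared = np.ones(8)
outputs = [shared.copy() for _ in range(5)]   # the network ignores its input
losses = [squared_cosine_loss(z, shared) for z in outputs]
print(losses)  # all numerically zero, regardless of class or input
```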
It is less obvious why the prototypes should be orthogonal. Intuitively, two prototypes belonging to different classes should not have the same representation, as this would mean that the classes are indistinguishable. So, prototypes that are less similar are easier to separate. As orthogonal representations are, quite literally, the most dissimilar, they can also be thought of as optimally separable. Therefore, it is in a sense not true that the representations should be orthogonal; they just perform a lot better when they are. In some settings, such as zero- or few-shot classification, having representations that are meaningfully correlated can be hugely beneficial for performance, see, e.g., [3].
Because the dimensionality of the orthogonal representations is controllable, an HPN allows for flexibility in the number of output units. If the number of dimensions of the orthogonal representations is equal to the number of classes, an HPN has the same number of parameters as an equivalent softmax classifier.
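The resulting head sizes are easy to compare, using the H = 300 hidden units from the example in the introduction:

```python
H = 300  # hidden units in the last layer

def softmax_head_params(n_classes, hidden=H):
    return hidden * n_classes        # one output unit per class

def hpn_head_params(proto_dim, hidden=H):
    return hidden * proto_dim        # independent of the number of classes

print(softmax_head_params(100_000))  # 30,000,000 parameters
print(hpn_head_params(100))          # 30,000 parameters, for any class count
```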
Creating orthogonal representations
Making orthogonal representations seems easy, but it isn’t. As noted in the HPN paper [2], the general problem of placing points on a hypersphere such that they are maximally separated is still open and is known as the Tammes problem [4].
A solution was investigated in the paper that introduced hyperspherical prototype networks [2]. In short, they first randomly generate vectors and then, for each vector, minimize the cosine similarity to its closest vector, renormalizing after each iteration. In our implementation, we don’t explicitly renormalize; instead, we add the mean deviation of the L2 norm from 1 to the loss. This causes the prototypes to be quasi-orthogonal without explicit renormalization. Compared to explicitly normalizing after each iteration, this has the advantage of leading to a lower loss, and therefore to representations that are more orthogonal on average.
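A toy NumPy sketch of this idea, not the paper’s exact procedure: each prototype is pushed away from its most similar neighbour, while deviations of the L2 norm from 1 are softly penalized instead of renormalizing after every step:

```python
import numpy as np

def make_prototypes(n_classes, dim, steps=500, lr=0.05, seed=0):
    """Quasi-orthogonal prototypes via iterative separation (toy version)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(n_classes, dim))
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    for _ in range(steps):
        norms = np.linalg.norm(P, axis=1, keepdims=True)
        Q = P / norms
        S = Q @ Q.T                    # pairwise cosine similarities
        np.fill_diagonal(S, -np.inf)   # ignore self-similarity
        nearest = S.argmax(axis=1)     # most similar other prototype
        # Push each prototype away from its nearest neighbour, and softly
        # pull its norm back towards 1 instead of explicit renormalization.
        P -= lr * (Q[nearest] + (norms - 1.0) * Q)
    return P / np.linalg.norm(P, axis=1, keepdims=True)
```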
Note that training these representations through backprop is relatively efficient, but it can still take a long time, especially if we try to pack many representations in a low-dimensional space, e.g., packing 100,000 representations into a 100-dimensional space.
To show the effectiveness of generating orthogonal representations, we also compare to randomly generated representations sampled from an N-dimensional normal distribution. High-dimensional random vectors become increasingly close to orthogonal as the number of dimensions increases, but are never exactly orthogonal. Any score difference between randomly generated representations and the trained representations thus shows the added value of the orthogonalization procedure.
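This near-orthogonality of random Gaussian vectors is easy to check directly; the expected absolute cosine similarity between two random directions shrinks roughly like 1/√d:

```python
import numpy as np

def mean_abs_cosine(n_vectors, dim, seed=0):
    """Mean absolute pairwise cosine similarity of random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(n_vectors, dim))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    S = V @ V.T
    np.fill_diagonal(S, 0.0)                       # drop self-similarity
    return np.abs(S).sum() / (n_vectors * (n_vectors - 1))

print(mean_abs_cosine(100, 10))    # roughly 0.25
print(mean_abs_cosine(100, 1000))  # roughly 0.025
```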
The actual experiment
We are specifically interested in the trade-off between dimensionality and number of classes on the one hand, and performance and training and inference speed on the other. Low-dimensional orthogonal prototypes, as opposed to high-dimensional ones, are faster to train, but may come at a cost in performance.
To test the difference between softmax and HPN networks, we use two identical networks, consisting of two hidden layers with batch normalization, Rectified Linear Unit (ReLU) activations, dropout, and 300 hidden units. The only difference between the networks is the top layer. In the case of softmax, this is a linear layer with (300, |C|) units, while in the case of the HPN, the size of this layer depends on the output dimensionality of the orthogonal prototypes. All networks were optimized using the Adam optimizer [5], with a learning rate of 1e-3, using early stopping on a fixed validation set.
We generate datasets with 10, 100, and 1,000 classes. For each class in each dataset, we define a center by uniformly sampling a 300-dimensional representation from the interval [-1, 1]. For each center, we then draw samples from a normal distribution with this center as mean and a standard deviation depending on the number of classes. The standard deviations are listed in Table 1.
| n classes          | 10 | 100 | 1000 |
| ------------------ | -- | --- | ---- |
| standard deviation | 3  | 2.5 | 1.25 |

Table 1: the standard deviations for the different numbers of classes
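The data generation above can be sketched as follows (the number of samples per class is our assumption; the post does not state it):

```python
import numpy as np

def make_dataset(n_classes, std, samples_per_class=50, dim=300, seed=0):
    """Gaussian blobs around uniformly sampled class centers in [-1, 1]^dim."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-1.0, 1.0, size=(n_classes, dim))
    X = np.concatenate(
        [rng.normal(loc=c, scale=std, size=(samples_per_class, dim))
         for c in centers]
    )
    y = np.repeat(np.arange(n_classes), samples_per_class)
    return X, y

# Standard deviations from Table 1, keyed by number of classes.
STDS = {10: 3.0, 100: 2.5, 1000: 1.25}
X, y = make_dataset(10, STDS[10])
```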
For each dataset, we also generate 3 sets of prototypes with 10, 100, and 1000 dimensions, respectively. These sets of prototypes allow us to investigate the effect of output dimensionality on performance. We also tried using more than 1000 classes (10,000 and 100,000), but creating orthogonal prototypes for these classes took too long, much longer than training the networks.
The F-scores on the test set, averaged over five runs, are listed in Table 2. Note that only models with as many prototype dimensions as the number of classes are comparable to the softmax model.
For 10 and 100 classes, HPNs tend to perform on par with softmax models, while for the 1,000-class model, the HPN underperforms. Increasing the dimensionality of the prototypes only leads to a marginal increase in score, indicating a ceiling effect on performance. Once you have enough room to work with, adding more room doesn’t do much.
Comparing randomly generated prototypes to optimized prototypes shows some interesting patterns. For the corresponding dimensionality, i.e., 10-dimensional prototypes for 10 classes, choosing randomized prototypes lowers performance. When moving towards higher-dimensional prototypes, however, this performance difference disappears. Revisiting the metaphor above: even randomized prototypes provide the model with enough room to work with if the dimensionality is high enough. Because generating random samples from a normal distribution takes almost no time, this could be interesting for problems with many classes, in which generating truly orthogonal prototypes is infeasible.
| n classes | softmax | 10    | 100  | 1000 | r10   | r100 | r1000 |
| --------- | ------- | ----- | ---- | ---- | ----- | ---- | ----- |
| 10        | 91.3    | 91.2  | 92.6 | 92.5 | 78.22 | 92.5 | 92.8  |
| 100       | 84.5    | 72.9  | 86.0 | 85.3 | 73.6  | 85.7 | 86.1  |
| 1000      | 83.8    | 15.8  | 58.3 | 47.3 | 14.9  | 54.3 | 47.3  |

Table 2: F-scores on the test set for the various models. Numeric column headers give the HPN prototype dimensionality; the “r” prefix stands for randomly generated prototypes.
The speed per epoch is comparable for all models. This is surprising, as we expected softmax to slow down relative to the cosine similarity as the number of classes increases. There is, however, an appreciable difference in the number of epochs until convergence, but only when the number of classes is equal to the dimensionality of the prototypes. Table 3 summarizes some examples.
| n classes              | 10 | 100 | 1000 |
| ---------------------- | -- | --- | ---- |
| softmax                | 50 | 146 | 181  |
| HPN (dim == n classes) | 69 | 147 | 206  |

Table 3: The number of epochs until convergence for softmax, and for the HPNs in which the prototype dimensionality matched the number of classes.
In our experiments, we saw that there is very little advantage to using HPNs. We expected them to outperform softmax when used with many classes, and to be more time efficient. Both things turned out to be false, but likely for different reasons.
Let’s tackle the speed difference first: we assumed that, because HPNs only look at the prototype of the positive class, i.e., they don’t need negative sampling, they would also be more efficient. Despite this on-paper efficiency, the output dimensionality of the network needs to approximately match the number of classes, as shown above. This means that the number of operations performed in calculating the loss is approximately equal for HPNs and softmax networks. In addition, it is likely that softmax can be implemented more efficiently.
More surprising, and more critical, than the speed difference is the difference in performance. While the HPN sometimes outperformed softmax for lower dimensions, 1000-dimensional prototypes did not work at all. It is likely that the loss function to orthogonalize the prototypes is not able to pack enough prototypes as the dimensionality of the space goes up. Given that there seem to be very few advantages to using HPNs, even for settings in which they work, we think the correct response is not to improve the orthogonality of the space, but to find formulations of HPNs that are robust to correlated output representations.
Finally, we would like to add that we used artificial data because we wanted to be able to control the number of classes. In real-life settings, the difference between softmax and HPN might be smaller or larger. For example, it might be the case that we inadvertently created a problem that suited HPNs or softmax particularly well. The authors of the original paper [2] tried a variety of datasets, mostly image sets, and showed that HPNs marginally outperformed softmax. Critically, they did not use datasets with more than 200 classes.
We showed that, while HPNs seem to be a promising avenue for classification with many classes, their performance goes down drastically as the number of classes increases. These results cast some doubt on the utility of hyperspherical prototype networks for classification tasks.
[1] Snell, J., Swersky, K. and Zemel, R., Prototypical networks for few-shot learning, (2017), Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4080–4090.
[2] Mettes, P., van der Pol, E. and Snoek, C.G., Hyperspherical prototype networks, (2019), arXiv preprint arXiv:1901.10514.
[3] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J., Distributed representations of words and phrases and their compositionality, (2013), Advances in Neural Information Processing Systems, pp. 3111–3119.
[4] Tammes, P.M.L., On the origin of number and arrangement of the places of exit on the surface of pollen-grains, (1930), Recueil des travaux botaniques néerlandais, pp. 1–84.
[5] Kingma, D.P. and Ba, J., Adam: A method for stochastic optimization, (2014), arXiv preprint arXiv:1412.6980.