NT-Xent Loss: A Quick Overview
The Normalized Temperature-scaled Cross-Entropy loss (NT-Xent loss), a.k.a. the “multi-class N-pair loss”, is a loss function used for metric learning and self-supervised learning. Kihyuk Sohn first introduced it in his paper “Improved Deep Metric Learning with Multi-class N-pair Loss Objective”. It was later popularized by the SimCLR paper, which used the now more common name “NT-Xent”.
In this article, I will cover the following points:

NT-Xent Loss Definition

High-level Overview

Deeper Dive

Pseudocode

Implementation
NT-Xent Loss Definition
ℓ(i, j) = −log [ exp(sim(zᵢ, zⱼ)/τ) / Σ_{k=1}^{2N} 𝟙[k≠i] exp(sim(zᵢ, zₖ)/τ) ]
Eq 1: NT-Xent loss as given in the SimCLR paper
Despite its intimidating look at first glance, the equation is fairly simple. The NT-Xent loss ℓ(i, j) denotes the loss observed for a pair of samples i and j.
To understand what’s happening here, let’s first get a high-level understanding of what the equation is giving us. This will be followed by a section going into a bit more detail.
A High-Level Overview of NT-Xent
Let’s look at the fraction. As a whole, it may seem familiar to those who’ve come across the softmax activation function. The difference is that, instead of taking the ratio of one exponentiated output node against the sum over all output nodes (as in softmax), here we take the ratio of the exponentiated similarity between two output vectors against the sum of exponentiated similarities between one output vector and every other output vector. Thus, a high similarity between i and j results in the fraction taking a higher value than it would for a pair with lower similarity.
However, what we need is the opposite of this. We need pairs with higher similarities to give a lower loss. To achieve this, the result is negated.
The next question you may have is, “Why do we need the log term?”

For one thing, it ensures that the loss we get is positive. The fraction is always less than one, and the logarithm of a number less than one is negative. This, along with the negation mentioned earlier, results in a positive loss.

The other thing is that it counteracts the possibility of very large exponentials in the denominator causing very small fractions. Changes in very small values may not be noticeable to the network, so rescaling with the log is favorable.
And the final reason is: it just works.
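To make this concrete, here is a small numerical sketch in plain Python. The similarity values (0.9 and 0.1) and the helper name `pair_loss` are arbitrary choices for illustration, not from the SimCLR paper:

```python
import math

def pair_loss(sim_pos, sims_of_i, tau=0.5):
    """-log of the softmax-like fraction for one pair of samples."""
    # numerator: exponentiated similarity of the pair (i, j)
    num = math.exp(sim_pos / tau)
    # denominator: exponentiated similarities of i to every other sample
    den = sum(math.exp(s / tau) for s in sims_of_i)
    return -math.log(num / den)

# a similar pair (sim 0.9) vs a dissimilar pair (sim 0.1),
# each measured against the same remaining similarities
high_sim_loss = pair_loss(0.9, [0.9, 0.1, 0.1])
low_sim_loss = pair_loss(0.1, [0.1, 0.9, 0.9])
```

As expected, the similar pair yields the lower loss, and both losses are positive because the fraction is always less than one.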
A Deeper Dive
Cosine Similarity
sim(zᵢ, zⱼ) gives us the (cosine) similarity between the vectors zᵢ and zⱼ. These vectors are usually the output of some neural network. To put it simply, the more closely the two vectors point in the same direction, the higher the resulting value.
Another thing to note in the below equation (Eq 2) is that, because we divide by the magnitudes of the two vectors, cosine similarity implicitly applies an L2 normalization to each vector. Empirically, the SimCLR paper showed that this normalization, along with the temperature (τ), leads to significant improvements in contrastive accuracy.
sim(zᵢ, zⱼ) = zᵢ · zⱼ / (‖zᵢ‖ ‖zⱼ‖)
Eq 2: Cosine similarity
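For reference, Eq 2 can be computed in a few lines of plain Python (a minimal sketch; `cosine_similarity` is my own helper name):

```python
import math

def cosine_similarity(u, v):
    """sim(u, v) = u.v / (||u|| * ||v||)."""
    # dot product of u and v
    dot = sum(a * b for a, b in zip(u, v))
    # Euclidean (L2) norms of u and v
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical directions give 1, orthogonal directions give 0, and rescaling either vector leaves the value unchanged, which is exactly the implicit L2 normalization mentioned above.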
Temperature
The value of this similarity is divided by a value denoted by τ (tau), a.k.a. the temperature. τ is used to control the influence of similar vs dissimilar pairs: the smaller τ is relative to 1, the greater the difference between the value of the term exp(sim(zᵢ, zⱼ)/τ) for a similar pair and that for a dissimilar pair.
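A quick numerical check (plain Python; the similarity values 0.9 and 0.1 and the helper name `contrast` are arbitrary illustrations) shows how shrinking τ widens this gap:

```python
import math

def contrast(sim_pos, sim_neg, tau):
    """Ratio of exp(sim/tau) for a similar vs a dissimilar pair."""
    return math.exp(sim_pos / tau) / math.exp(sim_neg / tau)

# the ratio equals exp((sim_pos - sim_neg) / tau),
# so it grows rapidly as tau shrinks below 1
r_tau_1 = contrast(0.9, 0.1, tau=1.0)   # exp(0.8), roughly 2.2
r_tau_01 = contrast(0.9, 0.1, tau=0.1)  # exp(8), roughly 3000
```

With τ = 1 the similar pair only outweighs the dissimilar one by a factor of about 2; with τ = 0.1 the factor is in the thousands, sharpening the distinction the loss draws between positives and negatives.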
The Summation
Let’s turn our attention to the denominator.
Σ_{k=1}^{2N} 𝟙[k≠i] exp(sim(zᵢ, zₖ)/τ)
The denominator of the NT-Xent loss
As previously mentioned, the NT-Xent loss is applied to pairs of samples. In the case of SimCLR, each pair is obtained by augmenting a single image. Therefore, each sample has one positive sample and 2(N − 1) negative samples. Consequently, to loop over all possible outcomes, we loop from k = 1 to k = 2N and simply skip the case where k and i refer to the same sample. This is achieved using the following indicator term.
𝟙[k≠i]: 1 if k is not equal to i, else 0
Pseudocode
SimCLR pseudocode
The above pseudocode is that of the full SimCLR algorithm; however, it gives a good idea of how one might approach implementing the loss.
Of the steps shown in the image, the lines that concern the NTXent loss can be summarized as follows:

For each minibatch (a batch of N samples), generate a pair of augmented samples from each image (a positive pair) and calculate their output z values.

For every possible pair in the 2N samples that were generated, calculate the pairwise cosine similarity.

For every possible pair in the 2N samples that were generated, calculate the loss.
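The steps above can be sketched in NumPy. This is a minimal, unoptimized sketch: the function name `nt_xent_loss` and the convention of pairing adjacent rows (2k, 2k+1) are my own choices, not SimCLR’s (which pairs rows i and i+N), though the math is the same:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss for a batch of 2N embeddings z, shape (2N, d).

    Rows 2k and 2k+1 are assumed to form a positive pair.
    """
    # L2-normalize so the dot product equals cosine similarity
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    # pairwise similarities, scaled by the temperature
    sim = (z @ z.T) / tau
    # the 1[k != i] indicator: exclude self-similarity from the denominator
    np.fill_diagonal(sim, -np.inf)
    n = z.shape[0]
    # index of each sample's positive partner: 0<->1, 2<->3, ...
    pos = np.arange(n) ^ 1
    # log of the softmax-like fraction, then negate and average over 2N samples
    log_frac = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_frac.mean()
```

For example, `nt_xent_loss(np.random.randn(8, 128))` computes the loss over a batch of four positive pairs; embeddings whose positive partners point the same way yield a lower loss than mismatched ones.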
Implementation of NT-Xent Loss
For most deep learning frameworks, implementations of the NT-Xent loss are readily available on the internet. For example, PyTorch Metric Learning offers a great implementation of the NT-Xent loss for PyTorch, available here. Similarly, one can find implementations for TensorFlow here or here.
Final Thoughts
The NT-Xent loss is gaining more and more traction with the progression of self-supervised learning and other applications of contrastive learning. As a result, it has become essential to have a good understanding of how and why it works, in order to apply it to its strengths. If you’re interested in seeing my other AI Math-related content, feel free to head over to the AI Math section of the blog.