Research on warping contrastive loss
This technical report discusses various approaches to generating sentence embeddings, which are vector representations of text that capture the semantic and contextual meaning of sentences. The goal of sentence embeddings is to represent sentences in a compact, fixed-length format that can be used for a variety of natural language processing tasks, such as text classification, information retrieval, and semantic similarity analysis.
The report focuses on one such technique: augmenting a contrastive loss with randomized warping of the embeddings.
A sentence embedding is a vector in an embedding space. Imagine this vector as a multidimensional lump: when we scale one element of the vector, the lump is stretched or compressed along that dimension. If we continuously scale every element by a different value, the lump can be visualised as warping around a region. Effectively, this warping covers possible sentences close to the original point, as it touches on the surrounding regions.
$$ \text{warped embedding} = \text{embedding} \odot \left(\mathbf{1} + \mathcal{N}(0.3,\ 0.2)\right) $$
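This warping can be sketched in a few lines of PyTorch (a minimal sketch; the function name is illustrative, and the default parameters match the mean 0.3 and standard deviation 0.2 above):

```python
import torch

def warp(emb: torch.Tensor, mu: float = 0.3, sigma: float = 0.2) -> torch.Tensor:
    # Draw an independent Gaussian sample per element and scale that
    # element by (1 + sample), stretching or compressing each dimension.
    noise = torch.randn_like(emb) * sigma + mu
    return emb * (1.0 + noise)

emb = torch.randn(4, 8)   # toy batch of 4 embeddings
warped = warp(emb)
print(warped.shape)       # torch.Size([4, 8])
```

Because the noise is re-sampled on every call, repeated warps of the same embedding trace out a cloud of nearby points, the "lump" described above.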
Given two sets of input embeddings, $a$ and $b$, for a batch size of $N$, the combined loss function, $\text{combined\_loss}(a, b)$, incorporates the steps of normalization, warping, similarity-matrix computation, softmax probability conversion, and cross-entropy calculation as follows:
Warp and Normalize Input Embeddings:
For each input tensor $a_i$ and $b_i$ in the batches $a$ and $b$:
$$ A_i = \frac{a_i}{\|a_i\|_2} \odot (S_{a_i} + \mathbf{1}), \quad B_i = \frac{b_i}{\|b_i\|_2} \odot (S_{b_i} + \mathbf{1}) $$
where $S_{a_i}$ and $S_{b_i}$ are samples from normal distributions parameterized by mean $\mu$ and standard deviation $\sigma$, specific to each element of $a_i$ and $b_i$.
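Concretely, the normalize-then-warp step for one batch might look like this (a sketch; `F.normalize` performs the division by the L2 norm, and the values of `mu` and `sigma` are illustrative):

```python
import torch
import torch.nn.functional as F

mu, sigma = 0.3, 0.2
a = torch.randn(4, 8)                    # batch of raw embeddings
S_a = torch.randn_like(a) * sigma + mu   # per-element Gaussian samples
A = F.normalize(a, p=2, dim=-1) * (S_a + 1.0)
print(A.shape)                           # torch.Size([4, 8])
```

Note that the rows are unit-norm only before the warp; the elementwise scaling afterwards perturbs the norms slightly.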
Compute Similarity Matrix and Apply Softmax:
The similarity matrix $M$ is calculated as:
$$ M = \frac{AB^\top}{\tau} $$
where each element $M_{ij}$ represents the scaled cosine similarity between $A_i$ and $B_j$. The softmax function is then applied to each row of $M$ to convert these similarities into probabilities:
$$ P_{ij} = \frac{e^{M_{ij}}}{\sum_{k=1}^{N} e^{M_{ik}}} $$
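With the warped batches $A$ and $B$ in hand, the similarity matrix and row-wise softmax follow directly (a sketch; random tensors stand in for the normalized-and-warped batches, and the temperature `tau` is an illustrative value):

```python
import torch

N, d = 4, 8
A = torch.randn(N, d)        # stand-in for the warped batch A
B = torch.randn(N, d)        # stand-in for the warped batch B
tau = 0.07                   # temperature (illustrative)

M = (A @ B.T) / tau          # N x N scaled similarity matrix
P = M.softmax(dim=-1)        # each row becomes a probability distribution
print(P.shape)               # torch.Size([4, 4])
```

Each row of `P` sums to 1, with `P[i, j]` the probability that $A_i$ matches $B_j$.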
Compute Cross-Entropy Loss:
The cross-entropy loss for each pair of actual and predicted distributions is calculated as follows:
$$ \text{loss}_a = -\frac{1}{N} \sum_{i=1}^{N} \log(P_{ii}) $$
$$ \text{loss}_b = -\frac{1}{N} \sum_{i=1}^{N} \log(P_{ii}^\top) $$
Here, $P_{ii}^\top$ is the $i$-th diagonal entry of the softmax applied to the rows of $M^\top$, i.e. the probability that the $i$-th element of $b$ correctly matches the $i$-th element of $a$, effectively capturing the b-to-a direction.
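Both directional losses can be read off the diagonals of the two softmax outputs; in PyTorch this is equivalent to cross-entropy against identity labels, since matched pairs sit on the diagonal (a sketch with a stand-in similarity matrix):

```python
import torch
import torch.nn.functional as F

N = 4
M = torch.randn(N, N)            # stand-in similarity matrix
P_ab = M.softmax(dim=-1)         # a -> b probabilities
P_ba = M.T.softmax(dim=-1)       # b -> a probabilities

loss_a = -P_ab.diagonal().log().mean()
loss_b = -P_ba.diagonal().log().mean()

# Equivalent formulation: the target class for row i is simply i.
labels = torch.arange(N)
assert torch.allclose(loss_a, F.cross_entropy(M, labels))
assert torch.allclose(loss_b, F.cross_entropy(M.T, labels))
```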
Combined Loss Expression:
The final combined loss merges the two directional losses:
$$ \text{combined\_loss}(a, b) = \frac{\text{loss}_a + \text{loss}_b}{2} $$
$$ = -\frac{1}{2N} \left( \sum_{i=1}^{N} \log\left(\frac{e^{M_{ii}}}{\sum_{k=1}^{N} e^{M_{ik}}}\right) + \sum_{i=1}^{N} \log\left(\frac{e^{M_{ii}}}{\sum_{k=1}^{N} e^{M_{ki}}}\right) \right) $$
This formal expression encapsulates the entire process: starting from the input embeddings, applying normalization and warping, constructing the similarity matrix, converting similarities to probabilities with softmax, and finally calculating the average cross-entropy loss from both directions of similarity.
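Putting the steps together, one possible end-to-end implementation of the combined loss is sketched below (all hyperparameter values are illustrative, not from the report):

```python
import torch
import torch.nn.functional as F

def combined_loss(a: torch.Tensor, b: torch.Tensor,
                  mu: float = 0.3, sigma: float = 0.2,
                  tau: float = 0.07) -> torch.Tensor:
    # 1. L2-normalize, then warp each element with multiplicative Gaussian noise.
    A = F.normalize(a, p=2, dim=-1) * (1.0 + torch.randn_like(a) * sigma + mu)
    B = F.normalize(b, p=2, dim=-1) * (1.0 + torch.randn_like(b) * sigma + mu)
    # 2. Scaled similarity matrix.
    M = (A @ B.T) / tau
    # 3. Cross-entropy in both directions; matched pairs are on the diagonal.
    labels = torch.arange(a.size(0), device=a.device)
    loss_a = F.cross_entropy(M, labels)
    loss_b = F.cross_entropy(M.T, labels)
    # 4. Average the two directional losses.
    return (loss_a + loss_b) / 2

a, b = torch.randn(16, 128), torch.randn(16, 128)
print(combined_loss(a, b).item() > 0)   # True
```

In training, `a` and `b` would be the paired embeddings of a positive batch (e.g. two views of the same sentence), with every other in-batch pair serving as a negative.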
So far, this novel approach has significantly outperformed both the CLIP-Contrastive (+12% Pearson correlation) and NLI (+36.5% Pearson correlation) techniques under the same conditions. The report also discusses potential improvements, such as explicitly enforcing orthogonality between dissimilar embeddings and designing the model to avoid representations with collinear elements, to further enhance the performance of the sentence embeddings.