Unsupervised Metrics: UScore & Friends

Unsupervised-Metrics is a Python library that allows researchers and developers alike to experiment with state-of-the-art evaluation metrics for machine translation. The focus lies on reference-free, unsupervised metrics, which do not make use of supervision (parallel data, references, human scores) in any way. However, wrappers around some (weakly-)supervised metrics such as XMoverScore and SentSim are provided for convenience.

Installation

If you want to use this project as a library, you can install it as a regular package with pip:

pip install 'git+https://github.com/potamides/unsupervised-metrics.git#egg=metrics'
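The package installs under the import name metrics, the same name used in the usage examples below. A quick sanity check after installation might look like this (just a sketch):

# minimal sanity check: the library is importable as "metrics"
import metrics
from metrics.contrastscore import ContrastScore  # one of the provided metrics
print(metrics.__name__, ContrastScore.__name__)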

If your goal is to run the included experiments (e.g., to replicate the results of UScore), clone the repository and install it in editable mode:

git clone https://github.com/potamides/unsupervised-metrics
pip install -e unsupervised-metrics[experiments]

Support for fast-align is optional. If you want to use it, follow its installation instructions and make sure that the fast_align and atools programs are on your PATH.
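Whether the two binaries are visible can be checked with a small snippet like the following (a sketch using only the Python standard library):

# check that the optional fast-align binaries are discoverable on PATH
from shutil import which

for program in ("fast_align", "atools"):
    print(program, "->", which(program) or "not found on PATH")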

Usage

Train an existing metric

One focus of this library is to make it easy to fine-tune existing state-of-the-art metrics for arbitrary language pairs and domains. A simple example is provided in the code block below. For more involved examples, and for details on how to instantiate a pre-trained metric, take a look at the experiments.

from metrics.contrastscore import ContrastScore
from metrics.utils.dataset import DatasetLoader

src_lang, tgt_lang = "de", "en"

dataset = DatasetLoader(src_lang, tgt_lang)
# instantiate ContrastScore and enable parallel training on multiple GPUs
scorer = ContrastScore(source_language=src_lang, target_language=tgt_lang, parallelize=True)
# train the underlying language model on pseudo-parallel sentence pairs
scorer.train(*dataset.load("monolingual-train"))

# print correlations with human judgments
print("Pearson's r: {}, Spearman's : {}".format(*scorer.correlation(*dataset.load("scored"))))

Create your own metric

This library can also be used as a framework to create new metrics, as demonstrated in the code block below. Existing metrics are defined in the metrics package, which could serve as a source of inspiration.

from metrics.common import CommonScore

class MyOwnMetric(CommonScore):
    def align(self, source_sentences, target_sentences):
        """
        This method receives a list of sentences in the source language and a
        list of sentences in the target language as parameters and returns
        a list of pseudo-aligned sentence pairs.
        """

    def _embed(self, source_sentences, target_sentences):
        """
        This method receives a list of sentences in the source language and a
        list of sentences in the target language as parameters and returns
        their embeddings, inverse document frequencies, tokens and padding
        masks.
        """

    def score(self, source_sentences, target_sentences):
        """
        This method receives a list of sentences in the source language and a
        list of sentences in the target language as parameters, which are
        assumed to be aligned according to their index. For each sentence pair
        a similarity score is computed and the list of scores is returned.
        """

Acknowledgments

This library is based on the following projects:

Citation

If you like/use our work, please cite as follows:

@inproceedings{belouadi-eger-2023-uscore,
    title = "{US}core: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation",
    author = "Belouadi, Jonas  and
      Eger, Steffen",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.27",
    pages = "358--374",
}