Language is a source of great misunderstandings. Translation, even more so. Machine translation... Well, we all know how that goes:

It should be dried vegetables, but a nameless translation service provider knows better...

Now, jokes aside, here is the problem. Words have different meanings in different domains and contexts. It's impossible for translators to know all possible domains and contexts, so they make use of terminology dictionaries.

The WMT 2023 terminology shared task challenges researchers to apply such terminology dictionaries to machine translation, and we answered the call with several distinct systems:

Terminology-aware neural machine translation

The main idea is that we want to teach the neural network model to accept hints from the user about how to translate certain phrases. For example, if given the input:

Was ist Ihr Herkunftsland?

The model would produce:

What is your country of origin?

Which is correct, but we may want to nudge the model towards a less formal translation by annotating the input:

Was ist Ihr Herkunftsland __target__ homeland __done__?

So that the translation changes to:

What is your homeland?

The neural network requires a large number of GERMAN_WORD __target__ ENGLISH_WORD __done__ examples from a terminology dictionary during training, so that it can learn this behaviour. Unfortunately, we often don't have access to a good terminology dictionary, so we build one from our training data!
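To make this concrete, here is a minimal sketch of producing such annotated training inputs; the function name and the naive string replacement are our illustration, not the paper's actual implementation (which operates on aligned tokens):

```python
def annotate_terminology(source, term_pairs):
    """Insert an inline target-language hint after each matched source term.

    term_pairs maps source-language words to desired target-language words.
    Naive whole-string replacement; a real pipeline works on aligned tokens.
    """
    for src_term, tgt_term in term_pairs.items():
        source = source.replace(
            src_term, f"{src_term} __target__ {tgt_term} __done__"
        )
    return source

print(annotate_terminology("Was ist Ihr Herkunftsland?", {"Herkunftsland": "homeland"}))
# Was ist Ihr Herkunftsland __target__ homeland __done__?
```

During training, the model sees the annotated source paired with a target sentence containing the hinted word, and learns to copy the hint through.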

Word-alignment-based terminology dictionary

We use an IBM model to compute word alignments over our parallel training set, then take all word pairs with bijective mappings (that is, each source word corresponds to exactly one target word, and vice versa) and use them as a pseudo-terminology dictionary. During training, we randomly expose our model to 7% of those source-target terminology pairs using the subliminal-message control sequence SRC __target__ TRG __done__ on the source side. We do this using our awesome OpusTrainer [blog] [paper].
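As a sketch, the bijectivity filter looks roughly like this (the function name and data shapes are our assumptions; the alignment links would come from the IBM-model aligner, and OpusTrainer then samples the 7% of matches during training):

```python
from collections import defaultdict

def bijective_pairs(src_tokens, tgt_tokens, links):
    """Keep only 1-to-1 aligned word pairs as pseudo-terminology entries.

    links: (src_idx, tgt_idx) word-alignment pairs for one sentence pair.
    """
    src2tgt, tgt2src = defaultdict(set), defaultdict(set)
    for s, t in links:
        src2tgt[s].add(t)
        tgt2src[t].add(s)
    pairs = []
    for s, targets in src2tgt.items():
        if len(targets) == 1:
            t = next(iter(targets))
            if len(tgt2src[t]) == 1:  # bijective in both directions
                pairs.append((src_tokens[s], tgt_tokens[t]))
    return pairs

# "a" and "b" both align to "x", so only ("c", "y") survives the filter.
print(bijective_pairs(["a", "b", "c"], ["x", "y"], [(0, 0), (1, 0), (2, 1)]))
# [('c', 'y')]
```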

A real time image of me trying to give subliminal messages to my neural network.

Now, this is all good, and it works quite well in practice: we achieve up to a 75% terminology success rate with this approach. But following the terminology constraints is not guaranteed: the model is free to ignore the suggestion. This is why we built two refinement approaches on top:

Negatively constrained translation

Since at inference time we have access to the terminology dictionary, we can detect when a terminology constraint has not been followed: the desired term simply does not appear in the translation. We then use awesome-align to figure out which word was produced instead of our desired terminology word, and translate again, this time forbidding that problematic word from being produced.

If we imagine our machine translation system's vocabulary as a dictionary, negatively constrained decoding amputates select words.
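Conceptually, negatively constrained decoding just masks the forbidden words out of the decoder's candidate set at every step. A toy greedy sketch, with illustrative names (a real system applies the same mask to the logits inside beam search):

```python
def greedy_decode_with_bans(step_scores, banned_ids):
    """Greedy decoding that skips banned vocabulary ids at every step.

    step_scores: per-step lists of model scores over the vocabulary
    banned_ids:  set of vocabulary ids we refuse to produce
    (Toy stand-in for masking logits to -inf inside real beam search.)
    """
    out = []
    for scores in step_scores:
        # pick the best-scoring id that is not "amputated"
        best = max(
            (i for i in range(len(scores)) if i not in banned_ids),
            key=lambda i: scores[i],
        )
        out.append(best)
    return out

# Banning id 1 forces the decoder to fall back to the next-best word.
print(greedy_decode_with_bans([[0.1, 0.9, 0.3]], banned_ids={1}))  # [2]
```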


Terminology refinement with large language models

The negative-constraint approach is quite complicated and convoluted. Since we are already in the era of LLMs, we can instead use the ask-ChatGPT-nicely method: prompt an LLM to refine a given translation so that it satisfies the desired terminology constraints. In fact, while we were at it, we decided to try ditching the neural machine translation system entirely and performing both translation and refinement with ChatGPT.

State of NLP in 2023: all pray to our lord and saviour ChatGPT for solutions!

The process goes like this:

  • Produce a translation (either through our terminology-aware system, with terminology constraints, or by asking ChatGPT)
  • Ask ChatGPT to refine that translation, incorporating the terminology constraints.
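A sketch of what such a refinement prompt might look like; our exact wording is in the paper, so treat this phrasing as illustrative:

```python
def build_refinement_prompt(source, draft, terms):
    """Assemble a post-editing prompt that asks an LLM to enforce terms.

    terms maps source-language terms to required target-language terms.
    Illustrative wording only; not the exact prompt from the paper.
    """
    constraints = "; ".join(f'"{s}" -> "{t}"' for s, t in terms.items())
    return (
        "Refine the translation so that it uses the given terminology.\n"
        f"Source: {source}\n"
        f"Draft translation: {draft}\n"
        f"Terminology: {constraints}\n"
        "Refined translation:"
    )

print(build_refinement_prompt(
    "Was ist Ihr Herkunftsland?",
    "What is your country of origin?",
    {"Herkunftsland": "homeland"},
))
```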

Did it work?

Sort of. We (UEDIN) submitted three separate systems: the terminology-aware base system by itself, one enhanced with ChatGPT refinement, and one enhanced with negative constraints. The terminology-aware system, alone or with ChatGPT refinement, produced the best tradeoff between following terminology and translation quality compared to competing systems, at least according to COMET automatic evaluation:

System            De->En  En->Cs  Zh->En
UEDIN_LLM         0.813   0.869   0.757
UEDIN_TERM        0.809   0.868   0.757
OPUS-CAT          0.790   0.869   0.521
AdaptTerm         0.801   0.841   0.688
UEDIN_CONSTRAINT  0.792   0.835   0.650
LinguaCustodia    0.735   0.834   0.609
BJTU-LB           0.751   —       —
VARCO-MTForceGen  0.715   —       —
COMET-DA22 scores for all systems participating in the shared task, illustrating the tradeoff between terminology faithfulness and translation quality. Higher is better.

Our negatively constrained translation performed rather poorly: just because we prevent the model from making one mistake doesn't mean it won't make another. Using ChatGPT to translate and then perform terminology refinement produced the best translation-quality/terminology tradeoff, but this is no surprise, since it's an unconstrained system. Our terminology-aware translation system did well, losing only to ChatGPT.

We have a lot more details in our paper, so please check it out! We worked really hard on it 🥹! You should also check out the shared task overview paper. Don't forget to cite us!

@inproceedings{bogoychev-chen-2023-terminology,
    title = "Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting",
    author = "Bogoychev, Nikolay and Chen, Pinzhen",
    booktitle = "Proceedings of the Eighth Conference on Machine Translation (WMT)",
    month = dec,
    year = "2023",
    publisher = "Association for Computational Linguistics",
}

Nick and Pinzhen

Image sources: Google, DeviantArt, Pixabay, Pexels