So... The ever-increasing number of papers accepted in the field of NLP and machine translation makes it virtually impossible for people to keep track of ongoing research. Therefore, I have taken to writing blog posts about my failings as a researcher...

And of course, the best place to start is the paper that Pinzhen Chen and I had accepted at the 2021 edition of the Workshop on Insights from Negative Results in NLP.

The premise

Machine translation suffers badly from domain mismatch. That is to say, when we train our model on text from news articles, we can't reasonably expect that same model to translate a medical textbook well.

The textbook would have a completely different distribution of commonly used words. Terminology that is rare in news articles would be very common in such a textbook.

Due to exposure bias during training, it would be difficult for the news-trained model to produce those rare words during decoding. Instead, the model often prefers to hallucinate some common phrase seen in training, such as yesterday's football game...

The best solution

Typically, the way to solve this problem reliably is to fine-tune your translation model on some high-quality parallel in-domain data before putting it out in the wild.

Unfortunately, high-quality parallel in-domain data is seldom available for the rare domain that you might need. Also, fine-tuning has a high computational cost.

The fancy solution

The fancy solutions to this problem involve training in such a way that you diminish the model's bias towards its training data, so that it performs better on out-of-domain datasets.

Unfortunately, such methods (minimum risk training, for example) are computationally expensive and significantly complicate the training pipeline. Also, the results are not as good as those of the best solution.

The stupid shortlist (our) solution

We decided that the way to solve this problem is to limit the output vocabulary of our translation model to a pre-selected vocabulary that better matches the domain in question. We do that using an IBM word alignment model, which would hopefully act as a regulariser.

Once we compute the word alignments, we can extract a translation shortlist of words. This works sort of like a dictionary translation: when we read the source sentence, we limit the neural network to only produce target words that are direct translations of the source words, according to the IBM models. In this way we hoped to prevent the neural network from exhibiting strong exposure bias when confused by out-of-domain text.

The advantage of our method is that it is computationally much cheaper than existing work, and it doesn't require in-domain parallel data.
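To make the shortlist idea a bit more concrete, here is a minimal sketch in Python. It is not the code from the paper: the lexical probabilities are made-up toy numbers standing in for what an IBM-style aligner would estimate, and the names build_shortlist, allowed_vocab and mask_scores are hypothetical helpers invented for this illustration.

```python
def build_shortlist(lex_probs, top_k=2):
    """Keep the top_k most probable target words for every source word.

    lex_probs maps source word -> {target word: probability}. In practice
    these would come from a word aligner; here they are toy values.
    """
    shortlist = {}
    for src, targets in lex_probs.items():
        ranked = sorted(targets.items(), key=lambda kv: kv[1], reverse=True)
        shortlist[src] = [tgt for tgt, _ in ranked[:top_k]]
    return shortlist


def allowed_vocab(source_tokens, shortlist, always_allowed=("</s>", ".")):
    """Union of the shortlisted translations for one source sentence."""
    allowed = set(always_allowed)
    for tok in source_tokens:
        allowed.update(shortlist.get(tok, []))
    return allowed


def mask_scores(scores, allowed):
    """Give every word outside the shortlist a score of -inf."""
    return {w: (s if w in allowed else float("-inf")) for w, s in scores.items()}


# Toy lexical table (assumed values, not real alignment output).
lex_probs = {
    "Pilot": {"pilot": 0.9, "driver": 0.05, "game": 0.01},
    "Kontrolle": {"control": 0.8, "check": 0.15, "football": 0.01},
}
shortlist = build_shortlist(lex_probs, top_k=2)
allowed = allowed_vocab(["Pilot", "Kontrolle"], shortlist)

# Pretend decoder scores for one step: the model "wants" to say football.
scores = {"football": 2.0, "pilot": 1.5, "control": 1.0, "</s>": 0.1}
masked = mask_scores(scores, allowed)
best = max(masked, key=masked.get)
print(best)
```

In this toy example the highest-scoring word, football, is not a shortlisted translation of any source word, so the masked decoder falls back to pilot. A real system would apply the same mask to the output layer of the network at every decoding step.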

The stupid n-best reordering (our) solution

We also decided to approach the problem from a different perspective. Even if the model hallucinates high-scoring translations, maybe if we increase the beam size, some of the translations further down the list will be more adequate. But how do we define the notion of adequacy?

Well, we assumed that all translations in a big n-best list would share some similarity with each other, and that the most adequate one would be the one that is most similar to all the others. Therefore, we produced a big n-best list and scored every candidate translation against every other translation.

The advantage of this method is that it requires no in-domain data. The downside is that it is somewhat slow during translation.
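The reordering idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the post does not say which similarity measure was used, so a simple unigram-overlap F1 stands in for it here, and the n-best list is invented. The function names unigram_f1 and rerank_by_consensus are hypothetical.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Token-overlap F1 between two tokenised sentences (stand-in similarity)."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def rerank_by_consensus(nbest):
    """Sort candidates by their average similarity to all other candidates."""
    tokenised = [cand.split() for cand in nbest]
    scored = []
    for i, hyp in enumerate(tokenised):
        sims = [unigram_f1(hyp, ref) for j, ref in enumerate(tokenised) if j != i]
        avg = sum(sims) / len(sims) if sims else 0.0
        scored.append((avg, nbest[i]))
    # Highest average similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]


# Invented n-best list, in model-score order: the hallucination comes first.
nbest = [
    "yesterday 's football game was great .",
    "the pilot is not in full control .",
    "the pilot is not in control .",
    "its pilot is not in control .",
]
reranked = rerank_by_consensus(nbest)
print(reranked[0])
```

Every candidate is compared with every other candidate, so the cost grows quadratically with the size of the n-best list, which is where the slowdown during translation comes from.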


The shortlist solution showed some promise, delivering an increase of several BLEU points in a constrained low-resource setting. However, when the domain mismatch was too great, or the setting wasn't low-resource, our method showed no improvement.

The reordering method showed consistent improvement in BLEU score, but upon closer examination it turned out that it was exploiting BLEU's length penalty: our method accidentally acted as a regulariser that favoured shorter translations, not more adequate ones.

Yeah, we don't have any results here. Sorry, just excuses.

Why did we fail?

The main reason why our methods fail is, in short, vocabulary mismatch.

In machine translation we normally split rare words into subwords (typically using byte pair encoding). In a domain-mismatched scenario, words get split very often, to the point that individual tokens become nonsensical, as shown in this table.

German | English
sein Pilot hat nicht die volle Kontrolle . | its p@@ il@@ ot is@@ n't in control .
und Z@@ eth@@ rid ? nur einen Strei@@ f@@ sch@@ uss . | and , Z@@ eth@@ rid , just gr@@ aze it .

When dealing with out-of-domain data, we are virtually guaranteed to encounter words that the model has seen very few times or not at all, so a lot of subword splitting takes place. In our experiments with large domain mismatch, the average sentence length nearly doubled after subword splitting, meaning that our lexical shortlist obtained using IBM models could not offer much meaningful information.
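To get a feel for how much a sentence grows under subword splitting, one can compare the number of subword tokens with the number of original words. The sketch below assumes the "@@" continuation marker seen in the table above; merge_bpe and fragmentation_ratio are hypothetical helper names, and this is not the measurement script from the paper.

```python
def merge_bpe(tokens):
    """Undo '@@' subword splitting, e.g. ['p@@', 'il@@', 'ot'] -> ['pilot']."""
    return " ".join(tokens).replace("@@ ", "").split()


def fragmentation_ratio(segmented_sentence):
    """Number of subword tokens divided by the number of original words."""
    tokens = segmented_sentence.split()
    words = merge_bpe(tokens)
    return len(tokens) / len(words)


# First English example from the table above.
example = "its p@@ il@@ ot is@@ n't in control ."
ratio = fragmentation_ratio(example)
print(merge_bpe(example.split()), ratio)
```

For this sentence, 6 words turn into 9 subword tokens, a ratio of 1.5. Pieces such as il@@ or is@@ have no sensible word-level translation, so a word-based shortlist has little to say about them.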

The idea was worth it, but it really doesn't work.

If you want to learn more about it, go read our Insights from Negative Results in NLP 2021 paper.

See ya at the workshop!

