The tutorial ran on 22.11.2021. A recording of the live session is available on YouTube. Also, if you have any questions, don't hesitate to email me.


Download and install sacrebleu:

pip3 install sacrebleu

Getting marian

Download and install Marian:

git clone
cd marian-dev
mkdir build
cd build
cmake .. -DUSE_SENTENCEPIECE=ON -DCOMPILE_CPU=ON -DCOMPILE_CUDA=OFF
make -j4

Note that Marian requires Intel MKL for CPU decoding and CUDA for GPU decoding. Please make sure that you have MKL installed on your local machine before compiling. The package name may differ between distros. For Ubuntu 20.04, please install this. Alternatively, you can do the following to install it directly on Ubuntu 16.04 or newer:

wget -qO- '' | sudo apt-key add -
sudo sh -c 'echo deb all main > /etc/apt/sources.list.d/intel-mkl.list'
sudo apt-get update
sudo apt-get install intel-mkl-64bit-2020.0-088

More details on the whole Marian and MKL issue are listed on Marian's website.
If you have an NVIDIA GPU and CUDA installed on your local system, you can switch the COMPILE_CUDA flag to ON.

Marian should compile cleanly on Linux, WSL and Mac. Mac users need to set up a dev environment and use the built-in Apple Accelerate framework by providing the cmake flag -DUSE_APPLE_ACCELERATE=ON.

Getting the data

Download and extract the test models tarball.

tar -xvf mt_marathon.tar.gz
cd mt_marathon

Note that the server might be slow to respond due to the number of concurrent users, so you could instead try this mirror link:


Now your directory structure should look like this:

$ tree
├── enes.student.tiny11
│   ├──
│   ├──
│   ├──
│   ├──
│   ├── config.batched.shortlisted.8bit.yml
│   ├── config.batched.shortlisted.yml
│   ├── config.batched.yml
│   ├── config.yml
│   ├── lex.s2t.bin
│   ├── lex.s2t.gz
│   ├── model.intgemm8.bin
│   ├── model.npz
│   └── vocab.esen.spm
└── enes.teacher.bigx2
    ├── config.batched.yml
    ├── config.yml
    ├── model1.npz
    ├── model2.npz
    └── vocab.esen.spm

MT decoding

In this section we will gradually try different marian settings and models, starting from the slowest and progressing to the fastest.

The teacher model

The teacher model refers to the highest quality system that you train for any translation task. This is the system trained with "all bells and whistles", but unfortunately it is also quite slow. We will also talk about training it later on.

In our system, the teacher model is an ensemble of 2x transformer-big for English-Spanish.

This is a fairly big model and I do not recommend that you try to run it now during the tutorial. You should definitely run it later on a cluster, just so that you see how much time it takes to translate. I will publish the results here for test runs on my machine. The basic configuration of the teacher model can be found by looking through the config.yml file and the script. I will summarize and explain it here.

$ cat config.yml
relative-paths: true
models: # Selects models for ensembling
  - model1.npz
  - model2.npz
vocabs: # Selects vocabulary for each model
  - vocab.esen.spm
  - vocab.esen.spm
beam-size: 4 # beam search size
normalize: 1 # length normalisation
word-penalty: 0 # length penalty

And the script:

#!/usr/bin/env bash

MARIAN=../../marian-dev/build # Path to your marian installation
SRC=en # Source language
TRG=es # Target language

mkdir -p basic

sacrebleu -t wmt13 -l $SRC-$TRG --echo src > basic/newstest2013.$SRC # get the test set using sacrebleu

# Call marian
echo "### Translating wmt13 $SRC-$TRG on CPU. Extra flags $@"
$MARIAN/marian-decoder -c config.yml $@ \
    --quiet --quiet-translation --log basic/gpu.newstest2013.log \
    -i basic/newstest2013.$SRC -o basic/basic.newstest2013.$TRG

# Print the time it took for translation, and the BLEU scores
tail -n1 basic/gpu.newstest2013.log
sacrebleu -t wmt13 -l $SRC-$TRG < basic/basic.newstest2013.$TRG | tee basic/basic.newstest2013.$TRG.bleu

Baseline system

Run the baseline system, replacing N with the number of real cores your CPU has.

$ cd enes.teacher.bigx2
$ ./ --cpu-threads N
### Translating wmt13 en-es on CPU. Extra flags --cpu-threads 16
[2021-11-19 10:45:40] Total time: 4597.81777s wall
 "name": "BLEU",
 "score": 36.5,

This script and all following scripts will output the translation time and the BLEU score on the test set. I have truncated the output in the interest of readability.

Mini batch size

The problem with the baseline system is that we are only ever translating one sentence at a time, which makes our matrices tall and skinny, which is not something that modern hardware likes. Instead, we're now going to group sentences to be translated together. We do that by augmenting the configuration with the following options:

mini-batch: 16 # Sentences to be translated together
maxi-batch: 100 # Preload 100 mini-batches so that sentences of similar length can be sorted together
maxi-batch-sort: src # Sort by source sentence length
workspace: 4000 # Memory budget per worker.

You could also add those options directly to the Marian command in the script as I have done. Run it with:

$ ./ --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 11:24:43] Total time: 652.14863s wall
 "name": "BLEU",
 "score": 36.5,

As you can see, specifying batching dramatically increases the translation speed.
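To make the effect of mini-batch and maxi-batch-sort more concrete, here is a minimal Python sketch of length-sorted batching. It is not Marian's implementation: the function names and the window parameter (how many sentences we look ahead before sorting) are made up for illustration, and whitespace-separated tokens stand in for subword tokens.

```python
def length_sorted_batches(sentences, mini_batch=16, window=1600):
    """Yield mini-batches of (original_index, sentence) pairs of similar length.

    `window` is a hypothetical look-ahead size in sentences, not a Marian option.
    """
    indexed = list(enumerate(sentences))
    for start in range(0, len(indexed), window):
        chunk = indexed[start:start + window]
        # Sort the look-ahead window by source length (in whitespace tokens).
        chunk.sort(key=lambda pair: len(pair[1].split()))
        for i in range(0, len(chunk), mini_batch):
            yield chunk[i:i + mini_batch]


def padding_waste(batch):
    """Count padded positions if every sentence is padded to the longest one."""
    lengths = [len(sentence.split()) for _, sentence in batch]
    return max(lengths) * len(lengths) - sum(lengths)
```

With four sentences of length 4, 1, 3 and 2 and a mini-batch of 2, batching in the original order pads 4 positions, whereas sorting first pads only 2. The original indices are kept so that the translations can be put back in input order afterwards.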

The student model

The student model is what machine translation providers typically run on their production services. Those models are highly optimised for speed, with architectures that typically replace the heavyweight attention in the decoder with something faster, such as the Simpler Simple Recurrent Unit (SSRU).

The student model is trained on the output of the teacher model (or models, in the case of ensembling). More details about the training and the exact specifics of the architecture will be given later. There are a couple of crucial things we need to know about decoding with student models:

  • Beam search is unnecessary! Quality is the same regardless of the beam size.
  • No need for ensembling either.
  • The student model is tiny compared to the teacher: it has more than 10 times fewer parameters!
  • For more information, check this paper.

All of that allows the student to produce translations much faster, at a small cost of BLEU.

The config of a basic student model is similar to that of the teacher, except that the beam size is set to one. Furthermore, since the beam size is one, we don't actually need to compute the expensive softmax and can instead just take an argmax:

$ cd enes.student.tiny11/
$ cat config.yml 
relative-paths: true
models:
  - model.npz
vocabs:
  - vocab.esen.spm
  - vocab.esen.spm
beam-size: 1 # Beam size to one
normalize: 1.0
word-penalty: 0
max-length-factor: 2.0 # The target sentence shouldn't be longer than 2x the source sentence
skip-cost: true # Do not compute softmax but instead take argmax


To run the baseline system, just do:

$ cd ../enes.student.tiny11
$ ./ --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:03:11] Total time: 84.66556s wall
 "name": "BLEU",
 "score": 35.2,

Mini batch size

Just like in the teacher case, we can significantly improve translation time by specifying a larger mini-batch size:

$ ./ --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:03:11] Total time: 11.66556s wall
 "name": "BLEU",
 "score": 35.2,


Further improvements in the translation speed may be achieved by avoiding the largest multiplication at the output layer, by supplying the model with a lexical shortlist. The lexical shortlist filters the output layer to only contain words that are deemed to be likely translations of the input sentence, potentially reducing its size from 30k to about 1k. We will walk through the construction of a lexical shortlist, but you can see the speed improvement from the example script:

$ ./ --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:05:03] Total time: 8.41856s wall
 "name": "BLEU",
 "score": 35.2,

The only change we made is the inclusion of the shortlist parameter inside config.batched.shortlisted.yml:

shortlist:
    - lex.s2t.bin
    - false

The lexical shortlist is similar to a dictionary containing frequently associated source and target words. We will look at how it's trained later in this tutorial.
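As a rough illustration of the idea (not of Marian's actual data structures), the Python sketch below builds a shortlist for one sentence and then picks the best output word only among the shortlisted candidates. The lexicon, the list of frequent words and the scores are toy values, and both function names are hypothetical.

```python
def build_shortlist(source_tokens, lexicon, frequent_words):
    """Union of always-allowed frequent words and likely translations of the input."""
    allowed = set(frequent_words)
    for token in source_tokens:
        allowed.update(lexicon.get(token, ()))
    return allowed


def restricted_argmax(scores, allowed):
    """Pick the highest-scoring word, looking only at shortlisted words."""
    return max(allowed, key=lambda word: scores.get(word, float("-inf")))


# Toy data, purely for illustration.
lexicon = {"house": ["casa", "hogar"], "red": ["roja", "rojo"]}
frequent = ["la", "de"]
shortlist = build_shortlist(["red", "house"], lexicon, frequent)
scores = {"casa": 2.0, "perro": 5.0, "la": 1.0}
best = restricted_argmax(scores, shortlist)
```

In this toy example the model scores "perro" highest, but it is not in the shortlist, so "casa" is chosen. That is exactly the trade-off: fewer candidates to score, at the risk of excluding a word the model wanted.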


To further improve performance, we can perform the neural network inference in a lower precision numerical format, such as 8-bit integers, which runs much faster on CPUs than plain old floats. To do so, we must first convert our model to 8-bit integer format:

$MARIAN/marian-conv -f model.npz -t model.intgemm8.bin -g intgemm8

and then decode with it:

$ ./ --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:05:58] Total time: 7.12242s wall
 "name": "BLEU",
 "score": 35.0,

That's all exciting, right? Now let's dive deeper and see how we can get to those student models.

Results summary

Here is a summary of the results from my runs:

System (16 threads, 3000 sentences)       Time     BLEU
Teacher basic                             4597s    36.5
Teacher batched                           652s     36.5
Student basic                             84s      35.2
Student batched                           11s      35.2
Student batched shortlisted               8s       35.2
Student batched quantised shortlisted     7.1s     35.0

Unfortunately, as the systems get faster, the runtime differences between them get muddled by the initialisation overhead. To better show the effect of all our options on the runtime, I repeated all student experiments using 1 thread:

System (1 thread, 3000 sentences)         Time     BLEU
Student basic                             189s     35.2
Student batched                           38s      35.2
Student batched shortlisted               27s      35.2
Student batched quantised shortlisted     21s      35.0

We get a huge gain when going from unbatched to batched translation, regardless of the model size. We gain about 30% efficiency from shortlisting and then another 23% from quantising to 8-bit integers. We do sustain a small drop in BLEU along the way, though.
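For reference, here is how such percentages can be derived from the single-thread timings. This is just arithmetic on the rounded numbers from the table, so the results (about 29% and 22%) differ slightly from the figures quoted in the text; the helper names are mine.

```python
# Single-thread wall times in seconds, rounded, as reported in the table.
single_thread_seconds = {
    "basic": 189,
    "batched": 38,
    "batched shortlisted": 27,
    "batched quantised shortlisted": 21,
}


def relative_saving(before, after):
    """Fraction of runtime saved when going from `before` to `after`."""
    return (before - after) / before


def speedup(before, after):
    """How many times faster `after` is compared to `before`."""
    return before / after


t = single_thread_seconds
batching_speedup = speedup(t["basic"], t["batched"])
shortlist_saving = relative_saving(t["batched"], t["batched shortlisted"])
quantise_saving = relative_saving(
    t["batched shortlisted"], t["batched quantised shortlisted"]
)
```

Batching alone makes the student roughly 5 times faster on one thread, which is by far the largest single gain.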

How to train your own efficient model

In this part of the tutorial we will go through the steps necessary for preparing an efficient machine translation system. Due to time and resource constraints (AKA training taking for-fuckin-ever), we will only walk through the steps rather than run them live.

Training a good teacher

A good student cannot possibly hope to learn without the help of a marvelous teacher.

Clean your data!

Before starting training, do your customary data cleaning. Most people use custom scripts for this, tailored to the specific language pair, but the bare minimum can be achieved using good old Moses clean-corpus-n.perl:

$ mosesdecoder/scripts/training/clean-corpus-n.perl data/corpus.uncleaned $SRC $TRG data/corpus.tok 1 100

In this particular example, sentences with length of less than 1 token or above 100 tokens are excluded. Modify this accordingly.
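If you prefer to see the length rule spelled out, here is a minimal Python sketch of it. It only mimics the minimum and maximum length check described above on whitespace tokens; it is not a reimplementation of the Moses script, and the function names are my own.

```python
def keep_pair(src_line, trg_line, min_len=1, max_len=100):
    """Keep a sentence pair only if both sides have min_len..max_len tokens."""
    for line in (src_line, trg_line):
        n_tokens = len(line.split())
        if n_tokens < min_len or n_tokens > max_len:
            return False
    return True


def clean_corpus(pairs, min_len=1, max_len=100):
    """Filter an iterable of (source, target) pairs by length."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, min_len, max_len)]
```

For example, a pair with an empty source side or a 101-token source side is dropped, while an ordinary short pair is kept.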

Train a reverse model for back-translation

You can skip this step if you do not have any monolingual data.

In order to train a high quality system, we usually take advantage of the available monolingual resources in the target language. This is done by training a translation model in the reverse direction and then translating the monolingual corpora with it. For more details, please check this paper. A typical marian configuration for the purpose would look like this:

devices: 0 1 2 3
workspace: 12000
log: model/train.log
model: model/model.npz
train-sets:
  - train.clean.en
seed: 1111
vocabs:
  - model/vocab.deen.spm
  - model/vocab.deen.spm
task: transformer-base
dim-vocabs:
  - 32000
  - 32000
shuffle-in-ram: true
# Validation set options
valid-sets:
  - dev.en
valid-freq: 5000
valid-metrics:
  - ce-mean-words
  - perplexity
  - bleu-detok
disp-freq: 1000
early-stopping: 10
beam-size: 6
normalize: 0.6
max-length-factor: 3
maxi-batch-sort: src
mini-batch-fit: true
valid-mini-batch: 8
valid-max-length: 100
valid-translation-output: model/valid.bpe.en.output
keep-best: true
valid-log: model/valid.log

This configuration assumes a 16 GB GPU. If you have a smaller GPU, please adjust down the workspace accordingly.

After the model is trained, translate your monolingual data using the output-sampling option, which has been shown to produce better results. Furthermore, monolingual data might be dirty, so make sure you set the max-length and max-length-crop options. You should append those to the configuration file used for translation:

output-sampling: true
max-length: 100
max-length-crop: true

Some works suggest that synthetic data should be tagged when appending it to the true data. How much backtranslated data to use is an open question. Generally the more the better, although you may want to balance/upweight/upsample the true data if there is too little of it compared to the backtranslated data. Refer to Facebook's wisdom.
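A minimal sketch of tagging and upsampling could look like the Python below. The tag string "<BT>", the function name and the upsample factor are all assumptions for illustration; whichever tag you pick, your vocabulary needs to treat it as a single token.

```python
def mix_corpora(true_pairs, backtranslated_pairs, tag="<BT>", upsample=1):
    """Combine true and synthetic data into one training corpus.

    True pairs are repeated `upsample` times; every backtranslated pair gets
    a (hypothetical) tag prepended to its synthetic source side.
    """
    mixed = list(true_pairs) * upsample
    mixed.extend((f"{tag} {src}", trg) for src, trg in backtranslated_pairs)
    return mixed
```

With one true pair, two backtranslated pairs and upsample=2, the result contains the true pair twice followed by the two tagged synthetic pairs. Remember to shuffle the mixed corpus before training (or let Marian do it).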

Train the teacher

Once you have your synthetic backtranslated data (optional) and your parallel corpora, you can proceed to training your teacher model.

Configuration choice

You could train your teacher with one of two configuration prefixes: either task: transformer-base or task: transformer-big. As a rule of thumb, if you have a high-resource language pair (more than 5M sentence pairs), you will likely see gains from using transformer-big.

  • For transformer-base you can reuse the configuration setting posted earlier.
  • For transformer-big, adjust down the workspace to 10000 and add the option optimizer-delay: 2 to the configuration.

No need to adjust any other configuration settings, as the task alias takes care of assigning the rest of the model hyperparameters.


One very easy way to improve the translation quality of the teacher is to build an ensemble of systems that produce translations together. This is done by training identical systems, initialised with different random seeds. The more systems, the better, although returns are diminishing.

For example, if we want to have an ensemble of two systems, we need two separate configuration files for training, in which the seed parameter is different. Configuration one would have seed: 1111, whereas configuration two would have seed: 2222. At decoding time, don't forget to load all produced models as shown earlier in the tutorial.

Training the student

The student model is trained to approximate the teacher distribution. In this manner we can achieve translation quality approximating that of a teacher model, at just a fraction of the computational cost. For more information, check this paper. An up-to-date guide with code and scripts for training student models can be found on GitHub.

Producing the distilled training data

The training data for the student is produced by translating the complete training set with your teacher ensemble. This is a cumbersome task, because the teacher model is heavyweight. I recommend that you use the same settings as the ones discussed earlier in this tutorial.

It is possible to quantise the teacher model(s) before translating, but depending on the language pair and configuration, this might lead to a substantial drop in the quality of the translated data. More details later.

If you are decoding on a fairly recent NVIDIA GPU, feel free to add fp16: true to the decoder configuration in order to use 16-bit float decoding.

More details about these will be given later in the tutorial.

Producing word alignments

Before we can proceed to training the student model, we need to produce IBM model alignments using fast-align. The alignments are needed both for training the student with guided alignment and for building the lexical shortlist.

The script that takes care of everything is located on GitHub. You need to edit it to set the locations of the corpora and the SPM-trained vocabulary.

Training the student

Now that the teacher model(s) have translated the full training set, we can use that as the input to the student model. The student model is trained on the original source text and the synthetic, translated target text. In our experiments so far, the student models were trained with this configuration, dubbed tiny:

$ cat config.yml
dec-cell: ssru
dec-cell-base-depth: 2
dec-cell-high-depth: 1
dec-depth: 2
dim-emb: 256
enc-cell: gru
enc-cell-depth: 1
enc-depth: 6
enc-type: bidirectional
tied-embeddings-all: true
transformer-decoder-autoreg: rnn
transformer-dim-ffn: 1536
transformer-ffn-activation: relu
transformer-ffn-depth: 2
transformer-guided-alignment-layer: last
transformer-heads: 8
transformer-no-projection: false
transformer-postprocess: dan
transformer-postprocess-emb: d
transformer-preprocess: ""
transformer-train-position-embeddings: false
type: transformer

and the following training script:

#!/bin/bash -v

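# Set GPUs.
GPUS="0 1 2 3"

# Path to your Marian build and the language pair; adjust these to your setup.
MARIAN=../../marian-dev/build
SRC=en
TRG=es

# Add symbolic links to the training files.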
test -e corpus.$SRC.gz || exit 1    # e.g. ../../data/train.en.gz
test -e corpus.$TRG.gz || exit 1    # e.g. ../../data/
test -e corpus.aln.gz  || exit 1    # e.g. ../../alignment/corpus.aln.gz
test -e lex.s2t.gz     || exit 1    # e.g. ../../alignment/lex.s2t.pruned.gz
test -e vocab.spm      || exit 1    # e.g. ../../data/vocab.spm

# Validation set with original source and target sentences (not distilled).
test -e devset.$SRC || exit 1
test -e devset.$TRG || exit 1

mkdir -p tmp

$MARIAN/marian \
    --model model.npz -c student.tiny11.yml \
    --train-sets corpus.{$SRC,$TRG}.gz -T ./tmp --shuffle-in-ram \
    --guided-alignment corpus.aln.gz \
    --vocabs vocab.spm vocab.spm --dim-vocabs 32000 32000 \
    --max-length 200 \
    --exponential-smoothing \
    --mini-batch-fit -w 9000 --mini-batch 1000 --maxi-batch 1000 --devices $GPUS --sync-sgd --optimizer-delay 2 \
    --learn-rate 0.0003 --lr-report --lr-warmup 16000 --lr-decay-inv-sqrt 32000 \
    --cost-type ce-mean-words \
    --optimizer-params 0.9 0.98 1e-09 --clip-norm 0 \
    --valid-freq 5000 --save-freq 5000 --disp-freq 1000 --disp-first 10 \
    --valid-metrics bleu-detok ce-mean-words \
    --valid-sets devset.{$SRC,$TRG} --valid-translation-output devset.out --quiet-translation \
    --valid-mini-batch 64 --beam-size 1 --normalize 1 \
    --early-stopping 20 \
    --overwrite --keep-best \
    --log train.log --valid-log valid.log

The script takes care of checking for all the necessary files and will fail if they are missing.

There are other possible student configurations. We have a base configuration prefix on GitHub, which is slower than the tiny one shown above but delivers better translation quality. Experiment with different configurations until you find something acceptable.

Note that the student model will take forever to train. You want to really overfit to the outputs of the teacher, so going for many epochs over the data is necessary. You could see the BLEU score improvement stalling for many consecutive validation steps before improving again.

Quantisation fine-tuning

Finally, we will talk about quantisation fine-tuning. When we quantise the model to a lower precision, we damage it. The model might not do well with that damage right out of the box, so instead we are going to fine-tune it by training very briefly with a GEMM that mimics the damage from quantisation. To do this, add the following to the configuration file:

quantize-bits: 8

Fine-tuning is really fast. The model's quality is going to start going down after a few thousand mini-batches. Make sure you validate frequently so that you don't miss the sweet spot! (valid-freq: 200 would be good.)
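To get a feeling for the kind of damage we are talking about, here is a small Python sketch that quantises a few weights to 8-bit integers and back. It uses a generic symmetric scheme with a single max-abs scale, which I picked for illustration; Marian's actual quantisation may differ in its details.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one max-abs scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map the integers back to (damaged) floats."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

Here the small weight 0.003 is rounded all the way down to 0, while the larger ones survive almost unchanged. Fine-tuning with the quantisation in the loop lets the model adapt to this kind of rounding.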


A significant amount of compute time is required to train an efficient student model, so we can't do that for the duration of this tutorial. However, we can show you what you can achieve in practice! Take a look at our blazing fast, privacy-focused, cross-platform translation app translateLocally, which is only made possible by utilising all of the techniques outlined above.



Optimising for speed doesn't come without caveats. Translation quality drops to a certain extent. The drop is not uniform across models, so test before you deploy!

  • Quantisation affects different models differently. As a rule of thumb, the smaller the student is, the more it loses from quantisation, but very large teacher models have at times been shown to work quite poorly with quantisation. Always test before you deploy!
  • Lexical shortlisting is known to cause quality issues when used with very small mini-batch size. Proceed with caution when translating single sentences and using a lexical shortlist. This can somewhat be ameliorated by letting the shortlist be more conservative and thus increasing the number of vocabulary items it lets through during construction: $MARIAN/marian-conv --shortlist lex.s2t.gz 100 100 0 --vocabs vocab.esen.spm vocab.esen.spm -d lex.s2t.bin. 100 100 means take top 100 words from the vocabulary, and top 100 translations of each word according to the shortlist. Increasing this further will slow down translation, but improve quality.
  • Humans do notice the difference between teacher and student models. Also, METEOR scores, which have been shown to correlate better with human judgements, favour teacher models. There is no free performance gain.

Advanced topics

In this subsection we will talk about advanced topics that you may be interested in if you are in the business of providing commercial machine translation systems or want to do a PhD in the subfield of machine translation.

Hyperparameter tuning

Carefully tune your hyperparameters when decoding! Different combinations of models and hardware behave differently. More specifically, mini-batch: 16 is not a hard and fast rule. Depending on the CPU you have, the amount of cache it has and the amount of system memory (RAM) you have, other settings might be optimal. Experiment with the mini-batch, maxi-batch and workspace parameters until you arrive at an optimal solution for your specific configuration.
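One simple way to organise such experiments is a small grid search. The Python sketch below assumes you supply your own measure function that runs the decoder with the given settings and returns the wall time in seconds; that function and the name fastest_setting are hypothetical and not part of Marian.

```python
from itertools import product


def fastest_setting(measure, mini_batches, workspaces):
    """Try every (mini-batch, workspace) combination and return the fastest.

    `measure(mini_batch, workspace)` is a user-supplied callable that runs a
    decoding job and returns its wall time in seconds.
    """
    best = None
    for mini_batch, workspace in product(mini_batches, workspaces):
        seconds = measure(mini_batch, workspace)
        if best is None or seconds < best[0]:
            best = (seconds, {"mini-batch": mini_batch, "workspace": workspace})
    return best
```

In practice you would run each setting a few times and on a representative input, since a single short run is dominated by model loading time.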

Efficient GPU decoding

Efficient GPU decoding differs from efficient CPU decoding in several key aspects:

  • GPUs need a much larger mini-batch size to reach their full potential. While CPU decoding performance stops scaling at a mini-batch of around 16-24, GPU decoding practically scales until you run out of memory. To optimise for speed on the GPU, you need to push both the workspace and the mini-batch size to the limits of the device memory. For 24 GB GPUs like the 3090, mini-batch: 768 and workspace: 18000 are a good place to start your binary search.
  • Shortlists don't improve translation speed on the GPU. Which is great. Just ignore the shortlist.
  • We have experimented with 8-bit integer decoding on the GPU, but we failed to get any performance gains compared to just using float16 decoding. To use float16 decoding, just set fp16: true in the decoder configuration. You should get about a 20% speed improvement compared to fp32 decoding, and the ability to use a much larger mini-batch size.

GPUs are in general really fast, even when compared to decoding on multiple CPUs. Running against the batched student from the previous section:

$ ./ --cpu-threads 12
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 12
[2021-11-22 17:15:25] Total time: 14.46366s wall
 "name": "BLEU",
 "score": 35.2,

Running on 1 GPU (the -d 0 flag specifies that we should run on GPU 0):

$ ./ -d 0
### Translating wmt13 en-es on the CPU. Extra flags -d 0
[2021-11-22 17:17:12] Total time: 3.61907s wall
 "name": "BLEU",
 "score": 35.2,

Providing better hyperparameters for the GPU:

$ ./ -d 0 --workspace 18000 --mini-batch 768 --fp16 --maxi-batch 3000
### Translating wmt13 en-es on the CPU. Extra flags -d 0 --workspace 18000 --mini-batch 768 --fp16 --maxi-batch 3000
[2021-11-22 17:17:51] Total time: 1.44297s wall
 "name": "BLEU",
 "score": 35.3,

A fully optimised GPU is more than 10X faster on this very small example. If we increase the size of the dataset, the GPU will easily be 100X+ faster. In our case the GPU is an A100 and the CPU is a Ryzen 2 EPYC processor.

Advanced quantisation options

Marian supports two types of integer backends, fbgemm and intgemm, which deliver different performance depending on the model type and architecture.

Intgemm is hardware agnostic and has dedicated code paths for both very old devices (SSSE3) and very new devices (AVX512VNNI). Fbgemm, on the other hand, only supports AVX2 and AVX512. Intgemm's model format is also hardware agnostic, whereas in order to use Fbgemm you need to know in advance what hardware the model is going to run on. marian-conv --help will give you more details.

Finally, both intgemm and fbgemm have an int16 format, which is not as fast as the int8 ones, but could potentially work better in cases where 8-bit quantisation damages translation quality too much.

Using speed oriented fork of Marian

This tutorial describes what can be achieved with marian-dev master alone. There exists, however, an experimental version of Marian focused on speed as part of the Bergamot project. Instructions for using it, together with a tutorial for creating models, can be found on GitHub. If you are interested in running a GPU fork with experimental NVIDIA patches, you can also find it on GitHub.

Research directions

In this section we will briefly go over current research directions for efficient MT. This is all bleeding edge stuff that I have seen in papers, but not in practice.

Deep Encoder/Shallow decoder

The most computationally heavy part of machine translation inference is the decoder, because this is where the autoregressive part of the computation happens, whereas the computation in the encoder happens only once. Based on that, it has been suggested that one can change the standard 6-6 architecture to a 12-1 without a loss of accuracy, while significantly increasing translation speed. You should experiment to discover better student and teacher architectures!
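A back-of-the-envelope way to see why this helps is to count how many times a layer has to be run one after another. The Python sketch below is a deliberately crude cost model of my own: it assumes each encoder layer runs once per sentence and each decoder layer runs once per generated token, and it ignores layer width, attention cost and batching.

```python
def sequential_layer_passes(enc_layers, dec_layers, target_len):
    """Crude cost model: the encoder runs once, the decoder once per output token."""
    return enc_layers + dec_layers * target_len


# Standard 6-6 versus deep-encoder/shallow-decoder 12-1, for a 20-token output.
standard = sequential_layer_passes(6, 6, 20)
deep_shallow = sequential_layer_passes(12, 1, 20)
```

Under these assumptions the 6-6 model needs 126 sequential layer passes for a 20-token output and the 12-1 model only 32, even though it has more layers in total. Real speedups depend on much more than this count, so treat it as intuition only.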

Wider models, not deeper

Once you get into the domain of very high resource language pairs (50M+ sentences), increasing the number of parameters of your neural network architecture once again becomes relevant. Experience has shown that increasing the width of the model (meaning the dimension of the embeddings/rnn/hidden layer/attention) is more stable compared to increasing the depth of the model. Very deep neural networks sometimes are unable to train at all, but very wide neural networks don't seem to suffer from the same shortcomings. If you have the data, and the compute, you should go wide, not deep!

KNN based shortlisting

As we have shown, lexical shortlisting provides a noticeable gain in inference speed, but it may lead to quality issues. IBM models are bad at capturing excessive subword segmentation or idiomatic expressions. As a result, lexical shortlists produced with IBM models favour more literal translations and struggle with cases where words are split into many subwords. In order to alleviate this issue, the community is exploring KNN based shortlisting (refer to this and this).

Marian already supports this via the option --output-approx-knn, although the feature is still considered experimental. For starters, in order to use this, the model trained must NOT have a bias at the output layer, so the configuration option --output-omit-bias must be specified at training time.


Training a teacher model, translating the training set, and then training a student model afterwards is very demanding in terms of computational resources. An alternative approach would be to prune the parameters of the teacher model as it is trained, reducing the model size to something of similar magnitude to a student. Unfortunately, so far student models achieve better Pareto speed/quality tradeoffs than pruned models, but research is ongoing. Check out existing work on pruning whole models or just attention.

Helpful reading

Congratulations on getting this far in the tutorial! That means you are really interested in getting efficient machine translation to work. I will give a list of papers that might be useful starting points for people wanting to get more in depth into the efficient MT work.

Thank you everybody for participating in this tutorial! I hope it was helpful! Special thanks to Roman Grundkiewicz for proofreading and adding suggestions to the tutorial.