<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[XapaJIaMnu]]></title><description><![CDATA[Languages, Characters and Speed]]></description><link>https://nbogoychev.com/</link><image><url>https://nbogoychev.com/favicon.png</url><title>XapaJIaMnu</title><link>https://nbogoychev.com/</link></image><generator>Ghost 3.40</generator><lastBuildDate>Tue, 14 Apr 2026 12:42:23 GMT</lastBuildDate><atom:link href="https://nbogoychev.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Hanzi of the day! 盲]]></title><description><![CDATA[In the land of the dead eyes, the one eyed man is king!]]></description><link>https://nbogoychev.com/hanzi-of-the-day-17/</link><guid isPermaLink="false">69da452b8ce6010590c752d6</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 12 Apr 2026 08:43:00 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2026/04/cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2026/04/cover.jpg" alt="Hanzi of the day! 盲"><p>Hi everyone!</p>
<p>As my Chinese improves (or at least my <strong>confidence</strong> in it, if not my actual skills), I chance upon lots of new characters, and I try to create mnemonics for myself to learn them.</p>
<p>Today's character <a href="https://en.wiktionary.org/wiki/盲" target="_blank">盲</a> <em>máng</em> consists of the character <a href="https://en.wiktionary.org/wiki/亡" target="_blank">亡</a> (meaning lose, flee, <strong>die</strong> and <strong>deceased</strong>) on top of an eye <a href="https://en.wiktionary.org/wiki/目" target="_blank">目</a>, making up the lovely combination of <em>death of the eye</em>!</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2026/04/wiktionary_small.png" alt="Hanzi of the day! 盲">
<center><font size="-3">The ancient forms of the glyph show the same basic structure, except sometimes death appears on the side, as opposed to on top.</font></center>
<p>There are actually a number of characters which use this particular formula, notably:</p>
<ul>
<li><a href="https://en.wiktionary.org/wiki/忘" target="_blank">忘</a> <strong>to forget</strong>, the <a href="https://en.wiktionary.org/wiki/亡" target="_blank">death</a> of the <a href="https://en.wiktionary.org/wiki/心" target="_blank">heart</a>. The astute reader will note that I have written about it on this very <a href="https://nbogoychev.com/hanzi-of-the-day-6/" target="_blank">blog</a>!</li>
<li><a href="https://en.wiktionary.org/wiki/妄" target="_blank">妄</a> <strong>preposterous</strong>, the <a href="https://en.wiktionary.org/wiki/亡" target="_blank">death</a> of a <a href="https://en.wiktionary.org/wiki/女" target="_blank">woman</a>. That one is funny, albeit it doesn't make any sense semantically. Here 亡 serves as a phonetic component, because its pronunciation <em>wàng</em> is almost the same as the pronunciation of 妄 <em>wáng</em>.</li>
</ul>
<p>Incredibly, native Chinese speakers are seldom aware of these connections (unless they are Chinese scholars) and the reason is that first and second language Chinese learners acquire character knowledge in a very different way:</p>
<p>Native Chinese speakers learn to write almost exclusively as children. They are in no rush to learn the 5000 or so characters required; they take their time over 12 years of study:</p>
<ul>
<li>Children repeatedly write characters many times over, in a specific stroke order. This primes them for a strong <a href="https://en.wikipedia.org/wiki/Enactment_effect" target="_blank">enactment effect</a>, where using the correct writing sequence helps memorise the characters. There are <a href="https://www.sciencedirect.com/science/article/abs/pii/S0346251X24000770" target="_blank">papers</a> <a href="https://www.sciencedirect.com/science/article/abs/pii/S0346251X17310849" target="_blank">about</a> it, albeit mostly studying second language learners.</li>
<li>First language speakers do not need to understand the glyph origin in order to memorise it. Logogram history and logic, while very interesting (to me), are not necessary for achieving literacy.</li>
</ul>
<img align="middle" width="40%" src="https://nbogoychev.com/content/images/2026/04/children.png" alt="Hanzi of the day! 盲">
<center><font size="-3">Example child hanzi exercise book from Taiwan. After completing thousands of those, memorisation is inevitable...</font></center>
<p>Furthermore, there's an inherent difference between acquiring a logographic writing system as an adult and as a child:</p>
<p>Adults usually don't have the same free time to allocate to the task as children. Pesky things such as jobs, chores, <em>raising their own kids</em>, etc, do tend to get in the way of the pursuit of knowledge. If I had the time, and 12 years to spare, I could learn to write Chinese in the same way as school children. I might even learn a tad faster, since adults in general have better pattern recognition skills.</p>
<p>Most adults, especially nowadays, have limited <strong>scope</strong> when it comes to learning to read and write Chinese as a second language. The digital age has ensured that we mostly need to be able to read: we can use phonetic input such as <em>pinyin</em> or <em>zhuyin</em> to write Chinese. Strictly speaking, recognising and reading Chinese characters is much more important than the ability to handwrite them. At least that is true for hobbyists like me; I am sure second language Chinese scholars have impeccable writing skills.</p>
<p>Second language Chinese learners are thus incentivised to efficiently allocate their time in a way that maximises their reading skills, not their writing skills. This inevitably means poorer writing and heavier use of mnemonics, glyph origins trivia and phonetic information as tools that aid memorisation.</p>
<p>I personally have found that writing is incredibly helpful when it comes to acquiring reading proficiency, but inevitably I have to balance the time I spend writing characters against the time I spend memorising new vocabulary.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2026/04/ignorance.jpg" alt="Hanzi of the day! 盲">
<center><font size="-1">Death on top of a book... Surely that means ignorance. I should submit that one.</font></center>
<p>That's it from me! I will <strong>see</strong> you all next time!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/illustrations/justice-equity-fairness-law-court-8845222/" target="_blank">pixabay</a> <a href="https://en.wiktionary.org/wiki/%E7%9B%B2" target="_blank">wiktionary</a> <a href="https://eword.ntpc.edu.tw/" target="_blank">eword.ntpc.edu.tw</a> <a href="https://www.pexels.com/photo/mysterious-skull-and-ancient-book-in-dark-setting-36232045/" target="_blank">pexels</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 聞]]></title><description><![CDATA[The smells of spring are all over!]]></description><link>https://nbogoychev.com/hanzi-of-the-day-16/</link><guid isPermaLink="false">699ed81c8ce6010590c751ce</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Wed, 25 Feb 2026 13:37:19 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2026/02/cover.JPG" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2026/02/cover.JPG" alt="Hanzi of the day! 聞"><p>Hello Everybody!</p>
<p>Today I chanced upon a very curious character <a href="https://en.wiktionary.org/wiki/聞" target="_blank">聞</a> and I would like you all to <em>hear</em> about it!</p>
<p><a href="https://en.wiktionary.org/wiki/聞" target="_blank">聞</a> <em>wén</em> consists of an ear <a href="https://en.wiktionary.org/wiki/耳" target="_blank">耳</a> inside of a door <a href="https://en.wiktionary.org/wiki/門" target="_blank">門</a>. Surely the meaning of this character has something to do with using the ears. I know a corresponding character that is a mouth inside a door <a href="https://en.wiktionary.org/wiki/問" target="_blank">問</a>, which means to ask a question. Using the same logic, <a href="https://en.wiktionary.org/wiki/聞" target="_blank">聞</a> should mean to hear. Yes! Well, almost. It means <strong>to smell</strong>.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2026/02/wtf.jpeg" alt="Hanzi of the day! 聞">
<center><font size="-2">My reaction when learning this hanzi</font></center>
<p>I immediately started investigating! There must be a reason for this:</p>
<ul>
<li>The meaning in classical Chinese was indeed <strong>to hear</strong>, so it is one of those hanzi that changed its meaning through the ages.</li>
<li>In modern Chinese it still maintains the <strong>hear</strong> meaning in some words, such as:
<ul>
<li>新聞 <strong>news</strong>, literally translating as new hearing (I guess new smells nowadays).</li>
<li>聞名 <strong>famous</strong>, literally, a name that is heard (I guess smelled)</li>
<li>The idiom 聞所未聞 <strong>unheard of</strong>.</li>
<li>Another idiom 聞風喪膽 <strong>to be terror-stricken at the sound of the news</strong>. Lol, we all feel that one.</li>
<li>Many, many other idioms.</li>
</ul>
</li>
<li>The meaning in modern Japanese is indeed still <strong>to hear</strong>.</li>
<li>The oracle bone script and bronze scripts also depict it as hearing:</li>
</ul>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2026/02/oracle_bone.png" alt="Hanzi of the day! 聞">
<center><font size="-2">A man who is covering his mouth and craning their ears to listen.</font></center>
<p>So where does the meaning of <strong>to smell</strong> come from?</p>
<p>There are some vague notions that <strong>smelling</strong> and <strong>hearing</strong> are close to each other, as they are both <strong>senses</strong>, so the meaning could jump around. Some languages group senses together as words that share the same root, or as polysemous words. This is indeed the case with some Austronesian or Tibeto-Burman languages, and maybe that's what happened in Chinese?</p>
<p>Or it could be due to phonetic changes, or influence from Mon-Khmer languages?</p>
<p>At the end of the day, nobody really knows.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2026/02/small_rock.gif" alt="Hanzi of the day! 聞">
<center><font size="-1">Or maybe they were predicting what news The Rock would bring up. That's my favourite explanation at the very least.</font></center>
<p>Have a nice day!</p>
<p>Nick(y)</p>
<p><font size="-4"><center>Image sources: personal uknonwn <a href="http://www.renlu.net/html/jiaguwenzidian_2324.html" target="_blank">renlu</a> <a href="https://tenor.com/view/the-rock-smell-wwe-gif-1699101244068475737" target="_blank">tenor</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Will concussions ever stop sucking? 3 year update.]]></title><description><![CDATA[More than three years have passed, and still my brain is not right.]]></description><link>https://nbogoychev.com/will-concussions-ever-stop-sucking/</link><guid isPermaLink="false">68f4584c8ce6010590c74bf5</guid><category><![CDATA[life]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 19 Oct 2025 06:24:00 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2025/10/cyber-brain.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2025/10/cyber-brain.jpg" alt="Will concussions ever stop sucking? 3 year update."><p>It's time for me to give a brief overdue update at the 3.7 year mark of my concussion accident. For those interested, here are the <a href="https://nbogoychev.com/concussions-suck-my-story/" target="_blank">4 month update</a> and the <a href="https://nbogoychev.com/concussions-still-suck-2-year-update/" target="_blank">2 year update</a>.</p>
<p>It's important to document how my recovery is going, both to serve me as a reminder that things are getting better, and to give hope to others: Things will get better!</p>
<h1 id="iamalmostnormaluntiliamnot">I am almost normal... Until I am not</h1>
<p>What is remarkable about this stage of my recovery is that I can now feel like my old self at times. I can wake up without the numbness in my face, I can look at my coworker's 60 Hz monitor, I can go to a concert, I can read a bit...  Well, maybe not reading...</p>
<p>And that's the part where it gets precarious. Just because <em>momentarily</em> I feel like my old self, doesn't mean that I am my old self. Inevitably, I will try to do something my old self would have enjoyed and then suffer for it:</p>
<ul>
<li>I used to love roller coasters and lunaparks. Back in December, I got on one for the first time in years, and while it wasn't a particularly intense one, my head didn't feel right for hours afterwards.</li>
<li>I used to headbang, just like any other metal fan. Well, I tried to do it again at a karaoke, and, well, it didn't go well. It set me back more than a full year, with a constant headache that, no joke, lasted months. I will never do that again.</li>
<li>I play video games. Well, most of them. Every now and then an artistic entry such as <a href="https://en.wikipedia.org/wiki/Return_of_the_Obra_Dinn" target="_blank">Return of the Obra Dinn</a> will give me hours-long motion sickness.</li>
<li>I watch animation again. Flashy new ones are fine; what trips me up are the old ones, where the animation is only 12 frames per second. No <a href="https://www.imdb.com/title/tt0417299/" target="_blank">Avatar</a>/<a href="https://en.wikipedia.org/wiki/Naruto" target="_blank">Naruto</a>/<a href="https://www.imdb.com/title/tt1219827/" target="_blank">Ghost in the Shell</a> for me: my brain can't process the moving pictures into a smooth movie and I get incredibly dizzy after even a few minutes of watching.</li>
<li>I watch movies. Most of them are fine. But some are not. Notably, some of the slow zoom corridor scenes in <a href="https://www.imdb.com/title/tt17526714/" target="_blank">The Substance</a> made me feel unwell. But it's OK, since modern TVs have artificial smoothing (also known as the dreaded soap opera effect). It's a life saver.</li>
</ul>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/10/alicia.png" alt="Will concussions ever stop sucking? 3 year update.">
<center><font size="-2">Alicia sunk into a dream world to escape her broken self. I understand why she did that and it's a constant struggle not to.</font></center>
<p>At the end of the day, I am not back to normal, I am still different, but the focus is always on what is getting better. Looking to the past is the path to depression.</p>
<h1 id="readingthebaneofmyexistence">Reading. The bane of my existence.</h1>
<p>I used to love reading, and it sucks that I can't do much of it. My eyes are still not good at the left-right movement (thanks, horizontal <a href="https://en.wikipedia.org/wiki/Nystagmus" target="_blank">nystagmus</a>), which makes reading books incredibly challenging. The hardest thing for me is long lines with single spacing. As long as the text window is narrow enough (or the line short), and there's plenty of spacing, my eyes can focus on the text with relative ease and it doesn't bother me.</p>
<p>Thanks to these little concussion life hacks, I am able to code and just about cope with my normal work duties, although reading academic papers can get quite tiring. Normal paperback books are out of the question almost entirely...</p>
<p>I should treat reading as physio exercises for injury recovery. A page a day to keep the doctor away... And wait for that neuroplasticity to rewire my brain in just the right way.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/10/wiring.jpeg" alt="Will concussions ever stop sucking? 3 year update.">
<center><font size="-2">An actual image of my neuron connections, some of them hanging on pure optimism, some of them just hanging...</font></center>
<p>But it is hard to do this, because if I overdo the reading I will suffer for it: I won't be able to work for the next few days, and it's scary. It's hard to justify doing this for the recreational benefit of reading.</p>
<h1 id="othervictories">Other victories</h1>
<p>It's not all gloom and doom! I am getting back to my hobbies and I am more confident at work.</p>
<ul>
<li>I am able to read Chinese again. I resumed language classes! Recalling characters just a year ago left me with debilitating headaches even one sentence in. Now, it seems to be fine.</li>
<li>I feel I can do complicated things again; it feels like I finally have enough concentration to go through a complicated codebase and hold it in my head. I can't work as much as I used to, but I hope that will come back.</li>
<li>I am slowly looking at sports again. I've avoided them due to the potential of hitting my head, but with enough precaution, maybe it can be fine? I am tentatively looking at acrobatics, and perhaps I can even play football again, with a <a href="https://en.wikipedia.org/wiki/Petr_%C4%8Cech" target="_blank">Petr Čech</a> style helmet.</li>
</ul>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/10/petr_cech.jpg" alt="Will concussions ever stop sucking? 3 year update.">
<center><font size="-2">Dude had a hole in his head and recovered, surely so can I.</font></center>
<p>Things are getting better. Maybe not as fast as I hope, but they are. I am better than 1 year ago. And 1 year ago, I was better than 2 years ago. The important thing is to not lose hope and to not overdo it. Listen to my body, relax when it tells me to relax and let it tell me which activities it can do for now and which it can't. And of course, persevere. It's not <em>I can't do this</em>, it's <em>I can't yet</em>.</p>
<p>Take care,<br>
Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/photos/cyber-brain-computer-brain-7633488/" target="_blank">pixabay</a> <a href="https://www.expedition33.com/" target="_blank">Expedition 33</a> <a href="https://commons.wikimedia.org" target="_blank">wikipedia</a> <a href="https://en.wikipedia.org/wiki/Petr_%C4%8Cech" target="_blank">wikipedia</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 后]]></title><description><![CDATA[The one who stays behind the king...]]></description><link>https://nbogoychev.com/hanzi-of-the-day-yong/</link><guid isPermaLink="false">683322018ce6010590c74a7e</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 08 Jun 2025 22:34:01 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2025/06/cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2025/06/cover.jpg" alt="Hanzi of the day! 后"><p>A long time ago, when the eternal Elizabeth II ruled the domain, I was writing my Chinese homework and looked up the word <strong>queen</strong> in the dictionary. What I found, <a href="https://en.wiktionary.org/wiki/王后" target="_blank">王后</a> <em>wánghòu</em>, was particularly hilarious, because:</p>
<ul>
<li><a href="https://en.wiktionary.org/wiki/王" target="_blank">王</a> <em>wáng</em>, aside from one of most common Chinese sirnames, means <strong>king</strong></li>
<li><a href="https://en.wiktionary.org/wiki/后" target="_blank">后</a> <em>hòu</em> means <strong>behind</strong></li>
</ul>
<p>Technically, this word means &quot;Queen Consort&quot; and not an actual ruler, but still, very nice, very sexist, China; so the queen is literally defined as the one who is behind the king. The truth, however, is much more... Interesting? Worse? I'll let you decide.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/06/king.jpg" alt="Hanzi of the day! 后">
<center><font size="-2">A picture of a queen. She's not visible as she's behind the king...</font></center>
<p>As some of you may know, Chinese characters in mainland China went through a process called simplification, which attempted to improve the literacy rate of the population. Whether it succeeded, failed or had literally no effect is a matter of scholarly debate, but the end result is that some hanzi now have two forms: traditional and simplified.</p>
<p>One of the ways in which simplification was performed was by <a href="https://en.wikipedia.org/wiki/Simplified_Chinese_characters#Structural_simplification_2:~:text=Readopting%20abandoned%20phonetic%2Dloan%20characters%3A" target="_blank">adopting a phonetic loan</a>, and this is precisely what happened to the original character for <strong>behind</strong>. Originally spelled <a href="https://en.wiktionary.org/wiki/後" target="_blank">後</a>, it was simplified to <a href="https://en.wiktionary.org/wiki/后" target="_blank">后</a>, as they both share the same pronunciation <em>hòu</em>, but the latter is much easier to write.</p>
<p>The original meaning of <a href="https://en.wiktionary.org/wiki/后" target="_blank">后</a> is <strong>queen</strong> and has <s>nothing</s> less to do with behinds. The glyph depicts a woman giving birth to the heir of the throne:</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/06/glyph_origin_combined.png" alt="Hanzi of the day! 后">
<center><font size="-2">Ancient forms of the hanzi. Giving birth while sitting on the toilet? A curious depiction.</font></center>
<p>Now this is so much better. A queen is not defined by her position relative to the king, but by her 3D printing capabilities.</p>
<p>Looking at how the glyph developed through the centuries, I can't shake off the feeling that it looks like the woman is sitting on the toilet. And indeed, some scholars argue that the glyph depicts a person and a hole, meaning <strong>rear</strong>, <strong>behind</strong> or <strong>anus</strong>. The character was sometimes used to represent this precise meaning in the <a href="https://en.wikipedia.org/wiki/Oracle_bone" target="_blank">oracle bone era</a>.</p>
<p>If this is true though, why does it also mean <strong>queen</strong>? Does it indeed depict a childbirth? Were ancient sages unfamiliar with female anatomy? Was it up to debate where the baby came from? Had they even seen a woman? We will never know.</p>
<p>Have a good day!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/photos/creative-theater-queen-georgia-1041597/" target="_blank">pixabay</a> <a href="https://commons.wikimedia.org" target="_blank">wikipedia</a> <a href="https://www.pexels.com/photo/close-up-photo-of-playing-cards-1796794/" target="_blank">pexels</a> </center></font></p>
<p>EDIT:</p>
<p>An astute reader has pointed out that this character means a queen consort, and not an actual ruler. A queen that rules is 女王 or &quot;Female King&quot;, which is not as bad as &quot;Birther of Princes&quot;.</p>
<!--kg-card-end: markdown--><p></p>]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 舞 and 無]]></title><description><![CDATA[Dancing for the rain...]]></description><link>https://nbogoychev.com/hanzi-of-the-day-13/</link><guid isPermaLink="false">5fba6adb497471067eb029b9</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 18 May 2025 21:45:43 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2025/05/dance.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2025/05/dance.jpg" alt="Hanzi of the day! 舞 and 無"><p>Hello everyone! After some hiatus, I decided I should be more serious about blogging. After all, every blog written is one character learned! Which means that I just need to write about 4986 more blog posts before I can finally read a bloody book in Chinese without consulting a dictionary on every sentence, but I digress... Today we are going to talk about Dancing. And nothingness.</p>
<p><a href="https://en.wiktionary.org/wiki/舞" target="_blank">舞</a> <em>wǔ</em> means <strong>to dance</strong> and is a rather complicated looking hanzi that is taught in beginner level Chinese. The character has a ritualistic origin: A person holding ox tails (or feathers), performing a rain dance! (A bit more obvious in the historical forms of the character shown below.)</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/05/Screenshot_20250518_213015.png" alt="Hanzi of the day! 舞 and 無">
<center><font size="-2">Which makes sense, dances are often an important part of rituals.</font></center>
<p>Now, nothing surprising so far, until I came across the phrase 無情 as I was reading some beginner books. I thought: I know 舞 is dance and 情 is feeling, so this must be someone really happy or gracious. I checked it in the dictionary just in case, and I was very surprised to see it meant <strong>heartless</strong>. WHAT?!?</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/05/shock.jpg" alt="Hanzi of the day! 舞 and 無">
<center><font size="-2">An actual image of me checking the dictionary.</font></center>
<p>Well, it turns out that <a href="https://en.wiktionary.org/wiki/無" target="_blank">無</a> <em>wú</em>, while visually very similar to <a href="https://en.wiktionary.org/wiki/舞" target="_blank">舞</a>, is actually a completely different character with a distinct meaning: the absence of something, nothingness, sort of similar to the English prefix <strong>un-</strong>.</p>
<p>Historically <a href="https://en.wiktionary.org/wiki/無" target="_blank">無</a> did indeed mean <strong>to dance</strong>, but that form was <s>borrowed</s> stolen to mean negation, as this word is more common. And thus, there was no character left to dance!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/05/empty_dancefloor.jpg" alt="Hanzi of the day! 舞 and 無">
<center><font size="-2">A sad, empty dancefloor...</font></center>
<p>What's the solution? Slap a pair of legs <a href="https://en.wiktionary.org/wiki/舛" target="_blank">舛</a> under <a href="https://en.wiktionary.org/wiki/無" target="_blank">無</a> to form <a href="https://en.wiktionary.org/wiki/舞" target="_blank">舞</a> and call it a day!</p>
<img align="middle" width="50%" src="https://nbogoychev.com/content/images/2025/05/voltron_small.gif" alt="Hanzi of the day! 舞 and 無">
<center><font size="-2">Just like in Voltron!</font></center>
<p>That's it from me, have a nice week!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://www.pexels.com/photo/5-women-in-white-dress-dancing-under-gray-sky-during-sunset-175658/" target="_blank">pexels</a> <a href="https://commons.wikimedia.org" target="_blank">wikipedia</a> <a href="https://www.pexels.com/photo/man-in-brown-leather-jacket-using-binoculars-3811807/" target="_blank">pexels</a> <a href="https://www.flickr.com/photos/mccaffry/5170431771" target="_blank">flickr</a> <a href="https://www.imdb.com/title/tt0086824/" target="_blank">voltron</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪]]></title><description><![CDATA[Today's edition of Hanzi of the day explores the power of logography when it comes to... profanities!]]></description><link>https://nbogoychev.com/hanzi-of-the-day-15/</link><guid isPermaLink="false">66f07a838ce6010590c74732</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Mon, 23 Sep 2024 00:14:14 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2024/09/cursing.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2024/09/cursing.jpg" alt="Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪"><p>Hello hanzi lovers!<br>
It has been a while since I last wrote about my beloved Chinese characters. Sadly, life and work got in the way. Incidentally, it is my latest work that brought me to these characters. Don't ask...</p>
<p>Chinese characters in their purest form represent ideas about concepts, which is why they are called ideograms. Let us figure out the meanings of 𨳒, 𨳊, 𨳍 and 閪 just by looking at the images.</p>
<p>The first thing that comes to mind is that there is a common element, <strong>門</strong>. The meaning of this element is a door.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/09/saloon.jpeg" alt="Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪">
<center><font size="-2">It resembles a classic cowboy style saloon door.</font></center>
<p>Once characters are created, they can change in meaning, or assume a new meaning in a certain context. This mix-n-match strategy is more tractable than inventing a new character for every single concept. In this case, the meaning of <strong>門</strong> refers strictly to a body part (or rather a section of the body). I am sure everyone will eventually figure out that it represents the <em>hips</em> that, according to Shakira, do not lie.</p>
<p>What about the remaining components? 小, 九, 七 and 西 have their own individual meanings, but the important part here is that their pronunciations in Cantonese (<em>siu2 gau2 cat1 sai1</em>) rhyme with <em>diu2 gau1 cat1 hai1</em>. When put together with 門 they represent what is located <em>between</em> the hips, and that's how you get 𨳒, 𨳊, 𨳍 and 閪. Incredibly, even though the primary purpose of 小, 九, 七 and 西 is to serve as a phonetic component, they also make sense from a purely ideographic perspective!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/09/idea.jpg" alt="Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪">
<center><font size="-2">The sheer ingenuity!</font></center>
<ul>
<li>
<p><a href="https://en.wiktionary.org/wiki/%F0%A8%B3%92" target="_blank">𨳒</a> <em>diu2</em> originally referred to the male genitalia (obviously), but the meaning later evolved to the verb <em>fuck</em>, the bread-and-butter of insults. It is used in 𨳒你老母 <em>diu2 nei5 lou5 mou2</em>, probably the most well known Cantonese phrase among the western population. I apologise on the behalf of white people.</p>
</li>
<li>
<p><a href="https://en.wiktionary.org/wiki/%F0%A8%B3%8A" target="_blank">𨳊</a> <em>gau1</em> represents the erect penis meaning <em>cocky</em> or just plain <em>stupid</em>. A friend of mine pointed out that it is specifically <em>an erect penis when it's not supposed to be</em> which also conveys the meaning of inexperience or immaturity. Pay attention to the upward pointing bit of 九 and consider how it contrasts with 七, used in 𨳍.</p>
</li>
<li>
<p><a href="https://en.wiktionary.org/wiki/%F0%A8%B3%8D" target="_blank">𨳍</a> <em>cat1</em> on the other hand corresponds to the impotent <em>penis</em>, because as we all know there is no bigger insult you can hurl towards a man. It means something that <em>ugly</em> or <em>shameful</em>, but appears in variety of other insults.</p>
</li>
<li>
<p><a href="https://en.wiktionary.org/wiki/%E9%96%AA" target="_blank">閪</a> <em>hai1</em> is dedicated to the female genitalia, and is used in a context similar to <em>cunt</em>.</p>
</li>
</ul>
<p>I am honestly, genuinely impressed that <em>working</em> and <em>not working</em> penis deserve two separate characters. Evidently distinguishing between the two is important enough to deserve its own vocabulary item. Also, three whole characters for male genitalia, but only one for female? Sexism much?</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/09/equality2.jpg" alt="Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪">
<center><font size="-2">Equality for all! More female genitalia characters!</font></center>
<p>These are relatively new characters, used exclusively in Hong Kong and Macau. They don't really work in Mandarin because the phonetic elements don't make sense, and you would not find all of them in Mandarin-centric dictionaries.</p>
<p>In Mandarin, the construction of those words has occurred in a somewhat similar fashion, using the <em>body</em> 尸 radical plus a phonetic element:</p>
<ul>
<li>
<p><a href="https://en.wiktionary.org/wiki/%E5%B1%8C" target="_blank">屌</a> <em>diǎo</em> literally something <em>hanging</em> 吊 from the <em>body</em> 尸. Quite descriptive. It is likely derived from 鳥 <em>niǎo</em>, as in Chinese words for birds also mean penis. Go figure.</p>
</li>
<li>
<p><a href="https://en.wiktionary.org/wiki/%E5%B1%84" target="_blank">屄</a> <em>bī</em> literally a <em>body</em> 尸 + a <em>hole</em> 穴.</p>
</li>
</ul>
<p>This is by no means an exhaustive list of penis and vagina characters, as the list in both Mandarin and Cantonese could go on and on. Nevertheless, I hope you enjoyed learning how logophonetic characters are constructed in a way that also makes sense when considered purely as an image!</p>
<p>Until next time!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://www.flickr.com/photos/christopherdale/23860378" target="_blank">flickr</a> <a href="https://commons.wikimedia.org/wiki/File:Red_Dog_Saloon_31.JPG" target="_blank">wikipedia</a> <a href="https://pixabay.com/illustrations/ai-generated-lightbulb-idea-8593862/" target="_blank">pixabay</a> <a href="https://www.pexels.com/photo/man-in-blue-denim-jacket-holding-brown-cardboard-with-equality-text-5935746/" target="_blank">pexels</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Concussions still suck: 2 year update]]></title><description><![CDATA[2 years later, my concussion still plagues me. I am not fully recovered yet, but I am better.]]></description><link>https://nbogoychev.com/concussions-still-suck-2-year-update/</link><guid isPermaLink="false">66abe2c8369afdcac26f7d5c</guid><category><![CDATA[life]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 18 Aug 2024 21:23:03 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2024/08/brain.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2024/08/brain.png" alt="Concussions still suck: 2 year update"><p>It has been two years since my previous <a href="https://nbogoychev.com/concussions-suck-my-story/" target="_blank">concussion post</a>, and I should give an update. Why you ask? It's not (only) because of narcissism but because I actually received emails from readers of my blog asking me whether I got better.</p>
<p>The thing is, bad concussions, especially the ones that come with <a href="https://en.wikipedia.org/wiki/Post-concussion_syndrome" target="_blank">post concussion syndrome</a>, are extremely tough on one's mental health. Since doctors can't give us headbangers a silver bullet solution to our predicament, we inevitably scour the Internet for information. Most importantly though, it is not really information we are all looking for. It's <strong>hope</strong>. I know I was.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/hope.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">Light in the end of the tunnel, while not guaranteed, is probable.</font></center>
<p>It's important to update the story and tell people the good and the bad, because things do get better and there is always hope.</p>
<h2 id="theneurologist">The neurologist</h2>
<p>In July 2022, since I was nowhere near getting better, I visited a neurologist and had an MRI scan done. The MRI apparently showed that I had a small gliotic focus, a physical manifestation of banging up one's brain. According to the <a href="https://en.wikipedia.org/wiki/Gliosis" target="_blank">Internet</a>, it's akin to <a href="https://www.ncbi.nlm.nih.gov/books/NBK326735/" target="_blank">scar tissue</a>, but in your brain, with a protective, not fully understood function?</p>
<p>I asked the neurologist about it and he told me not to worry, take some rest and a chill pill. I mean, I know rest is important but surely there must be something else I can do, right? Well, he did actually prescribe me something.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/chill.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">It seems that, according to doctors 99% of your problems will just go away if you simply just ignore them. I wonder if concussions aren't in the other 1% though...</font></center>
<h2 id="thedrugs">The drugs</h2>
<p>I was prescribed <a href="https://en.wikipedia.org/wiki/Piracetam" target="_blank">Piracetam</a> to take twice a day. Again, reading up on the Internet, its mechanism of action is to increase blood flow to the brain and supposedly helps older people in cognitive decline, which, technically, I was. My brain couldn't cope with basic functions such as reading and looking at a screen.</p>
<p>I was worried about taking it for prolonged periods of time, especially given the lengthy list of side effects including but not limited to tremors, anxiety, insomnia, hypersexuality... But then I found reassurance in the most unlikely of places: tech bros.</p>
<p>Piracetam is considered a nootropic drug that Silicon Valley biohackers take 3-5 times a day in order to get that little extra edge over their peers. That is like 300% of my prescribed dose, and apparently those people are just fine?! God, that place is a dystopian nightmare, but I digress.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/sillicone_valley.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">This series is way too accurate.</font></center>
<h2 id="thelight">The light</h2>
<p>I started taking Piracetam and I immediately got the tremors and the jitteriness. It feels a bit like having too much coffee, except your heart is not affected. But boy did it help with EVERYTHING. I spent a full day coding. Yeah, it was extremely exhausting, much more so than I was used to, but I managed to do work. No crippling dizziness, no noise in my head, just a short spell of... Normalcy.</p>
<h2 id="thedarkness">The darkness</h2>
<p>Even one pill was too much for me, in terms of side effects, so I decided I'd take only one a day. Even so, the insomnia was horrible. I could not fall asleep at all. Maybe 10-15 minutes here and there, and the rest of the night was spent staring at the ceiling contemplating the universe. I read online that the side effects gradually disappear as your body adjusts, and this was true, but sadly so did the positive effects...</p>
<p>I started playing video games again in August 2022, 5 months after my accident, and I had to stop again in October because my brain wouldn't allow for it anymore. I did watch a few animes, but by October that also became unbearable. I was despairing again about my predicament, feeling like I would never be normal again.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/drowning.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">My feeling at the time</font></center>
<p>I was definitely getting better, but not as fast or as much as I wanted. There were always good days and bad days. At best I was enjoying a challenging but functional day at work and then collapsing from exhaustion once I came home. At worst, I would be unable to understand what other people were saying to me, somehow make it home and wake up a few hours later completely unable to recognise where I was, with no memory of how I got there. I was in my bedroom, in my bed, in my home of 5 years. That shit still haunts me.</p>
<h2 id="thebreakthrough">The breakthrough</h2>
<p>Long haul flights were a bit of a challenge as I couldn't really make use of the in-flight entertainment. On one particularly boring flight I noticed a curiosity. I was in an aisle seat on the left side of the plane, and if I looked to my right I could watch the movies other passengers were enjoying (almost) without any ill effect, but if I looked to my left, it was very obviously much worse. Also, the further away the screen was from me (2 rows in front of me, even 3), the better. Apparently the damage was on the right side of my brain (actually I had that information from my MRI but hadn't connected the dots earlier). I tried covering up my left eye and indeed I felt a momentary relief.</p>
<p>What a remarkable discovery!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/eureka.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">Eureka!</font></center>
<p>Once I got back from my brief stint abroad, I immediately put my monitor on the right side of the desk and moved my computer sitting spot about 2 meters away from my screen. And then! WoW! It just, <em>worked</em>. I was able to spend more and more time on the computer doing more visually intense things.</p>
<p>Over the next few weeks I slowly started to reduce my distance from the computer in an effort to teach myself how to be normal again. I watched my first movie since the accident (<em>Everything Everywhere All at Once</em>), sitting sideways on my couch.</p>
<p>I went to the cinema for the first time since the calamity, sat sideways and enjoyed a movie (<em>Dungeons and Dragons</em>). I looked weird, people looked at me, but I didn't care... I WAS AT THE MOVIES AGAIN! I was living again. (Technically, I was living beforehand as well, but... ups and downs, depression and all.)</p>
<h2 id="itsstillnotover">It's still not over</h2>
<p>Right now, I am fully functional. I can't work as much as I did before, but I can do a job and be considered a productive member of society. I still can't enjoy reading though...</p>
<p>Books are really hard for me. My eyes just refuse to do that left-right movement and I have to go about it very slowly. One week I decided to push through the discomfort in order to read <em>The Three-Body Problem</em> and... It was bad. I relapsed completely, with brain fog, memory gaps, extreme tiredness, inability to work, inability to even look at a screen.</p>
<p>This episode lasted about a month and I am extremely wary of reading now. I need to slowly introduce it to my delicate brain. It's even worse for learning languages. Reading in a foreign language triggers numbness on the left side of my head even after just a few minutes. I live knowing full well that if I am not careful I will relapse and go back months with my progress.</p>
<p>Last December I decided to go on a carousel to see if I could potentially go to a lunapark. Spoiler alert: I can't. I have vivid memories of that night lying on my bed and wondering when the ceiling would stop spinning so I could finally bloody fall asleep...</p>
<h2 id="disclaimer">Disclaimer</h2>
<p>Every concussion is individual. This is what worked for me, but it's not necessarily what would work for another person in a similar situation. First, listen to your doctor and to your body. If your body is telling you something is bad for you, don't do it.</p>
<p>But most importantly don't despair! There is <strong>hope</strong>. Do the things that you can do, and look for new hobbies. It's not about getting back to 100% your old self right away. It's about enjoying life to the best of your ability one day and one activity at a time.</p>
<p>And finally, talk to your friends and the people who love you. This was certainly the darkest time in my life (so far) and I wouldn't be able to make it without them! Thank you everyone!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/homelander.gif" alt="Concussions still suck: 2 year update">
<center><font size="-2">I really do mean it!</font></center>
<p><b><a href="https://nbogoychev.com/will-concussions-ever-stop-sucking/" target="_blank">One year later update</a></b></p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/illustrations/futuristic-brain-cyborg-technology-8789975/c" target="_blank">pixabay</a> <a href="https://pixabay.com/photos/hands-open-candle-candlelight-1926414/" target="_blank">pixabay</a> <a href="https://pixabay.com/photos/tree-hammocks-trees-grass-summer-413714/" target="_blank">pixabay</a> <a href="https://www.imdb.com/title/tt2575988/" target="_blank">Silicon Valley</a> <a href="https://pixabay.com/photos/person-drowning-water-hand-drown-5708301/" target="_blank">pixabay</a> <a href="https://pixabay.com/illustrations/brain-eureka-think-thinking-8573309/" target="_blank">pixabay</a> <a href="https://www.imdb.com/title/tt1190634/" target="_blank">The Boys</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics]]></title><description><![CDATA[Can we boost LLM inference speed by applying this one machine translation trick they don't want you to know..?]]></description><link>https://nbogoychev.com/the-ups-and-downs-of-large-language-model-inference-with-vocabulary-trimming-by-language-heuristics/</link><guid isPermaLink="false">6686e8be369afdcac26f7b79</guid><category><![CDATA[research]]></category><category><![CDATA[large language models]]></category><category><![CDATA[LLM]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 07 Jul 2024 20:08:38 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2024/07/trimming.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2024/07/trimming.jpg" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics"><p>One of the main barriers to Large Language model deployment is the cost of inference. Lowering the computational footprint without hurting the quality of the model is an extremely hot topic in research due to the <a href="https://www.firstpost.com/tech/news-analysis/openai-may-go-bankrupt-by-2024-chatgpt-costs-company-700000-dollars-every-day-12986012.html" target="_blank">staggering</a> costs that serving large language models entail.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2024/07/burning.jpg" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics">
<center><font size="-2">Footage from OpenAI's headquarters when someone asks ChatGPT what's the time of the day...</font></center>
<h3 id="theidea">The idea</h3>
<p>As <s>Machine Translation</s> Large Language Model researchers, we turned our attention to the obvious culprit: the output layer, which represents the vocabulary of a Large Language Model. It has dimensions <em>H</em>x<em>V</em>, where <strong>H</strong> is the hidden layer dimension of the model and <strong>V</strong> is the vocabulary size. <strong>V</strong> is massive. For monolingual models such as LLaMa it's around 30,000 tokens, but for multilingual models such as Bloom, it is more than 250,000! The output layer is the single largest matrix in the model, consumes a lot of memory, and its multiplication is quite costly.</p>
<p>In practice, however, we <strong>never</strong> make use of the full vocabulary at the same time. Most large language model queries and generations only contain a few dozen tokens. If we could only somehow know in advance which tokens would be used during a generation, we could dynamically trim the vocabulary to a fraction of its full size.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2024/07/cuuuut.jpg" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics">
<center><font size="-2">When LLM performance gives you lemons...</font></center>
<h3 id="theimplementation">The implementation</h3>
<p>Dynamically reducing the size of the output layer is commonly used to speed up machine translation, as we can easily predict which words are going to be used in a translation, but the output of LLMs can be unbounded. How do we filter their vocabulary?</p>
<p>We can make the assumption that if a question is asked in English, the reply would also need to be in English. So we could reduce the vocabulary to only the tokens necessary for producing that language. We came up with not one but two separate ideas about how to achieve this!</p>
<h4 id="unicodebasedtrimming">Unicode based trimming</h4>
<p>Use the alphabet. LLMs, especially multilingual ones, contain vocabulary for multiple languages, which are written in different scripts. We can remove all vocabulary items that are written in a different script from the one our target language uses. We call this the <strong>unicode</strong> method, as we filter the vocabulary based on Unicode ranges.</p>
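<p>For the curious, here is a minimal sketch of what such a filter might look like in Python. The model name, the allowed script set and the bucketing trick are illustrative rather than the exact setup from the paper; the idea is simply to keep every vocabulary entry whose characters belong to the target language's script.</p>
<pre><code>import unicodedata
from transformers import AutoTokenizer

# Illustrative model choice; any tokenizer exposing get_vocab() works the same way.
tok = AutoTokenizer.from_pretrained('bigscience/bloom-560m')

def script_of(ch):
    # Crude script bucketing via the Unicode character name,
    # e.g. 'CYRILLIC SMALL LETTER A' -&gt; 'CYRILLIC'.
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return 'UNKNOWN'

def unicode_trim(tokenizer, allowed_scripts=frozenset({'LATIN', 'DIGIT'})):
    keep = set()
    for token, idx in tokenizer.get_vocab().items():
        text = tokenizer.convert_tokens_to_string([token])
        scripts = {script_of(c) for c in text if not c.isascii()}
        # Pure ASCII/punctuation tokens are always kept; everything else must
        # be written entirely in one of the allowed scripts.
        if scripts.issubset(allowed_scripts):
            keep.add(idx)
    return keep

allowed_ids = unicode_trim(tok)
print(f'kept {len(allowed_ids)} of {len(tok.get_vocab())} tokens')
</code></pre>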
<h4 id="corpusbasedtrimming">Corpus based trimming</h4>
<p>Use a small representative corpus to build a dictionary. For example, build a dictionary from newspaper articles in the language you are interested in. This method has the advantage of letting through words that could be spelled in a different script (e.g. named entities from foreign countries may be spelled in a foreign script). We call this the <strong>corpus</strong> method.</p>
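<p>A sketch of the <strong>corpus</strong> method, reusing the tokenizer from the snippet above. The corpus file name is a placeholder: any small representative sample of target-language text works.</p>
<pre><code>def corpus_trim(tokenizer, corpus_path):
    """Build an allow-list from a small representative corpus,
    e.g. newspaper articles in the target language."""
    keep = set(tokenizer.all_special_ids)  # never drop BOS/EOS/PAD etc.
    with open(corpus_path, encoding='utf-8') as f:
        for line in f:
            keep.update(tokenizer(line.rstrip('\n'))['input_ids'])
    return keep

allowed_ids = corpus_trim(tok, 'news_sample.txt')  # placeholder corpus file
</code></pre>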
<p>Those heuristics are quite rough and we were sure we could come up with something better to further narrow the vocabulary, but we also wanted to produce an upper bound on how much performance we could hope to gain. In order to do that we performed an <strong>oracle</strong> experiment, where we ran a decoding pass over 50 examples, took note of the vocabulary they used, and then limited the model to only those vocabulary tokens. This results in a vocabulary of only a few thousand tokens, which would be difficult to achieve in practice.</p>
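<p>Whichever heuristic produces the allowed set of token ids, applying it boils down to slicing the output projection down to those rows and mapping the reduced ids back to the original vocabulary afterwards. A minimal PyTorch sketch, assuming a Hugging Face style model with an untied <code>lm_head</code>; a full implementation would also have to remap ids before they are fed back through the (untrimmed) input embedding during autoregressive decoding.</p>
<pre><code>import torch

def trim_output_layer(model, allowed_ids):
    """Replace the full H x V output projection with an H x V' slice.
    Returns the index tensor that maps reduced ids back to original ids."""
    keep = torch.tensor(sorted(allowed_ids))
    lm_head = model.get_output_embeddings()        # nn.Linear(H, V, bias=False)
    small = torch.nn.Linear(lm_head.in_features, len(keep), bias=False)
    with torch.no_grad():
        small.weight.copy_(lm_head.weight[keep])   # select the V' kept rows
    model.set_output_embeddings(small)
    return keep

# keep[reduced_id] gives back the original vocabulary id, which is what the
# tokenizer needs for detokenisation after generation.
</code></pre>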
<h3 id="theups">The ups</h3>
<p>We did get some ups!</p>
<ul>
<li>We managed to reduce decoding time by up to 20% (25% if we consider the <strong>oracle</strong> experiment), but only for (comparatively) small 560M parameter models. Bigger models see only a modest 5-10% reduction in decoding time.</li>
<li>A smaller vocabulary means less memory! We reduce the memory usage of the output layer by a factor of 10 in most cases!</li>
</ul>
<h3 id="thedowns">The downs</h3>
<ul>
<li>The speed increase is much lower in larger models, because the vocabulary plays a lesser role in their computational footprint than it does in small models (which are already quite fast on modern hardware).</li>
<li>Memory reductions are insignificant for large models, as the vocabulary represents just a tiny fraction of the total number of parameters...</li>
<li>The quality of the generation drops. We expected the reduced vocabulary to produce generations identical to the full vocabulary, but it turns out that things such as URLs and code samples always require Latin characters, which were not available to our Chinese and Cyrillic models, resulting in more mismatches (labelled as <em>misses</em> in the table) and poorer generation quality. Chinese seems to suffer a lot more in this regard.</li>
<li>The methods perform inconsistently across languages. Latin script languages have a harder time getting their vocabulary reduced by Unicode matching; likewise, it might be difficult to find a representative corpus for lower resource languages:</li>
</ul>
<img align="middle" width="100%" src="https://nbogoychev.com/content/images/2024/07/CPU_BLOOM.png" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics">
<center><font size="-2">For cases where we didn't get an exact match to the genration using the full vocabulary, we computed BLEU and ChrF to estimate the quality.</font></center>
<h3 id="conclusion">Conclusion</h3>
<p>We did achieve what we hoped for! We did improve generation speed.</p>
<p>However, the methods we developed have limited practical use: it's very difficult to guarantee quality with the reduced vocabulary, which is a major showstopper. Furthermore, for large models the size of GPT-4, the computational cost of the output layer is a tiny fraction of the cost of attention.</p>
<p>Oh well, it's not all bad. Our methods could be useful for small models in memory constrained scenarios. We also saved other researchers time by <a href="https://aclanthology.org/2024.insights-1.17/" target="_blank">publishing</a> in the <a href="https://insights-workshop.github.io/" target="_blank">Workshop on insights from negative results in NLP</a>!</p>
<p>Oh, and we got the best paper award!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/07/best_ppr.png" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics">
<center><font size="-2">Our small vanity corner.</font></center>
<p>If you are interested in the details, read the <a href="https://aclanthology.org/2024.insights-1.17/" target="_blank">paper</a> and cite us!</p>
<pre><code>@inproceedings{bogoychev-etal-2024-ups,
    title = &quot;The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics&quot;,
    author = &quot;Bogoychev, Nikolay  and
      Chen, Pinzhen  and
      Haddow, Barry  and
      Birch, Alexandra&quot;,
    editor = &quot;Tafreshi, Shabnam  and
      Akula, Arjun  and
      Sedoc, Jo{\~a}o  and
      Drozd, Aleksandr  and
      Rogers, Anna  and
      Rumshisky, Anna&quot;,
    booktitle = &quot;Proceedings of the Fifth Workshop on Insights from Negative Results in NLP&quot;,
    month = jun,
    year = &quot;2024&quot;,
    address = &quot;Mexico City, Mexico&quot;,
    publisher = &quot;Association for Computational Linguistics&quot;,
    url = &quot;https://aclanthology.org/2024.insights-1.17&quot;,
    pages = &quot;148--153&quot;,
}
</code></pre>
<p>Cya,</p>
<p>Nick, Patrick, Lexi, Barry</p>
<p><font size="-4"><center>Image sources: <a href="https://www.pexels.com/photo/gardener-cutting-branches-of-tree-in-garden-5231048/
" target="_blank">pexels</a> <a href="https://www.pexels.com/photo/person-holding-burning-money-7230878/" target="_blank">pexels</a> <a href=" https://www.pexels.com/photo/photo-of-person-slicing-lemon-952368/" target="_blank">pexels</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting]]></title><description><![CDATA[How do we get the best terminology machine translation system? Well, I personally don't know but asking ChatGPT nicely about it doesn't hurt...]]></description><link>https://nbogoychev.com/terminology-machine-translatioton-with/</link><guid isPermaLink="false">6561078a369afdcac26f7910</guid><category><![CDATA[code]]></category><category><![CDATA[machine translation]]></category><category><![CDATA[research]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Thu, 30 Nov 2023 14:22:51 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2023/11/logo.svg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2023/11/logo.svg" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting"><p>Language is a source of great misunderstandings. Translation, even more so. Machine translation... Well we all know how that goes:</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/smaller.jpg" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting">
<center><font size="-2">It should be <i>dried</i> vegetables, but nameless translation service provider knows better...</font></center>
<p>Now, jokes aside, here is the problem. Words have different meanings in different domains and contexts. It's impossible for translators to know all possible domains and contexts, so they make use of <em>terminology dictionaries</em>.</p>
<p>The <a href="https://wmt-terminology-task.github.io/" target="_blank">WMT 2023 terminology shared task</a> challenges researchers to apply those <em>terminology dictionaries</em> to Machine translation, and we answered the call with several distinct systems:</p>
<h3 id="terminologyawareneuralmachinetranslation">Terminology-aware neural machine translation</h3>
<p>The main idea is that we want to teach the neural network model to accept <em>hints</em> from the user about how to translate certain phrases. For example, if given the input:</p>
<blockquote>
<p>Was ist Ihr Herkunftsland?</p>
</blockquote>
<p>The model would produce:</p>
<blockquote>
<p>What is your country of origin?</p>
</blockquote>
<p>Which is correct, but we may want to influence the model to produce a less formal translation:</p>
<blockquote>
<p>Was ist Ihr Herkunftsland __target__ homeland __done__?</p>
</blockquote>
<p>So that the translation changes to:</p>
<blockquote>
<p>What is your homeland?</p>
</blockquote>
<p>The neural network requires a large number of <code>GERMAN_WORD __target__ ENGLISH_WORD __done__</code> examples from a terminology dictionary during training, so that it can learn this behaviour. Unfortunately, we often don't have access to a good terminology dictionary, so we build one from our training data!</p>
<h4 id="wordalignmentbasedterminologydictionary">Word Alignment based terminology dictionary</h4>
<p>We use an IBM model to compute word alignments of our parallel training set, and then we take all the words with bijective mappings (that is to say each source word corresponds to exactly one target word) and use them as our pseudo terminology dictionary. Then, during training we randomly expose our model to 7% of those source-target terminology pairs using the <s>subliminal message</s> control sequences <code>SRC __target__ TRG __done__</code> on the source side. We do this using our awesome <a href="https://github.com/hplt-project/OpusTrainer" target="_blank">OpusTrainer</a> <a href="https://nbogoychev.com/opuscleaner-and-opustrainer-machine-translation-training-made-easy/" target="_blank">[blog]</a> <a href="https://arxiv.org/abs/2311.14838" target="_blank">[paper]</a>.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/yuri_transparent.png" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting">
<center><font size="-2">A real time image of me <i>trying</i> to give subliminal messages to my neural network.</font></center>
<p>Now, this is all good, and it works quite well in practice: we get up to a 75% terminology success rate using this approach, but we are not <em>guaranteed</em> to follow the terminology constraints: the model is free to ignore the suggestion. This is why we built two refinement approaches on top:</p>
<h3 id="negativelyconstrainedtranslation">Negatively constrained translation</h3>
<p>Since at inference time we have access to a terminology dictionary, we can figure out when a terminology constraint has not been followed, as it would not appear in the translation. We then use <a href="https://github.com/neulab/awesome-align" target="_blank">awesome-align</a> to figure out which word was used instead of our desired terminology word. We then perform translation again, but this time we <strong>forbid</strong> that problematic word from being produced.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/erase.jpg" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting">
<center><font size="-2">If we imagine our machine translation system's vocabulary as a dictionary, negatively constrained decoding <i>amputates</i> select words.</font></center>
<h3 id="askchatgptnicely">Ask-chatGPT-nicely</h3>
<p>The previous approach was quite complicated and convoluted. Since we are already in the era of LLMs, we can instead just use the <strong>ask-chatGPT-nicely</strong> method to refine a given translation with the desired terminology constraints. In fact, while we are at it, we decided to try and completely ditch the neural machine translation system and perform both translation and refinement using chatGPT.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/chatGPT.jpg" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting">
<center><font size="-2">State of NLP in 2023: All pray for solutions to our lord and saviour ChatGPT!</font></center>
<p>The process goes like this:</p>
<ul>
<li>Produce translation (either through our terminology-aware system, with terminology constraints, or through asking chatGPT)</li>
<li>Ask chatGPT to refine the translation, incorporating terminology constraints.</li>
</ul>
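<p>As an illustration only (the exact prompts used in the paper differ), the refinement step of this two-step process boils down to a prompt along these lines:</p>
<pre><code class="language-python">def refinement_prompt(source, draft, terminology):
    # terminology: dict mapping source terms to required target terms (hypothetical example).
    constraints = '; '.join(f'translate {s} as {t}' for s, t in terminology.items())
    return ('Improve the following translation so that it follows these '
            f'terminology constraints: {constraints}.\n'
            f'Source: {source}\n'
            f'Draft translation: {draft}\n'
            f'Improved translation:')
</code></pre>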
<h3 id="diditwork">Did it work?</h3>
<p>Sort of. We (UEDIN) submitted 3 separate systems: a terminology-aware base system by itself, one enhanced with ChatGPT, and another enhanced with negative constraints. Our terminology-aware system, by itself or with chatGPT refinement, produced the best tradeoff between following terminology and translation quality compared to competing systems, at least according to <a href="https://github.com/Unbabel/COMET" target="_blank">comet</a> automatic evaluation:</p>
<table>
<thead>
<tr>
<th></th>
<th>De-&gt;En</th>
<th>En-&gt;Cs</th>
<th>Zh-&gt;En</th>
</tr>
</thead>
<tbody>
<tr>
<td>UEDIN_LLM</td>
<td><strong>0.813</strong></td>
<td><strong>0.869</strong></td>
<td><strong>0.757</strong></td>
</tr>
<tr>
<td>UEDIN_TERM</td>
<td>0.809</td>
<td><strong>0.868</strong></td>
<td>0.757</td>
</tr>
<tr>
<td>OPUS-CAT</td>
<td>0.790</td>
<td><strong>0.869</strong></td>
<td>0.521</td>
</tr>
<tr>
<td>AdaptTerm</td>
<td>0.801</td>
<td>0.841</td>
<td>0.688</td>
</tr>
<tr>
<td>UEDIN_CONSTRAINT</td>
<td>0.792</td>
<td>0.835</td>
<td>0.650</td>
</tr>
<tr>
<td>LinguaCustodia</td>
<td>0.735</td>
<td>0.834</td>
<td>0.609</td>
</tr>
<tr>
<td>VARCO-MTTSSNMT</td>
<td></td>
<td></td>
<td>0.755</td>
</tr>
<tr>
<td>BJTU-LB</td>
<td></td>
<td></td>
<td>0.751</td>
</tr>
<tr>
<td>VARCO-MTForceGen</td>
<td></td>
<td></td>
<td>0.715</td>
</tr>
</tbody>
</table>
<center><font size="-1">COMET-DA22 scores for all systems participating in the shared task, illustrating tradeoff between terminology faithfulness and translation quality. Higher is better.</font></center>
<p>Our negatively constrained translation performed rather poorly: just because we prevent the model from making one mistake, this doesn't mean it wouldn't make another one. Using ChatGPT to translate and then to perform terminology refinement produced the best translation quality/terminology tradeoff, but this is not a surprise, since it's an unconstrained system. Our terminology-aware translation system did well, losing only to ChatGPT.</p>
<p>We have a lot more details in our <a href="https://arxiv.org/abs/2310.05824" target="_blank">paper</a>, so please check it out! We worked really hard for it 🥹! You should also check out the <a href="https://wmt-terminology-task.github.io/wmt_terminology_2023.pdf" target="_blank">shared task overview paper</a>. Don't forget to cite us!</p>
<pre><code class="language-bibtex">@inproceedings{bogoychev2023terminologyaware,
      title={Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting}, 
      author={Nikolay Bogoychev and Pinzhen Chen},
      booktitle = &quot;Proceedings of the Eighth Conference on Machine Translation (WMT)&quot;,
    month = dec,
    year = &quot;2023&quot;,
    publisher = &quot;Association for Computational Linguistics&quot;,
}
</code></pre>
<p>Thanks,<br>
Nick and Pinzhen</p>
<p><font size="-4"><center>Image sources: <a href="https://google.com" target="_blank">Google</a> <a href="https://www.deviantart.com/ludoxei/art/Red-Alert-2-Yuri-Prime-879483231" target="_blank">Deviant Art</a> <a href="https://pixabay.com/photos/dictionary-words-grammar-abc-390055/" target="_blank">pixabay</a> <a href="https://www.pexels.com/photo/the-adiyogi-statue-in-coimbatore-india-13041184/" target="_blank">pexels</a>  </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[OpusCleaner and OpusTrainer: Machine translation training made easy]]></title><description><![CDATA[From ancient times, parallel text and (human) neural networks have been at the heart of translation. Let's see how we make machine translation easy and intuitive...]]></description><link>https://nbogoychev.com/opuscleaner-and-opustrainer-machine-translation-training-made-easy/</link><guid isPermaLink="false">655f9e0b369afdcac26f758d</guid><category><![CDATA[code]]></category><category><![CDATA[machine translation]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Tue, 28 Nov 2023 12:50:44 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2023/11/rosetta_stone-1.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2023/11/rosetta_stone-1.jpg" alt="OpusCleaner and OpusTrainer: Machine translation training made easy"><p>One of the big challenges that I have had to tackle as a <em>senior</em> (at least in theory) machine translation researcher is how to explain to novices in the field what makes a good machine translation system.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2023/11/confused.jpg" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<center><font size="-2">It's quite terrifying to think that I am supposed to actually know how to do this.</font></center>
<p>It's quite difficult to answer this question because there's no one single correct answer. The process is very long with lots of <em>if-then-else</em> decisions that need to be taken which makes it unnecessarily confusing to newcomers. Let's illustrate the basic process:</p>
<ol>
<li>Get parallel data<br>
...</li>
<li>Train Neural Network<br>
...</li>
<li>Profit</li>
</ol>
<h2 id="paralleldata">Parallel data</h2>
<p>Unfortunately parallel data comes in many different shapes and forms. Every publicly available corpus has its own idiosyncrasies and requires a personalised cleaning approach. To give some examples:</p>
<ul>
<li>A lot of UN corpora have a comma at the line ending, as opposed to a full stop.</li>
<li>Some corpora don't use French quotes (« ») when translating from English.</li>
<li>Some Chinese corpora come tokenized, some don't.</li>
<li>Some corpora have the translation direction reversed.</li>
<li>... Probably something else we have forgotten...</li>
</ul>
<p>In order to do parallel data preprocessing right, we need to manually open and inspect each parallel corpus, see what is wrong with it, write a small script to fix it and move to the next one...</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2023/11/whack-a-mole.jpg" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<center><font size="-2">A data engineer furiously cleaning data.</font></center>
<p><em>Tedious</em> is one word that comes to mind when describing the process, especially considering that there are always dozens of distinct data sources for each language pair. And nobody wants to do something <em>tedious</em>.</p>
<h2 id="training">Training</h2>
<p>Assuming we somehow survived the data cleaning process, we are now faced with the equally daunting task of training a neural network. Easy, right?</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2023/11/wrong.jpg" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<center><font size="-2">If it were only so simple...</font></center>
<p>Neural networks are notoriously fickle and break if you so much as sneeze on them. Not fun. The problem is that all those data sources we talked about previously will differ in quality, no matter how hard we try to clean them. And it just so happens that neural networks like to see really simple data when they start training, and move on to more challenging data (or perhaps even noisy web data) later in the training process. We may also want to:</p>
<ul>
<li>Incorporate backtranslation pretraining, before we start training on the whole data.</li>
<li>Perform domain adaptation using in-domain corpora towards the end of the training.</li>
<li>Balance the mix of web and human curated data as the former exceeds the latter by a factor of 10 at least.</li>
</ul>
<p>All of those issues require stopping-and-starting the training process and swapping around the training data sources; merging different data sources together, with different sample ratios (and if you get those wrong, then you have to re-sample, re-balance and redo everything). What's the word? <em>Tedium</em>.</p>
<h2 id="humans">Humans</h2>
<p>Finally, we built our nice little machine translation system, and it is time for it to face the ultimate challenge: <strong>End users</strong>. You build your lovely machine translation system and you give it to your user and what do they do with it? THEY TRY TO TRANSLATE ALL CAPS TEXT. Why is this a problem? Well, we don't have that much ALL CAPS text in our training data. Our neural network wouldn't know how to process it, so it will produce crap.</p>
<p>Another common use case is translating text that contains untranslatable characters. Those can be emoji 😉, or <a href="https://en.wikipedia.org/wiki/Quran" target="_blank">Wikipedia articles:</a></p>
<blockquote>
<p>The Quran (/kʊrˈɑːn/, kuurr-AHN;[i] vocalized Arabic: اَلْقُرْآنُ‎, Quranic Arabic: ٱلۡقُرۡءَانُ‎ al-Qurʾān [alqurˈʔaːn],</p>
</blockquote>
<center><font size="-2">The translation system's worst nightmare.</font></center>
<p>The reason why emoji or text in foreign script often breaks translation systems is that it has not been seen during training. Sentences with large amounts of foreign text are filtered away from the training data as noisy. So how are we going to get the model to reproduce them?</p>
<p>The best way to do this is to ensure our training data contains lots of examples of this sort, so that the neural network easily learns how to reproduce them. But how exactly do we do that? We can sprinkle emoji at random, but that's not really a good solution if the data is fixed and the same emoji always appear in the same sentences. The neural network will just learn to anticipate the sentences containing emoji and not really learn to properly copy them... Ideally we want every iteration of our data to have some sentences that include emoji, but <em>different ones every time</em>. If we use a static data source, we need to replicate it many times just so we can have our neural network see different sets of noise at each iteration... <em>Tedious</em></p>
<h2 id="thesolution">The solution:</h2>
<p>In order to solve all of those issues we created a set of open source tools <a href="https://github.com/hplt-project/OpusCleaner" target="_blank">OpusCleaner</a> and <a href="https://github.com/hplt-project/OpusTrainer" target="_blank">OpusTrainer</a>, which aim to streamline the process, remove the <em>tedium</em> and solve the aforementioned (and many many other) issues.</p>
<h3 id="opuscleaner">OpusCleaner</h3>
<p>OpusCleaner is a streamlined data-processing and visualisation toolkit that performs all the data cleaning tasks through a visual interface, minimising the number of clicks the user has to perform. First, we provide a one-stop-one-click data downloader so we can easily fetch all training data for a given language pair.</p>
<img align="middle" width="90%" src="https://nbogoychev.com/content/images/2023/11/dataset_search.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>Then, for each dataset, we can visualise a random sample from it and perform drag-and-drop chaining of various different filters that will fix any issues we notice. Such as wrong language used:</p>
<img align="middle" width="90%" src="https://nbogoychev.com/content/images/2023/11/data_cleaning_screen.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>We can Apply <strong>fastText langID</strong> and voilà, suddenly a lot of lines from our sample are filtered out:</p>
<img align="middle" width="90%" src="https://nbogoychev.com/content/images/2023/11/post_fasttext.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>Presumably the ones that needed filtering, I hope. We can chain many different filters (on the right-hand side) and see how the sample changes after applying each of them:</p>
<img align="middle" width="90%" src="https://nbogoychev.com/content/images/2023/11/filter_view.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>Finally, you can assign a human-readable label to each dataset, and apply the filtering pipeline you have just defined to the full dataset:</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/data_tayloring.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>Tadaaa! OpusCleaner is designed to save time and to turn data exploration and visualisation into an implicit step of the data cleaning process. This teaches new users important practical data skills and saves everyone tons of time!</p>
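<p>Under the hood, a filter like the langID one is conceptually just a line-by-line predicate over the parallel corpus. Here is a minimal sketch of that idea, assuming the <code>fasttext</code> Python package and a downloaded language-identification model such as <code>lid.176.bin</code>; OpusCleaner's actual filters are standalone scripts chained through the UI:</p>
<pre><code class="language-python">import fasttext

model = fasttext.load_model('lid.176.bin')

def keep(src_line, trg_line, src_lang='de', trg_lang='en'):
    # Keep the pair only if both sides are detected as the expected language.
    src_label = model.predict(src_line.strip())[0][0]
    trg_label = model.predict(trg_line.strip())[0][0]
    return src_label == f'__label__{src_lang}' and trg_label == f'__label__{trg_lang}'
</code></pre>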
<h3 id="opustrainer">OpusTrainer</h3>
<p>OpusTrainer is a training set generator/augmenter. It is designed to provide a fast and easy way to define and produce different data mixes and augmentation using a <code>yaml</code> configuration file. We can define a rigorous training schedule, including pretraining on backtranslation, data mix ratios, and then have OpusTrainer generate that data and feed it to <em>stdin</em> of a neural network toolkit, or alternatively write it to a file:</p>
<pre><code class="language-yml">datasets:
  bt: bt.gz # 12.4 GB
  clean: clean.gz # 2.4 GB
  medium: medium.gz # 1.8 GB
  dirty: dirty.gz # 33 GB

stages:
  - start
  - mid
  - end

start:
  - bt 0.9
  - clean 0.1
  - medium 0
  - dirty 0
  - until bt 1 # Until 1 epoch of bt

mid:
  - clean 0.45
  - medium 0.25
  - bt 0.1
  - dirty 0.2
  - until clean 1

end:
  - clean 0.25
  - medium 0.25
  - bt 0.1
  - dirty 0.4
  - until dirty inf

seed: 1111
</code></pre>
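<p>Conceptually, the mixing ratios in each stage simply mean that every training line is drawn from one of the datasets with the configured probability. Below is a toy sketch of that sampling loop; it is not the actual OpusTrainer code, which also handles shuffling, epochs, restarts and the <code>until</code> condition that ends a stage:</p>
<pre><code class="language-python">import gzip
import random

def sample_stage(ratios, seed=1111):
    # ratios, e.g.: {'clean': 0.45, 'medium': 0.25, 'bt': 0.1, 'dirty': 0.2}
    rng = random.Random(seed)
    handles = {name: gzip.open(f'{name}.gz', 'rt') for name in ratios}
    names = list(ratios)
    weights = [ratios[name] for name in names]
    while True:
        line = handles[rng.choices(names, weights=weights)[0]].readline()
        if line:
            yield line.rstrip('\n')
        # A real trainer would reshuffle and reopen exhausted datasets here.
</code></pre>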
<p>More importantly, we can define <em>data modifiers</em> that augment the training set on the fly with UpperCase/TitleCase text, typos, Unicode noise (such as emoji/Greek/Chinese/any-random-script text), and more. And the best part is that we only need to write a few lines of yaml to achieve this:</p>
<pre><code class="language-yml">modifiers:
  - UpperCase: 0.05 # Apply uppercase randomly to 5% of sentences. See below
  - TitleCase: 0.05
  - Typos: 0.05
  - Tags: 0.005 # This introduces emoji/foreign text tokens
    augment: 1
  - Noise: 0.0005 # This introduces foreign text full sentences
    min_word_length: 2 # Minimum length of each fake word, default is 2 if left blank
    max_word_length: 5 # Maximum length of each fake word, default is 5 if left blank
    max_words: 4       # Maximum number of fake words, default is 4 if left blank
</code></pre>
<p>An example French-English sentence pair from our data augmenter looks like this:</p>
<blockquote>
<p>On a connu 🙁 😬 😭 la suite ! ↔ We know 🙁 😬 😭 the rest!</p>
</blockquote>
<p>It looks a bit silly, but it gives our neural network an important signal that when it sees something silly, it should just copy it to the output and not think about it too hard :D.</p>
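<p>For illustration, a toy version of this kind of on-the-fly augmentation might look like the sketch below. The actual Tags/Noise modifiers in OpusTrainer are more general; this only shows the idea of injecting the same random noise into both sides, differently on every pass over the data:</p>
<pre><code class="language-python">import random

EMOJI = ['🙁', '😬', '😭', '😉', '🕌']

def insert_at(sentence, noise, position):
    words = sentence.split()
    return ' '.join(words[:position] + [noise] + words[position:])

def augment_pair(src, trg, probability=0.005):
    # With a small probability, inject identical emoji into both sides so the
    # model learns to copy unknown symbols to the output.
    if random.random() &lt; probability:
        noise = ' '.join(random.sample(EMOJI, k=3))
        src = insert_at(src, noise, random.randint(0, len(src.split())))
        trg = insert_at(trg, noise, random.randint(0, len(trg.split())))
    return src, trg
</code></pre>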
<ul>
<li>OpusTrainer augmentation allows us to get up to <strong>92%</strong> accuracy on copying noisy foreign text in our translation systems, up from just <strong>55%</strong> in the baseline, as described in my <a href="https://mtm23.cs.ut.ee/wp-content/uploads/2023/09/Nikolay_Bogoychev_Robustness.pdf" target="_blank">talk</a> on Robust Machine Translation, at the <a href="https://mtm23.cs.ut.ee/index.php/programme/" target="_blank">2023 MT Marathon</a>.</li>
</ul>
<p>A good example is attempting to translate the first sentence of the French <a href="https://fr.wikipedia.org/wiki/Coran" target="_blank">Wikipedia</a> article about the Quran. The sentence:</p>
<blockquote>
<p>Le Coran (en arabe : القُرْآن, al-Qurʾān?, « la récitation ») est le texte sacré de 🕌 l'islam.</p>
</blockquote>
<p>receives a somewhat lackluster translation due to the model's inability to cope with out-of-vocabulary characters.</p>
<blockquote>
<p>The Qur'an (in Arabic: ااااااااااااااااااااااااااااااااااا, al-Qurاān?, &quot;recitation&quot;) is the sacred text of u Islam.</p>
</blockquote>
<p>But after applying OpusTrainer's UTF-8 noise augmentation we get a significant improvement:</p>
<blockquote>
<p>The Qur’an (Arabic: القُرْآن, al-Qurēn?, “recitation”) is the sacred text of 🕌 Islam.</p>
</blockquote>
<ul>
<li>OpusTrainer augmentation allows for producing high-quality terminology-aware systems, such as the one described in <a href="https://arxiv.org/abs/2310.05824" target="_blank">Bogoychev and Chen (2023)</a>. Terminology-aware systems allow us to force certain words to be translated in a particular way, overriding what the system thought would be best. For example, translating this German sentence into English yields a reasonable translation:</li>
</ul>
<blockquote>
<p>Was ist Ihr Herkunftsland?<br>
What is your country of origin?</p>
</blockquote>
<p>However, using a terminology aware system, we can apply a terminology constraint and force the word <strong>homeland</strong> to appear in the translation.</p>
<blockquote>
<p>Was ist Ihr Herkunftsland __target__ homeland __done__?<br>
What is your homeland?</p>
</blockquote>
<p>A training configuration example for a terminology-aware system is available on <a href="https://github.com/hplt-project/OpusTrainer/blob/main/contrib/test_enzh_config.yml" target="_blank">GitHub</a>.</p>
<!--kg-card-end: markdown--><hr><!--kg-card-begin: markdown--><p>This project has been the result of a large collaboration by the consortium of the <a href="https://hplt-project.org/" target="_blank">HPLT</a> project. Our goal is to make it easier for anyone to build high-quality machine translation systems by creating a robust and mature data cleaner and data scheduler. Come and try it out! For questions specific to either <a href="https://github.com/hplt-project/OpusCleaner" target="_blank">OpusCleaner</a> or <a href="https://github.com/hplt-project/OpusTrainer" target="_blank">OpusTrainer</a>, open a GitHub issue!</p>
<p>For more details, please refer to the <a href="https://arxiv.org/abs/2311.14838" target="_blank">paper</a>. Also, don't forget to cite us:</p>
<pre><code class="language-bibtex">@misc{bogoychev2023opuscleaner,
      title={OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models}, 
      author={Nikolay Bogoychev and Jelmer van der Linde and Graeme Nail and Barry Haddow and Jaume Zaragoza-Bernabeu and Gema Ramírez-Sánchez and Lukas Weymann and Tudor Nicolae Mateiu and Jindřich Helcl and Mikko Aulamo},
      year={2023},
      eprint={2311.14838},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
</code></pre>
<p>Thanks,<br>
Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/photos/rosetta-stone-egypt-hieroglyphs-8298040/" target="_blank">pixabay</a> <a href="https://www.pexels.com/photo/man-wearing-black-and-white-stripe-shirt-looking-at-white-printer-papers-on-the-wall-212286/" target="_blank">pexels</a> <a href=" https://www.flickr.com/photos/tpapi/2765541278/" target="_blank">flickr</a> <a href="https://pix4free.org/photo/16605/wrong.html" target="_blank">pix4free</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 血 and 皿]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Thanks to me unexpectedly* contracting Covid on my trip abroad and having my holiday reduced to a confinement in a very expensive hotel on the equator, <em>Hanzi of the day</em> is back with a new edition! Today's topic is a bit biblical in nature as it includes <em>chalices</em> filled with</p>]]></description><link>https://nbogoychev.com/hanzi-of-the-day-9/</link><guid isPermaLink="false">5c98c956574e0507ac3c4cf4</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Tue, 22 Nov 2022 11:35:51 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2022/11/cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2022/11/cover.jpg" alt="Hanzi of the day! 血 and 皿"><p>Thanks to me unexpectedly* contracting Covid on my trip abroad and having my holiday reduced to a confinement in a very expensive hotel on the equator, <em>Hanzi of the day</em> is back with a new edition! Today's topic is a bit biblical in nature as it includes <em>chalices</em> filled with <em>blood</em>.</p>
<p>
<font size="-1">* Not really unexpectedly, as my astronomically bad luck has recently raised a dispute among senior statistitians regarding the notion of <i>Independence</i> in probability theory.</font></p>
<img align="middle" width="20%" src="https://nbogoychev.com/content/images/2022/11/blood_chalice.png" alt="Hanzi of the day! 血 and 皿">
<center><font size="-2">“This cup is the new covenant in my blood, which is poured out for you” (Luke 22:20)"</font></center>
<p>Today's character <strong>blood</strong> <a href="https://en.wiktionary.org/wiki/血" target="_blank">血</a> <em>xuè</em> is actually just a <strong>chalice</strong> <a href="https://en.wiktionary.org/wiki/皿" target="_blank">皿</a> <em>mǐn</em> with a single drop of animal blood in it. Originally the logogram 血 depicted a bronze container with animal blood used in ritual sacrifice. The two characters have never lost their connection through the centuries and have evolved in parallel:</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/11/evolution.png" alt="Hanzi of the day! 血 and 皿">
<center><font size="-2">I wish I could say no animals were harmed in the making of these characters, but the leftmost script is <a href="https://en.wikipedia.org/wiki/Oracle_bone_script" target="_blank">oracle bone script</a>, and no, it's not called like that because of oracles using their own bones.</font></center>
<p>The concept of animal sacrifice is pretty much universal across cultures. Evidence of it exists in:</p>
<ul>
<li>Prehistoric <a href="https://en.wikipedia.org/wiki/Animal_sacrifice#Prehistory" target="_blank">Ancient Egypt</a>.</li>
<li>It features prominently in all of the <a href="https://en.wikipedia.org/wiki/Animal_sacrifice#Abrahamic_traditions" target="_blank">Abrahamic religions</a>.</li>
<li>It's common across the vast majority of pagan religions (For example the <a href="https://en.wikipedia.org/wiki/Animal_sacrifice#Celtic_peoples" target="_blank">Celtic people</a>).</li>
<li>And of course, it appears in <a href="https://en.wikipedia.org/wiki/Animal_sacrifice#Han_Chinese" target="_blank">Ancient China</a>, where the value of each animal sacrifice was formalised in a strict hierarchical structure. Thanks, Confucius, really not sure what we would've done without it.</li>
</ul>
<p>Now, the best part of it all is that nowadays we do have a mass produced 3D version of the 血 character, which conveniently serves as a mnemonic for Chinese learners across the globe:</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/11/diva_cup.jpg" alt="Hanzi of the day! 血 and 皿">
<center><font size="-2">That's what those are for, right?</font></center>
<p>Don't catch Covid on your holidays... Or at all.</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://www.pexels.com/photo/woman-in-gray-dress-holding-drinking-glass-5554406/" target="_blank">pexels</a> <a href=" https://pixabay.com/illustrations/chalice-goblet-grail-cup-wine-4862960/" target="_blank">pixabay</a> <a href="https://en.wiktionary.org/wiki/%E7%9A%BF" target="_blank">wiktionary</a> <a href="https://en.wiktionary.org/wiki/%E8%A1%80" target="_blank">wiktionary</a> <a href="https://www.pexels.com/photo/a-hand-holding-a-menstrual-cup-7692105/" target="_blank">pexels</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Concussions suck: My story]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>15 weeks ago I went for a small ski holiday in my home country that replaced my boring workaholic life with a vivid fever dream. This is my concussion story.</p>
<h2 id="thepledge">The pledge</h2>
<p>I went for a ski holiday in Bulgaria with a few buddies back in the beginning of March.</p>]]></description><link>https://nbogoychev.com/concussions-suck-my-story/</link><guid isPermaLink="false">62ae021633c67f066d8b12b4</guid><category><![CDATA[life]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Mon, 20 Jun 2022 13:56:16 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2022/06/brain_small.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2022/06/brain_small.jpg" alt="Concussions suck: My story"><p>15 weeks ago I went for a small ski holiday in my home country that replaced my boring workaholic life with a vivid fever dream. This is my concussion story.</p>
<h2 id="thepledge">The pledge</h2>
<p>I went for a ski holiday in Bulgaria with a few buddies back in the beginning of March. It was a long weekend which meant that the ski slope was overly saturated with eager skiers and not particularly well maintained.</p>
<p>To avoid overcrowding, we took to the slopes early in the morning and had an hour or so of quite nice, relaxing skiing. Inevitably though, the clock hit 10:30 AM and, after arriving at the lifts with my friend, we both groaned as we saw the massive queue. We were about 20 meters away from the queue and slowing down. I turned to my friend in exasperation, got distracted and lost control of my skis: my left ski ran into a snow pile, while my right ski continued on, which resulted in me turning around 180 degrees.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/ski-injury.com_phantom_foot_photo.gif" alt="Concussions suck: My story">
<center><font size="-2">Oh, no, no, no...</font></center>
<p>And then disaster struck: both of my skis dug into the ground, which meant that I came to an abrupt stop, but my body was still carrying forward momentum. That momentum had to go somewhere, which meant that my skis lifted up a bit and then got wedged quite firmly in the ground. At that point, the momentum of my body was firmly pointing downwards, which meant that I was slammed, back first, into the ground, whiplashing my head into the snow.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/beginners.jpg" alt="Concussions suck: My story">
<center><font size="-2">Moments before disaster struck.</font></center>
<p>Ouch. I looked at the sky briefly and I thanked my past self for having the foresight to wear a helmet. I immediately developed a slight headache and my mouth started salivating a bit. <em>A micro concussion</em> I thought, as I slowly stood up. My friend came to a stop and asked me the typical concussion questions, such as, do I know who I am, where I am, etc. Since my memory was intact, and I felt clear-headed, I continued skiing for the rest of the day, albeit taking it easy.</p>
<p>We finished the day at about 3 PM, I went back to the hotel, took a painkiller for my back (unrelated injury), and went about my day. Had dinner, then went to the karaoke bar. The headache disappeared and I was feeling quite good; I even sang <em>Let it Go!</em> and <em>Dancing Queen</em>. I was doing great and I thought the accident would just be a bad memory in the morning.</p>
<h2 id="theturn">The turn</h2>
<p>Late at night, I went back to my room, pulled my laptop out to do some work, just to reinforce the stereotype that academics never go on holiday and... Oh my god, I just <strong>couldn't look at the screen</strong>. It's like looking at the computer screen gave me a strong buzzing in my head and I started going dizzy. I closed the laptop and went to bed, now slightly worried. <em>Ok, I definitely have a concussion</em>. Oh boy, little did I know...</p>
<p>I woke up next morning, and I took my phone to look at the time. The buzzing in my head returned as soon as I looked at the small screen. <em>I guess I am not skiing today</em>. A friend of mine drove me home. My condition deteriorated noticeably in the next two days, just as steadily as my terror grew.</p>
<h4 id="24hoursafterimpact">24 hours after impact</h4>
<ul>
<li>Light started bothering me, like, a lot. I had to be in a dark room all the time.</li>
<li>Things were... Blurry. Fast moving objects, especially at night, looked a bit weird.</li>
</ul>
<h4 id="48hoursafterimpact">48 hours after impact</h4>
<ul>
<li>Reading became difficult and quite tiring.</li>
<li>The left part of my face went numb.</li>
<li>Taking walks made me dizzy.</li>
<li>Talking about anything complicated made me dizzy.</li>
<li>I needed a nap after literally any simple task, such as chopping vegetables for a meal.</li>
<li>I couldn't follow a conversation with more than one person.</li>
</ul>
<h4 id="72hoursafterimpact">72 hours after impact</h4>
<ul>
<li>I noticed that my world had started shaking. With every heartbeat, my whole field of view would <em>jump</em>, just a little, but quite noticeably.</li>
<li>Closing my eyes didn't stop the flickering. It persisted until I fell asleep.</li>
</ul>
<p>At this point I was seriously panicking. I couldn't do any work, hell, I couldn't do almost anything. Essentially anything that moved made me dizzy. I couldn't look at people's faces when they spoke to me, as it made me dizzy. At this point, I went to see the doctor, I told her about my symptoms and she did the finger test. <strong>I could not follow the fuckin finger</strong>. I asked my doctor <em>Aren't you moving the finger a bit too fast?</em> No, she said, with a slight tremble in her voice. <em>Go and see a neurologist, now!</em></p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/examination_small.jpg" alt="Concussions suck: My story">
<center><font size="-2">I never knew how hard it is to pass this examination.</font></center>
<p>My father helped me book a neurologist appointment for the next morning and I went to sleep at about 9 PM.</p>
<p><em>I turn and toss in my dreams, having violent nightmares. I wake up sweating and think that maybe the whole thing was just a dream. I look at my phone to check the time and the buzzing in my head returns. No, this fever dream of mine is real. I fall asleep again...</em></p>
<h2 id="theprestige">The prestige</h2>
<p>The next morning, I headed for my first ever CT scan. Admittedly, all the other people there looked in a much worse shape than me and surely my injury wasn't that bad. The doctor says I'm lucky and that there's no bleeding, or brain swelling, it's just a &quot;minor concussion&quot;. Only later did I find out that they refer to any injury in which your brain is not sticking out of your skull as a &quot;minor concussion&quot;.</p>
<p>Afterwards, I went to see a neurologist, who somewhat dismissively said it's <em>just a concussion</em> and prescribed me generic over-the-counter brain supplements and a week of intravenous <a href="https://en.wikipedia.org/wiki/Mannitol" target="_blank">Mannitol</a>.</p>
<p>Then I went to see an ophthalmologist for my visual symptoms. She formally diagnosed me with <a href="https://en.wikipedia.org/wiki/Nystagmus" target="_blank">Horizontal Nystagmus</a>, which is the inability of the eyes to focus on objects. This is essentially what prevents me from reading, or doing pretty much anything. She prescribed me a month's worth of travel sickness pills and told me to rest well and stay in a dark room until I recover. I was banned from any form of exercise that moves my head or raises my blood pressure, which basically left the stationary bike as the only choice.</p>
<p>Well done me. Took a 3 day holiday once a year and gave myself a concussion.</p>
<h1 id="aftermath">Aftermath</h1>
<p>At that point, I was somewhat relieved. After all, no brain swelling, no brain bleeding. I knew that most concussions resolve by themselves within two weeks, and after all at that point, I had no reason to think it would go any longer than that. Ah, such good are the times filled with hope.</p>
<p>I discovered several coping mechanisms that made my daily life easier.</p>
<ul>
<li>Take off my glasses. This myopia-enabled lifehack made sure that I walked around in a sea of blurriness and gave my brain a rest from any overly acute vision. This helped a ton in preventing dizziness.</li>
<li>Eliminate blue light from all my devices. Less pretty, but much easier on my vision.</li>
<li>If going for a walk with someone, close my eyes and ask them to lead me like a blind person. Much easier for me to go out that way rather than on my own.</li>
<li>Do not use any lights after dark, and stay in a dark room.</li>
</ul>
<h2 id="thedullness">The dullness</h2>
<p>Oh my god, I never realised how much boredom a person can feel. Things that I could do during the day were limited to:</p>
<ul>
<li>A short walk, accompanied by someone.</li>
<li>A phone call or two with a friend, but not too long.</li>
<li>A conversation with one person, but mostly in a dark room.</li>
<li>Cooking, albeit with naps in between steps.</li>
<li>An audiobook, albeit not more than 20 minutes or so.</li>
</ul>
<p>Everything left me utterly exhausted. Small tasks such as translating my 4-line sick note into English and sending it to HR required an hour-long nap afterwards to recover.</p>
<h2 id="thefriendsandfamily">The friends and family</h2>
<p>I was grateful to be surrounded by so much love from everywhere. As I was unable to read my messages, everyone promptly switched to sending me voice messages. People made sure to check on me regularly and to try to keep me entertained with stories from their lives. Some did medical research for me, others volunteered their own experience with concussions. Some visited me at my home and sat with me in darkness, while others accompanied me as a chaperone to important events that I could not put off, or to doctor's visits.</p>
<p>My supervisor went to HR and did all the annoying bureaucracy navigation so that I was promptly put on a sick leave and not bothered by anything. My PhD advisor came to see me. My coworkers and collaborators promptly took away all my duties so that I could focus on my recovery.</p>
<p>My family was available at all times to chat to me, or to drive me to my treatment, as necessary.</p>
<p>And finally, my online gaming buddies noticed my abrupt departure.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/itsgettinglate.jpg" alt="Concussions suck: My story">
<center><font size="-2">This scene does to this day give me the chills and brings tears into my eyes.</font></center>
<p>A few of them inquired about me and managed to obtain my phone number, so they could ask about my recovery. Whoever says that gamers are shallow and do not care about anything is so utterly wrong.</p>
<h2 id="thelongtwistedanduncertainroadtorecovery">The long, twisted and uncertain road to recovery</h2>
<p>In the first week or so, things got a bit better, most notably I was able to listen to audiobooks for hours at a time. I was convinced that I would be able to return to work after the two week period, but that turned out to not be the case.</p>
<h4 id="erroranderror">Error and error</h4>
<p>In the second week my recovery slowed down. I would wake up and attempt to do something and inevitably suffer for it.</p>
<p>The left side of my face was numb all the time, and it got worse if I strained myself, almost like a litmus test for what I was supposed to do and not do. Gradually I mapped for myself lists of activities that made things worse.</p>
<p>Rustling leaves, water reflections, busy streets or the sight of boiling water all made me incredibly dizzy. My eyes could not cope with multiple moving objects at once. I walked almost everywhere without my glasses.</p>
<p>Sudden head movements, such as tossing in my sleep, skipping a step on the stairs, or riding a bus that takes a sharp turn, resulted in days of headache and a noticeable worsening of my nystagmus.</p>
<p>Any amount of screen time felt bad. The small phone screen felt a lot more comfortable, but reading text with lines wider than 3 cm was impossible.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/cant_win.jpg" alt="Concussions suck: My story">
<center><font size="-2">Pretty much the only thing that felt good was lie in bed, in utter darkness and silence. Solitary confinement on the other hand is not exactly good for anyone's mental health.</font></center>
<p>This led me to scour the Internet for answers and tips, hoping to speed up my recovery.</p>
<h4 id="surelyicoulddosomething">Surely I could do something</h4>
<p>I mean, modern medicine is great, they can cure almost anything. Well, it's true, but concussions are very tricky for several reasons:</p>
<ol>
<li>It's difficult to study them. One could easily find unvaccinated volunteers to <a href="https://www.bbc.co.uk/news/health-60229388" target="_blank">infect with covid</a> in the name of science, but I can hardly imagine a bunch of people volunteering to be given concussions in a predictable and controlled environment.</li>
<li>Every person has slightly different brain chemistry, susceptibility, and reaction to head injury. Recovery timelines and treatments vary a lot from person to person and seem to be only loosely correlated with the severity of the blow.</li>
<li>80% of concussion cases in young, healthy individuals take <a href="https://www.ncbi.nlm.nih.gov/books/NBK185342/#:~:text=The%20committee%20offers%20the%20following,%2C%20months%2C%20or%20even%20years" target="_blank">under 2 weeks to recover</a>, leaving the other 20% not only unlucky, but also looking for help from doctors who aren't sure what to make of their symptoms. <em>It should have gone away by now, do you just want to try some more paracetamol?</em></li>
<li>Most concussion research, specialists and treatment centres are located in the USA. No idea why.</li>
</ol>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/football.jpg" alt="Concussions suck: My story">
<center><font size="-2">Not because of American Football.</font> <font size="-3"><i>Jared Wickerham/Getty Images</i></font></center>
<p>Surprisingly, the knowledge that most people recover a lot faster than me didn't make me feel better. Go figure.</p>
<p>Next, I decided to go on reddit and read about people's experience with concussion recovery, and that was what I would call an exercise in depression. The only people who would go on reddit to discuss their concussion are the ones who do not recover. Essentially I ended up obsessively reading about others' suffering and hopelessness, while worsening my symptoms. Not fun.</p>
<p>My doctor kept saying to just wait and wait until I felt better. No timeline for when that &quot;better&quot; was going to happen.</p>
<h4 id="thingsthathelp">Things that help</h4>
<p>Just as I stumbled on things that don't help, I also stumbled upon things that do help... Me. They might not help other people as concussions are very individual, but this is what helps me.</p>
<ul>
<li>Eliminate blue light from all your screens. It makes every single light-emitting device infinitely easier to use.</li>
<li>High refresh rate screens. I discovered this randomly by visiting a gaming buddy and noticing how his screens didn't bother me nearly as much as mine did; it turned out they were high refresh rate. I replaced my phone, laptop and desktop monitors and this allowed for significantly more screen time, although it still bothered me.</li>
<li>Eliminate screens altogether, as difficult as that might be. I seldom managed.</li>
<li>Eliminate stress. Easier said than done, but the first time when my face numbness abated was when I booked a sensory deprivation chamber and managed to let go of my thoughts.</li>
</ul>
<p><em>I stay up at night, wondering if I would ever be whole again. I fall asleep in the small hours of the day and wake up in panic. The left side of my face is ever so numb, something always feels wrong.</em></p>
<h2 id="thefollowingmonths">The following months</h2>
<p>The worst thing about concussion recovery is how non-linear it is. Some days are fine, some days are not. Some days, I can work for a few hours with constant, although not-getting-worse headache. Other days, I can not even look at a screen for 5 minutes.</p>
<p>I started getting gently into exercise and that was fine until it suddenly wasn't. It's very hard to know when you have overdone something, as you don't feel the effects immediately but in the next few hours, and you end up spending a week in bed. You panic and look for treatments online, but they all seem to be US based, have 3-4 month waiting times and come with a hefty price tag. You wonder whether it is worth booking one now, or whether you would get well on your own.</p>
<p>Your life essentially goes on a pause. Professional, personal, everything. It just stops. Your work output is close to zero and your social interactions are limited as you need to avoid crowds and noisy places.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/pause.jpg" alt="Concussions suck: My story">
<center><font size="-2">Except you don't know if there is a resume button or you have just <b>broken</b> everything.</font></center>
<h2 id="drugs">Drugs</h2>
<p>As a final bit, I will provide a list of supplements that I have been taking, hoping some of them are helping to improve my concussion. <strong>This is all research done by me, and not recommended by a doctor. I DO NOT RECOMMEND that you take any of those if you have a concussion, do so at your own risk. Consult a doctor, preferably a concussion specialist.</strong></p>
<ul>
<li>Turmeric &amp; Boswellia Serrata -&gt; Anti inflammatory, no brainer.</li>
<li>Creatine and Taurine -&gt; supposedly, they are necessary for neuron building</li>
<li>Omega 3, alpha-lipoic-acid -&gt; Generic brain supplements.</li>
<li>NAC (N-Acetyl Cysteine) -&gt; Supposedly has neuro protective effect, especially if taken right after concussion. Extensive <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3989181/" target="_blank">research</a> is done, with some human trials, but still very early stage.</li>
<li>Magnesium L-Threonate -&gt; An alternative form of magnesium, that passes the blood brain barrier. Helps some people with cluster headaches and is allegedly good for your brain.</li>
</ul>
<h2 id="thecurrentstate">The current state</h2>
<p>As of today, I'm entering week 15 of my concussion. I can mostly do non demanding computer work (hence this blog post), but coding is hit-or-miss. I struggle a lot with reading A4 length text or side-to-side text on wide screens, but it is getting better. My world flickers a bit every once in a while, especially if I overexert myself by reading, but it is also getting better.</p>
<p>I cannot watch any videos or TV series that have abrupt changes in the colour scheme (e.g., trailers jumping from one scene to another). Some sports, such as snooker or football, are OK, as they maintain a mostly static colour scheme.</p>
<p>Airports and noisy restaurants make me dizzy. I am the weird guy who goes to play pool in the bar with noise-cancelling headphones. Going on walks seems to be mostly fine, but on some days it's not.</p>
<p>I never realised before how much I love what I do. I am full of ideas, plans and projects that I will work on when I get better. I keep a list of cool coding projects to do, movies to watch, books to read and sports to play when I get better... And this is where the fear kicks in.</p>
<p>I am afraid to do almost anything. If I decide to jump, would I spend the next week in bed? Can I drive? Can I go swimming? Can I travel? Can I see a movie? I keep falling behind on work and missing opportunities. Will I ever go back to work? Will I have to worry about financial security? Will I ever play a video game again?</p>
<p>I <strong>hate</strong> living in fear all the time... But I am alive. Treatments for people that do not get better on their own exist and I can afford them should I need them. Concussions are a horrible thing to happen to anyone, but mine is not a severe case and I <strong>am</strong> getting better, albeit slowly. I have friends and family that support me and I am surrounded by care and affection.</p>
<p><b><a href="https://nbogoychev.com/concussions-still-suck-2-year-update/" target="_blank">Two years later update</a></b> <b><a href="https://nbogoychev.com/will-concussions-ever-stop-sucking/" target="_blank">Three years later update</a></b></p>
<p><font size="-4"><center>Image sources: <a href="https://www.istockphoto.com/photo/mri-brain-with-headache-gm938046810-256531635" target="_blank">imageStock</a> <a href="https://www.e4s.co.uk/docs/top-skiing-tips.htm" target="_blank">e4s</a> <a href="https://tangotribe.com/learning-how-to-fall/ski-injury-com_phantom_foot_photo/" target="_blank">tangotribe</a> <a href="https://www.istockphoto.com/photo/doctor-diagnosing-injured-woman-gm512217774-87057247" target="_blank">imageStock</a> <a href="https://www.reddit.com/r/gaming/comments/odwetf/see_you_soon/" target="_blank">reddit</a> <a href="https://pixabay.com/photos/right-left-a-notice-traffic-signs-2620946/" target="_blank">pixabay</a> <a href="https://www.npr.org/sections/health-shots/2013/01/31/170764982/are-nfl-football-hits-getting-harder-and-more-dangerous?t=1655653220169" target="_blank">npr</a> <a href="https://www.flickr.com/photos/armydre2008/3949049470" target="_blank">flickr</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Efficient machine translation]]></title><description><![CDATA[When your hardware resources are deficient, you have no choice but to go efficient! In this tutorial we will learn how to produce efficient machine translation models, with Marian as our NMT engine of choice, but the methods can easily be transferred to other toolkits.]]></description><link>https://nbogoychev.com/efficient-machine-translation/</link><guid isPermaLink="false">6194d760552d66065a7def43</guid><category><![CDATA[code]]></category><category><![CDATA[tutorial]]></category><category><![CDATA[machine translation]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 21 Nov 2021 22:39:43 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2021/11/MT-Marathon.svg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><aside class="toc"></aside>
<!-- here's where it's at: https://ghost.org/docs/tutorials/adding-a-table-of-contents/ --><!--kg-card-end: html--><!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2021/11/MT-Marathon.svg" alt="Efficient machine translation"><p>The tutorial ran on 22.11.2021. A recording of the live session is available on <a href="https://youtu.be/mrC3YOW-NQs" target="_blank">Youtube</a>. Also, if you have any questions, don't hesitate to email me.</p>
<h1 id="prerequisites">Prerequisites</h1>
<p>Download and install <code>sacrebleu</code>:</p>
<pre><code class="language-bash">pip3 install sacrebleu
</code></pre>
<h2 id="gettingmarian">Getting marian</h2>
<p>Download and install <a href="https://marian-nmt.github.io" target="_blank">Marian</a>:</p>
<pre><code class="language-bash">git clone https://github.com/marian-nmt/marian-dev.git
cd marian-dev
mkdir build
cd build
cmake .. -DUSE_FBGEMM=ON -DCOMPILE_CUDA=OFF -DCOMPILE_CPU=ON
make -j4
</code></pre>
<p>Note that marian requires <code>intel-mkl</code> for CPU decoding and <code>CUDA</code> for GPU decoding. Please make sure that you have <code>MKL</code> installed on your local machine before compiling. The package name could differ between distros. For Ubuntu 20.04 please install <a href="https://packages.ubuntu.com/focal/intel-mkl" target="_blank">this</a>. Alternatively, you can do this to install it directly on Ubuntu 16.04 or newer:</p>
<pre><code class="language-bash">wget -qO- 'https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB' | sudo apt-key add -
sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main &gt; /etc/apt/sources.list.d/intel-mkl.list'
sudo apt-get update
sudo apt-get install intel-mkl-64bit-2020.0-088
</code></pre>
<p>More details on the whole Marian and MKL issue are listed on <a href="https://marian-nmt.github.io/docs/#cpu-version" target="_blank">Marian's website</a>.<br>
If you have nvidia GPU and <code>CUDA</code> installed on your local system, you can switch the <code>COMPILE_CUDA</code> flag to ON.</p>
<p>Marian should compile cleanly on Linux, WSL and Mac. For Mac users, you want to set up a dev environment and use the built in Apple accelerate framework by providing the cmake flag <code>-DUSE_APPLE_ACCELERATE=ON</code>.</p>
<h2 id="gettingthedata">Getting the data</h2>
<p>Download and extract the test models tarball.</p>
<pre><code class="language-bash">wget http://data.statmt.org/nbogoych/mt_marathon.tar.gz
tar -xvf mt_marathon.tar.gz
cd mt_marathon
</code></pre>
<p>Note that <code>data.statmt.org</code> might be slow to respond due to the number of concurrent users, so you could try instead this mirror link:</p>
<pre><code class="language-bash">wget https://nbogoychev.com/files/mt_marathon.tar.gz
</code></pre>
<p>Now your directory structure should look like this:</p>
<pre><code class="language-bash">$ tree
.
├── enes.student.tiny11
│   ├── basic.translation.sh
│   ├── batched.shortlisted.8bit.translation.sh
│   ├── batched.shortlisted.translation.sh
│   ├── batched.translation.sh
│   ├── config.batched.shortlisted.8bit.yml
│   ├── config.batched.shortlisted.yml
│   ├── config.batched.yml
│   ├── config.yml
│   ├── lex.s2t.bin
│   ├── lex.s2t.gz
│   ├── model.intgemm8.bin
│   ├── model.npz
│   └── vocab.esen.spm
└── enes.teacher.bigx2
    ├── basic.translation.sh
    ├── batched.translation.sh
    ├── config.batched.yml
    ├── config.yml
    ├── model1.npz
    ├── model2.npz
    └── vocab.esen.spm
</code></pre>
<h1 id="mtdecoding">MT decoding</h1>
<p>In this section we will gradually try different marian settings and models, starting from the slowest and progressing to the fastest.</p>
<h2 id="theteachermodel">The teacher model</h2>
<p>The teacher model refers to the highest quality system that you train for any translation task. This is the system trained with &quot;all bells and whistles&quot;, but unfortunately it is also quite slow. We will also talk about training it later on.</p>
<p>In our system, the teacher model is an ensemble of 2x transformer-big for English-Spanish.</p>
<p>This is a fairly big model and I do not recommend that you try to run it now during the tutorial. You should definitely run it later on, on a cluster just so that you see how much time it takes to translate. I will publish the results here for test runs on my machine. The basic configuration of the teacher model is found by looking through <code>config.yml</code> file and the <code>basic.translation.sh</code> script. I will summarize and explain it here.</p>
<pre><code class="language-bash">$cat config.yml 
relative-paths: true
models: # Selects models for ensembling
  - model1.npz
  - model2.npz
vocabs: # Selects vocabulary for each model
  - vocab.esen.spm
  - vocab.esen.spm
beam-size: 4 # beam search size
normalize: 1 # length normalisation
word-penalty: 0 # length penalty
</code></pre>
<p>And the script:</p>
<pre><code class="language-bash">cat basic.translation.sh 
#!/usr/bin/env bash

MARIAN=../../marian-dev/build # Path to your marian installation

SRC=en
TRG=es

mkdir -p basic

sacrebleu -t wmt13 -l $SRC-$TRG --echo src &gt; basic/newstest2013.$SRC # get the test set using sacrebleu


# Call marian
echo &quot;### Translating wmt13 $SRC-$TRG on CPU. Extra flags $@&quot;
$MARIAN/marian-decoder -c config.yml $@ \
    --quiet --quiet-translation --log basic/gpu.newstest2013.log \
    -i basic/newstest2013.$SRC -o basic/basic.newstest2013.$TRG

# Print the time it took for translation, and the BLEU scores
tail -n1 basic/gpu.newstest2013.log
sacrebleu -t wmt13 -l $SRC-$TRG &lt; basic/basic.newstest2013.$TRG | tee basic/basic.newstest2013.$TRG.bleu
</code></pre>
<h3 id="baselinesystem">Baseline system</h3>
<p>Run the baseline system, replacing <code>N</code> with the number of real cores your CPU has.</p>
<pre><code class="language-bash">$ cd enes.teacher.bigx2
$ ./basic.translation.sh --cpu-threads N
### Translating wmt13 en-es on CPU. Extra flags --cpu-threads 16
[2021-11-19 10:45:40] Total time: 4597.81777s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 36.5,
 ...
}
</code></pre>
<p>This script and all following scripts will output the translation time and the BLEU score on the test set. I have truncated the output in the interest of readability.</p>
<h3 id="minibatchsize">Mini batch size</h3>
<p>The problem with the baseline system is that we are only ever translating one sentence at a time, which makes our matrices tall and skinny, which is not something that modern hardware likes. Instead, we're now going to group sentences to be translated together. We do that by augmenting the configuration with the following options:</p>
<pre><code class="language-bash">mini-batch: 16 # Sentences to be translated together
maxi-batch: 100 # Look at the next 100 sentences to sort sentences with similar length
maxi-batch-sort: src # Sort by source sentence length
workspace: 4000 # Memory budget per worker.
</code></pre>
<p>You could also add those options directly to the Marian command in the script as I have done. Run it with:</p>
<pre><code class="language-bash">$ ./batched.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 11:24:43] Total time: 652.14863s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 36.5,
 ...
}
</code></pre>
<p>As you can see, specifying batching dramatically increases the translation speed.</p>
<h2 id="thestudentmodel">The student model</h2>
<p>The student model is what machine translation providers typically run in their production services. These models are highly optimised for speed, with architectures that typically replace the heavyweight attention in the decoder with something faster, such as the Simplified Simple Recurrent Unit (SSRU).</p>
<p>The student model is trained on the output of the teacher model (or models, in the case of ensembling). More details about the training and the exact specifics of the architecture will be given later. There are a couple of crucial things we need to know about decoding with student models:</p>
<ul>
<li>Beam search is unnecessary! Quality is the same regardless of the beam size.</li>
<li>No need for ensembling either.</li>
<li>The student model is tiny compared to the teacher: it has less than a tenth of the number of parameters!</li>
<li>For more information, check <a href="https://aclanthology.org/D16-1139/" target="_blank">this paper</a>.</li>
</ul>
<p>All of that allows the student to produce translations much faster, at a small cost in BLEU.</p>
<p>The config of a basic student model is similar to that of the teacher, except that the beam size is set to one. Furthermore, since the beam size is one, we don't actually need to compute the expensive softmax; we can instead just take an argmax:</p>
<pre><code class="language-bash">$ cd enes.student.tiny11/
$ cat config.yml 
relative-paths: true
models:
  - model.npz
vocabs:
  - vocab.esen.spm
  - vocab.esen.spm
beam-size: 1 # Beam size to one
normalize: 1.0
word-penalty: 0
max-length-factor: 2.0 # The target sentence shouldn't be longer than 2x the source sentence
skip-cost: true # Do not compute softmax but instead take argmax
</code></pre>
<h3 id="baseline">Baseline</h3>
<p>To run the baseline system, just do:</p>
<pre><code class="language-bash">$ cd ../enes.student.tiny11
$ ./basic.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:03:11] Total time: 84.66556s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ...
}
</code></pre>
<h3 id="minibatchsize">Mini batch size</h3>
<p>Just like in the teacher case, we can significantly improve translation time by specifying a larger mini-batch size.</p>
<pre><code class="language-bash">$ ./batched.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:03:11] Total time: 11.66556s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ...
}
</code></pre>
<h3 id="shortlisting">Shortlisting</h3>
<p>Further improvements in translation speed can be achieved by avoiding the largest multiplication at the output layer, by supplying the model with a lexical shortlist. The lexical shortlist filters the output layer so that it only contains words that are deemed likely translations of the input sentence, potentially reducing its size from 30k to about 1k entries. We will walk through the construction of a lexical shortlist later, but you can already see the speed improvement from the example script:</p>
<pre><code class="language-bash">$ ./batched.shortlisted.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:05:03] Total time: 8.41856s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ...
}
</code></pre>
<p>The only change we made is the inclusion of the shortlist parameter inside <code>config.batched.shortlisted.yml</code>:</p>
<pre><code class="language-bash">shortlist:
    - lex.s2t.bin
    - false
</code></pre>
<p>The lexical shortlist is similar to a dictionary containing frequently associated source and target words. We will look at how it's trained <a href="#producingwordalignments">later</a> in this tutorial.</p>
<h3 id="quantisation">Quantisation</h3>
<p>To further improve performance, we can perform the neural network inference in a lower-precision numerical format, such as 8-bit integers, which runs much faster on CPUs than plain old floats. To do so, we must first convert our model to the 8-bit integer format:</p>
<pre><code class="language-bash">$MARIAN/marian-conv -f model.npz -t model.intgemm8.bin -g intgemm8
</code></pre>
<p>and then decode with it:</p>
<pre><code class="language-bash">$ ./batched.shortlisted.8bit.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:05:58] Total time: 7.12242s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.0,
 ...
}

</code></pre>
<p>That's all exciting, right? Now let's dive deep down and see how we can get to those student models.</p>
<h2 id="resultssummary">Results summary</h2>
<p>Here is a summary of the results that I ran:</p>
<table>
<thead>
<tr>
<th>System, 16 threads, 3000 sentences</th>
<th>time</th>
<th>BLEU</th>
</tr>
</thead>
<tbody>
<tr>
<td>Teacher basic</td>
<td>4597s</td>
<td>36.5</td>
</tr>
<tr>
<td>Teacher batched</td>
<td>652s</td>
<td>36.5</td>
</tr>
<tr>
<td>Student basic</td>
<td>84s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched</td>
<td>11s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched shortlisted</td>
<td>8s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched quantised shortlisted</td>
<td>7.1s</td>
<td>35.0</td>
</tr>
</tbody>
</table>
<p>Unfortunately, as the systems get faster, the runtime differences between systems get muddled by the initialisation overhead. In order to better show the effect of all our options on the runtime, I will repeat all student experiments using 1 thread:</p>
<table>
<thead>
<tr>
<th>System, 1 thread, 3000 sentences</th>
<th>time</th>
<th>BLEU</th>
</tr>
</thead>
<tbody>
<tr>
<td>Student basic</td>
<td>189s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched</td>
<td>38s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched shortlisted</td>
<td>27s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched quantised shortlisted</td>
<td>21s</td>
<td>35.0</td>
</tr>
</tbody>
</table>
<p>We get a huge gain when going from unbatched to batched translation, regardless of the model size. We gain about 30% in efficiency from shortlisting and then another 23% from quantising to 8-bit integers. We do sustain some drop in BLEU along the way, though.</p>
<h1 id="howtotrainyourownefficientmodel">How to train your own efficient model</h1>
<p>In this part of the tutorial we will go through the steps necessary for preparing an efficient machine translation system. Due to time and resource constraints (AKA training taking for-fuckin-ever), we will not actually train any models live; instead we will walk through the recipes and configurations you need.</p>
<h2 id="trainingagoodteacher">Training a good teacher</h2>
<p>A good student cannot possibly hope to learn without the help of a marvelous teacher.</p>
<h3 id="cleanyourdata">Clean your data!</h3>
<p>Before starting training, do your customary data cleaning. Most people use custom scripts tailored to the specific language pair, but the bare minimum can be achieved using the good old Moses <a href="https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl" target="_blank">clean-corpus-n.perl</a>:</p>
<pre><code class="language-bash">$ mosesdecoder/scripts/training/clean-corpus-n.perl data/corpus.uncleaned $SRC $TRG data/corpus.tok 1 100
</code></pre>
<p>In this particular example, sentences shorter than 1 token or longer than 100 tokens are excluded. Adjust these thresholds accordingly.</p>
<h3 id="trainareversemodelforbacktranslation">Train a reverse model for back-translation</h3>
<p>You can skip this step if you do not have any monolingual data.</p>
<p>In order to train a high quality system, we usually take advantage of the available monolingual resources in the target language. This is done by training a translation model in the reverse direction and then translating the monolingual corpora with it. For more details, please check this <a href="https://aclanthology.org/P16-1009/" target="_blank">paper</a>. A typical Marian configuration for this purpose would look like this:</p>
<pre><code class="language-bash">devices: 0 1 2 3
workspace: 12000
log: model/train.log
model: model/model.npz
train-sets:
  - train.clean.de
  - train.clean.en
seed: 1111
vocabs:
  - model/vocab.deen.spm
  - model/vocab.deen.spm
task: transformer-base
dim-vocabs:
  - 32000
  - 32000
shuffle-in-ram: true
# Validation set options
valid-sets:
  - dev.de
  - dev.en
valid-freq: 5000
valid-metrics:
  - ce-mean-words
  - perplexity
  - bleu-detok
disp-freq: 1000
early-stopping: 10
beam-size: 6
normalize: 0.6
max-length-factor: 3
maxi-batch-sort: src
mini-batch-fit: true
valid-mini-batch: 8
valid-max-length: 100
valid-translation-output: model/valid.bpe.en.output
keep-best: true
valid-log: model/valid.log
</code></pre>
<p>This configuration assumes a 16 GB GPU. If you have a smaller GPU, please adjust down the <code>workspace</code> accordingly.</p>
<p>After the model is trained, translate your monolingual data using the <code>output-sampling</code> option, which has been shown to produce better results. Furthermore, monolingual data might be dirty, so make sure you set the <code>max-length</code> and <code>max-length-crop</code> options. Append these to the configuration file you use for translation:</p>
<pre><code class="language-bash">output-sampling: true
max-length: 100
max-length-crop: true
</code></pre>
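<p>A minimal sketch of the back-translation step itself might look like the one below, assuming <code>translate.yml</code> is the reverse model's decoding configuration with the three options above appended, and <code>mono.trg</code> / <code>synthetic.src</code> are placeholder names for the monolingual target-language corpus and its back-translation:</p>
<pre><code class="language-bash"># Hypothetical sketch: translate the monolingual target-language corpus with the
# reverse model to obtain a synthetic source side. File names are placeholders.
# $MARIAN points to your marian build directory, as in the earlier scripts.
$MARIAN/marian-decoder -c translate.yml -d 0 1 2 3 \
    --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src -w 8000 \
    --quiet-translation --log backtranslate.log \
    -i mono.trg -o synthetic.src
</code></pre>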
<p>Some works suggest that synthetic data should be <a href="https://aclanthology.org/W19-5206/" target="_blank">tagged</a> when appending it to the true data. How much backtranslated data to use is an open question. Generally, the more the better, although you may want to balance/upweight/upsample the true data if there is too little of it compared to the backtranslated data. Refer to <a href="https://aclanthology.org/D18-1045/" target="_blank">Facebook's</a> wisdom.</p>
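<p>If you do tag, a hedged sketch of the step is shown below: prepend a tag token to the <em>synthetic</em> source sentences before mixing them with the genuine parallel data. The <code>&lt;BT&gt;</code> token and the file names are illustrative conventions, not something Marian mandates:</p>
<pre><code class="language-bash"># Tag the back-translated (synthetic) source side, then concatenate with true data.
sed 's/^/&lt;BT&gt; /' synthetic.src | gzip &gt; synthetic.tagged.src.gz
gzip -c mono.trg &gt; mono.trg.gz
cat corpus.src.gz synthetic.tagged.src.gz &gt; train.src.gz
cat corpus.trg.gz mono.trg.gz &gt; train.trg.gz
</code></pre>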
<h3 id="traintheteacher">Train the teacher</h3>
<p>Once you have your synthetic backtranslated data (optional) and your parallel corpora, you can proceed to training your teacher model.</p>
<h4 id="configurationchoice">Configuration choice</h4>
<p>You can train your teacher with one of two configuration presets: either <code>task: transformer-base</code> or <code>task: transformer-big</code>. As a rule of thumb, if you have a high-resource language pair (&gt;5M sentence pairs), you will likely see gains from using <code>transformer-big</code>.</p>
<ul>
<li>For <code>transformer-base</code> you can reuse the configuration setting posted earlier.</li>
<li>For <code>transformer-big</code>, adjust down the workspace to 10000 and add the option <code>optimizer-delay: 2</code> to the configuration.</li>
</ul>
<p>No need to adjust any other configuration settings, as the <a href="https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L73" target="_blank">task alias</a> takes care of assigning the rest of the model hyperparameters.</p>
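<p>Concretely, relative to the <code>transformer-base</code> configuration posted earlier, the <code>transformer-big</code> variant only needs these lines changed:</p>
<pre><code class="language-bash">task: transformer-big # instead of transformer-base
workspace: 10000      # adjusted down from 12000 to fit the larger model
optimizer-delay: 2    # accumulate gradients over 2 batches
</code></pre>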
<h4 id="ensembling">Ensembling</h4>
<p>One very easy way to improve the translation quality of the teacher is to produce an ensemble of systems that translate together. This is done by training identical systems, initialising them with different random seeds. The more systems, the better, although the returns are diminishing.</p>
<p>For example, if we want an ensemble of two systems, we need two separate configuration files for training, in which the <code>seed</code> parameter differs. Configuration one would have <code>seed: 1111</code>, whereas configuration two would have <code>seed: 2222</code>. At decoding time, don't forget to load all produced models as shown <a href="#theteachermodel">earlier in the tutorial</a>.</p>
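<p>As a sketch, the two training configurations can be identical copies of the earlier one, differing only in the lines below (the file names are placeholders I picked for illustration):</p>
<pre><code class="language-bash"># config.seed1111.yml
model: model1/model.npz
seed: 1111

# config.seed2222.yml
model: model2/model.npz
seed: 2222
</code></pre>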
<h2 id="trainingthestudent">Training the student</h2>
<p>The student model is trained to approximate the teacher distribution. In this manner we can achieve translation quality approximating that of the teacher model at just a fraction of the computational cost. For more information, check <a href="https://aclanthology.org/D16-1139/" target="_blank">this paper</a>. An up-to-date guide with code and scripts for training student models can be found on <a href="https://github.com/browsermt/students/tree/master/train-student" target="_blank">github</a>.</p>
<h3 id="producingthedistilledtrainingdata">Producing the distilled training data</h3>
<p>The training data for the student is produced by translating the <strong>complete</strong> training set with your teacher ensemble. This is a cumbersome task, because the teacher model is heavyweight. I recommend that you use the same settings as the ones discussed <a href="#minibatchsize">earlier in this tutorial</a>.</p>
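<p>A hedged sketch of that decoding run, reusing the batched teacher setup from earlier (the training-set file names are placeholders, and <code>--fp16</code> only applies if you decode on a GPU, as discussed below):</p>
<pre><code class="language-bash"># Hypothetical sketch: translate the full training source with the teacher
# ensemble to produce the distilled target side of the student training data.
# $MARIAN points to your marian build directory, as in the earlier scripts.
cd enes.teacher.bigx2
zcat ../data/train.en.gz &gt; train.en
$MARIAN/marian-decoder -c config.batched.yml -d 0 1 2 3 --fp16 \
    --quiet --quiet-translation --log distill.log \
    -i train.en -o train.es.translated
gzip train.es.translated
</code></pre>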
<p>It is possible to quantise the teacher model(s) before translating, but depending on the language pair and configuration, this might lead to a substantial drop in the quality of the translated data. More details later.</p>
<p>If you are decoding on a fairly recent NVIDIA GPU, feel free to add the <code>fp16: true</code> to the decoder configuration, in order to use 16 bit float decoding.</p>
<p>More details about these will be given <a href="#advancedtopics">later in the tutorial</a>.</p>
<h3 id="producingwordalignments">Producing word alignments</h3>
<p>Before we can proceed to training the student model, we need to produce IBM-model word alignments using <a href="https://github.com/clab/fast_align" target="_blank">fast-align</a>. The alignments are needed both for the guided alignment training baked into our students and for building the lexical shortlist.</p>
<p>The script that takes care of everything is located on <a href="https://github.com/browsermt/students/tree/master/train-student/alignment" target="_blank">github</a>. You need to edit <code>generate-alignment-and-shortlist.sh</code> to point it to the locations of the corpora and the trained SentencePiece (SPM) vocabulary.</p>
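<p>Under the hood, the alignment part of that script boils down to something like the sketch below, run on the SentencePiece-segmented corpus. The file names are placeholders and the exact invocations may differ from the script; the subsequent lexical shortlist extraction is handled by the same script:</p>
<pre><code class="language-bash"># Hypothetical sketch of producing symmetrised word alignments with fast_align.
# corpus.spm.en / corpus.spm.es are the subword-segmented training files.
paste corpus.spm.en corpus.spm.es | sed 's/\t/ ||| /' &gt; corpus.en-es
fast_align -i corpus.en-es -d -o -v    &gt; forward.align
fast_align -i corpus.en-es -d -o -v -r &gt; reverse.align
atools -i forward.align -j reverse.align -c grow-diag-final-and &gt; corpus.aln
gzip corpus.aln
</code></pre>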
<h3 id="trainingthestudent">Training the student</h3>
<p>Now that the teacher model(s) have translated the full training set, we can use that as the input to the student model. The student model is trained on the original source text and the <em>synthetic</em>, translated target text. In our experiments so far, the student models were trained using this configuration, dubbed <code>tiny</code>:</p>
<pre><code class="language-bash">$ cat config.yml
dec-cell: ssru
dec-cell-base-depth: 2
dec-cell-high-depth: 1
dec-depth: 2
dim-emb: 256
enc-cell: gru
enc-cell-depth: 1
enc-depth: 6
enc-type: bidirectional
tied-embeddings-all: true
transformer-decoder-autoreg: rnn
transformer-dim-ffn: 1536
transformer-ffn-activation: relu
transformer-ffn-depth: 2
transformer-guided-alignment-layer: last
transformer-heads: 8
transformer-no-projection: false
transformer-postprocess: dan
transformer-postprocess-emb: d
transformer-preprocess: &quot;&quot;
transformer-tied-layers:
  []
transformer-train-position-embeddings: false
type: transformer
</code></pre>
<p>and the following training script:</p>
<pre><code class="language-bash">#!/bin/bash -v

# Set GPUs.
GPUS=&quot;0 1 2 3&quot;
MARIAN=../../marian-dev/build

SRC=en
TRG=es

# Add symbolic links to the training files.
test -e corpus.$SRC.gz || exit 1    # e.g. ../../data/train.en.gz
test -e corpus.$TRG.gz || exit 1    # e.g. ../../data/train.es.translated.gz
test -e corpus.aln.gz  || exit 1    # e.g. ../../alignment/corpus.aln.gz
test -e lex.s2t.gz     || exit 1    # e.g. ../../alignment/lex.s2t.pruned.gz
test -e vocab.spm      || exit 1    # e.g. ../../data/vocab.spm

# Validation set with original source and target sentences (not distilled).
test -e devset.$SRC || exit 1
test -e devset.$TRG || exit 1

mkdir -p tmp

$MARIAN/marian \
    --model model.npz -c student.tiny11.yml \
    --train-sets corpus.{$SRC,$TRG}.gz -T ./tmp --shuffle-in-ram \
    --guided-alignment corpus.aln.gz \
    --vocabs vocab.spm vocab.spm --dim-vocabs 32000 32000 \
    --max-length 200 \
    --exponential-smoothing \
    --mini-batch-fit -w 9000 --mini-batch 1000 --maxi-batch 1000 --devices $GPUS --sync-sgd --optimizer-delay 2 \
    --learn-rate 0.0003 --lr-report --lr-warmup 16000 --lr-decay-inv-sqrt 32000 \
    --cost-type ce-mean-words \
    --optimizer-params 0.9 0.98 1e-09 --clip-norm 0 \
    --valid-freq 5000 --save-freq 5000 --disp-freq 1000 --disp-first 10 \
    --valid-metrics bleu-detok ce-mean-words \
    --valid-sets devset.{$SRC,$TRG} --valid-translation-output devset.out --quiet-translation \
    --valid-mini-batch 64 --beam-size 1 --normalize 1 \
    --early-stopping 20 \
    --overwrite --keep-best \
    --log train.log --valid-log valid.log
</code></pre>
<p>The script takes care of checking for all the necessary files and will fail if they are missing.</p>
<p>There are other possible student configurations. We have a <code>base</code> configuration prefix <a href="https://github.com/browsermt/students/tree/master/train-student/models" target="_blank">on github</a>, which is slower than the <code>tiny</code> shown above but delivers better translation quality. Experiment with different configurations until you find something acceptable.</p>
<p>Note that the student model will take forever to train. You really want to overfit to the outputs of the teacher, so going over the data for many epochs is necessary. You may see the BLEU score stalling for many consecutive validation steps before improving again.</p>
<h3 id="quantisationfinetuning">Quantisation fine-tuning</h3>
<p>Finally, we will talk about quantisation fine-tuning. When we quantise the model to a lower precision, we damage it. The model might not cope well with that damage right out of the box, so we are going to fine-tune it by training very briefly with a GEMM that mimics the damage from quantisation. To do this, add the following to the configuration file:</p>
<pre><code class="language-bash">quantize-bits: 8
</code></pre>
<p>Fine-tuning is really fast. The model's quality will start going down after a few thousand mini-batches, so make sure you validate frequently in order not to miss the sweet spot (<code>valid-freq: 200</code> would be good).</p>
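<p>A hedged sketch of kicking off the fine-tuning run, assuming <code>finetune.yml</code> is a copy of the student training configuration with <code>quantize-bits: 8</code> and <code>valid-freq: 200</code> added (the checkpoint name produced by <code>--keep-best</code> may differ on your setup):</p>
<pre><code class="language-bash"># Continue training the converged student briefly with the quantisation-aware GEMM.
# $MARIAN points to your marian build directory, as in the earlier scripts.
cp model.npz.best-bleu-detok.npz model.finetuned.npz
$MARIAN/marian -c finetune.yml --model model.finetuned.npz \
    --save-freq 200 --disp-freq 100
</code></pre>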
<h3 id="results">Results</h3>
<p>A significant amount of compute time is required to train an efficient student model, so we can't do that for the duration of this tutorial. However, we can show you what you can achieve in practice! Take a look at our blazing fast, privacy focused, cross-platform translation app <a href="https://translatelocally.com" target="_blank">translateLocally</a>, which is only made possible when utilising all of the techniques outlined above:</p>
<img align="middle" width="85%" src="https://nbogoychev.com/content/images/2021/11/translatelocally.gif" alt="Efficient machine translation">
<h2 id="caveats">Caveats</h2>
<p>Optimising for speed doesn't come without caveats. Translation quality drops to a certain extent. The drop is not uniform across models, so test before you deploy!</p>
<ul>
<li>Quantisation affects different models differently. As a rule of thumb, the smaller the student, the more it loses from quantisation, but very large teacher models have at times been shown to work quite poorly with quantisation too. Always test before you deploy!</li>
<li>Lexical shortlisting is known to cause quality issues when used with a very small mini-batch size. Proceed with caution when translating single sentences with a lexical shortlist. This can be somewhat ameliorated by making the shortlist more conservative, i.e. increasing the number of vocabulary items it lets through during construction: <code>$MARIAN/marian-conv --shortlist lex.s2t.gz 100 100 0 --vocabs vocab.esen.spm vocab.esen.spm -d lex.s2t.bin</code>. <code>100 100</code> means take the top 100 words from the vocabulary, and the top 100 translations of each word according to the shortlist. Increasing these values further will slow down translation, but improve quality.</li>
<li>Humans do <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.68.pdf" target="_blank">notice the difference</a> between teacher and student models. METEOR scores, which have been shown to <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.57.pdf" target="_blank">correlate better with human judgements</a>, also favour teacher models. There is no free performance gain.</li>
</ul>
<h1 id="advancedtopics">Advanced topics</h1>
<p>In this section we will talk about advanced topics that you may be interested in if you are in the business of providing commercial machine translation systems or want to do a PhD in the subfield of machine translation.</p>
<h2 id="hyperparametertuning">Hyperparameter tuning</h2>
<p>Carefully tune your hyperparameters when decoding! Different combinations of models and hardware behave differently. More specifically, <code>mini-batch: 16</code> is not a hard and fast setting. Depending on your CPU, the amount of cache it has and the amount of system memory (RAM) you have, other settings may be optimal. Experiment with the <code>mini-batch</code>, <code>maxi-batch</code> and <code>workspace</code> parameters until you arrive at an optimal configuration for your specific setup.</p>
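<p>A small brute-force sweep over the example scripts could look like the sketch below; the value ranges are illustrative, so adapt them to your hardware:</p>
<pre><code class="language-bash"># Hypothetical sweep over mini-batch and workspace for the batched student script.
for mb in 8 16 32 64; do
  for ws in 2000 4000 8000; do
    echo &quot;mini-batch=$mb workspace=$ws&quot;
    ./batched.translation.sh --cpu-threads 16 --mini-batch $mb --workspace $ws
  done
done
</code></pre>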
<h2 id="efficientgpudecoding">Efficient GPU decoding</h2>
<p>Efficient GPU decoding differs from efficient CPU decoding in several key aspects:</p>
<ul>
<li>GPUs need a much larger mini-batch size to reach their full potential. While CPU decoding performance stops scaling around a mini-batch of 16-24, GPU decoding practically scales until you run out of memory. In order to optimise for speed on the GPU, you need to push the <code>workspace</code> to the limits of the device memory, as well as the <code>mini-batch</code> size. For 24 GB GPUs like the 3090, <code>mini-batch: 768</code> and <code>workspace: 18000</code> are a good place to start your binary search.</li>
<li>Shortlists don't improve translation speed on the GPU, which is convenient: just skip the shortlist.</li>
<li>We have experimented with 8-bit integer decoding on the GPU, but we failed to get any performance gains compared to just using <code>float16</code> decoding. In order to use that mode, just set <code>fp16: true</code> in the decoder configuration. You should get about a 20% speed improvement compared to fp32 decoding, as well as the ability to use a much larger mini-batch size.</li>
</ul>
<p>GPUs are in general really fast, even when compared to decoding on multiple CPUs. Running against the <a href="#minibatchsize">batched</a> student from the previous section:</p>
<pre><code class="language-bash">$ ./batched.translation.sh --cpu-threads 12
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 12
[2021-11-22 17:15:25] Total time: 14.46366s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ..
</code></pre>
<p>Running on 1 GPU (the <code>-d 0</code> flag specifies that we should run on GPU 0):</p>
<pre><code class="language-bash">$ ./batched.translation.sh -d 0
### Translating wmt13 en-es on the CPU. Extra flags -d 0
[2021-11-22 17:17:12] Total time: 3.61907s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ...
</code></pre>
<p>Providing better hyperparameters for the GPU:</p>
<pre><code class="language-bash">$ ./batched.translation.sh -d 0 --workspace 18000 --mini-batch 768 --fp16 --maxi-batch 3000
### Translating wmt13 en-es on the CPU. Extra flags -d 0 --workspace 18000 --mini-batch 768 --fp16 --maxi-batch 3000
[2021-11-22 17:17:51] Total time: 1.44297s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.3,
</code></pre>
<p>A fully optimised GPU run is more than 10X faster on this very small example. If we increase the size of the dataset, the GPU will easily be 100X+ faster. In our case the GPU is an A100 and the CPU is a Zen 2 EPYC processor.</p>
<h2 id="advancedquantisationoptions">Advanced quantisation options</h2>
<p>Marian supports two integer backends, <a href="https://github.com/pytorch/FBGEMM" target="_blank">fbgemm</a> and <a href="https://github.com/kpu/intgemm" target="_blank">intgemm</a>, which deliver different performance depending on the model type and architecture.</p>
<p>Intgemm is hardware agnostic and has dedicated codepaths for both very old devices (SSSE3) and very new devices (AVX512VNNI). Fbgemm, on the other hand, only supports AVX2 and AVX512. Intgemm's model format is also hardware agnostic, whereas in order to use fbgemm one needs to know in advance what hardware the model is going to run on. <code>marian-conv --help</code> will give you more details.</p>
<p>Finally, both intgemm and fbgemm have an int16 format, which is not as fast as the int8 ones, but could potentially work better in cases where 8-bit quantisation damages translation quality too much.</p>
<h2 id="usingspeedorientedforkofmarian">Using speed oriented fork of Marian</h2>
<p>This tutorial describes what can be achieved with marian-dev master alone. There exists however an experimental version of marian focused on speed as part of the <a href="https://browser.mt" target="_blank">bergamot</a> project. How to use it, together with tutorial for creating models can be found on <a href="https://github.com/browsermt/students" target="_blank">github</a>. If you are interested in running a GPU fork with experimenta nvidia patches, you can also find it on <a href="https://github.com/XapaJIaMnu/marian-dev/tree/8bitgpu" target="_blank">github</a>.</p>
<h1 id="researchdirections">Research directions</h1>
<p>In this section we will briefly go over current research directions for efficient MT. This is all bleeding edge stuff that I have seen in papers, but not in practice.</p>
<h2 id="deepencodershallowdecoder">Deep Encoder/Shallow decoder</h2>
<p>The most computationally heavy part of machine translation inference is the decoder, because this is where the autoregressive part of the computation happens, whereas the computation in the encoder happens only once. Based on that, it has been <a href="https://arxiv.org/abs/2006.10369" target="_blank">suggested</a> that one can change the standard 6-6 architecture to a 12-1 one without a loss of accuracy, while significantly increasing translation speed. You should experiment to discover better student and teacher architectures!</p>
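<p>In Marian terms, this amounts to nothing more than changing the depth options in the training configuration; a hedged sketch of the relevant lines:</p>
<pre><code class="language-bash">enc-depth: 12 # deep encoder
dec-depth: 1  # shallow (single-layer) decoder
</code></pre>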
<h2 id="widermodelsnotdeeper">Wider models, not deeper</h2>
<p>Once you get into the domain of very high-resource language pairs (50M+ sentences), increasing the number of parameters of your neural network architecture once again becomes relevant. Experience has shown that increasing the <em>width</em> of the model (meaning the dimension of the embeddings/RNN/hidden layers/attention) is more stable than increasing the depth of the model. Very deep neural networks sometimes fail to train at all, but very wide neural networks don't seem to suffer from the same shortcoming. If you have the data and the compute, you should go wide, not deep!</p>
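<p>As a hedged sketch, widening a transformer in Marian means increasing the dimension options rather than the depth ones; the exact values below are illustrative only:</p>
<pre><code class="language-bash">dim-emb: 2048             # wider embeddings / hidden states
transformer-dim-ffn: 8192 # wider feed-forward layer
transformer-heads: 16     # more attention heads
</code></pre>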
<h2 id="knnbasedshortlisting">KNN based shortlisting</h2>
<p>As we have shown, lexical shortlisting provides a noticeable gain in inference speed, but it may lead to quality issues. IBM models are bad at capturing excessive subword segmentation or idiomatic expressions. As a result, lexical shortlists produced with IBM models favour more literal translations and struggle with cases where there is heavy subword segmentation. In order to alleviate this issue, the community is exploring KNN-based shortlisting (refer to <a href="https://arxiv.org/abs/1903.03129" target="_blank">this</a> and <a href="https://arxiv.org/abs/1806.00588" target="_blank">this</a>).</p>
<p>Marian already supports this via the option <code>--output-approx-knn</code>, although the feature is still considered experimental. For starters, in order to use it, the model must NOT have a bias at the output layer, so the configuration option <code>--output-omit-bias</code> must be specified at training time.</p>
<h2 id="pruning">Pruning</h2>
<p>Training a teacher model, translating the training set, and then training a student model is very demanding in terms of computational resources. An alternative approach is to prune the parameters of the teacher model as it is trained, reducing the model size to something of a similar magnitude to a student. Unfortunately, so far student models achieve better Pareto speed/quality trade-offs than pruned models, but research is ongoing. Check out existing work on pruning whole <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.116.pdf" target="_blank">models</a> or just <a href="https://aclanthology.org/2020.emnlp-main.211/" target="_blank">attention</a>.</p>
<h1 id="helpfulreading">Helpful reading</h1>
<p>Congratulations on getting this far in the tutorial! That means you are really interested in making efficient machine translation work. Here is a list of papers that might be useful starting points for anyone wanting to go deeper into efficient MT work.</p>
<ul>
<li>On knowledge distillation: <a href="https://aclanthology.org/D16-1139/" target="_blank">Sequence-Level Knowledge Distillation</a>.</li>
<li>On training efficient MT systems with marian for the fast machine translation competition, years <a href="https://aclanthology.org/D19-5632/" target="_blank">2019</a>, <a href="https://aclanthology.org/2020.ngt-1.26/" target="_blank">2020</a>, <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.74.pdf" target="_blank">2021</a></li>
<li>On KNN output layer shortlist <a href="https://arxiv.org/abs/1806.00588" target="_blank"> Fast Locality Sensitive Hashing for Beam Search on GPU</a> and <a href="https://aclanthology.org/2022.wmt-1.79/" target="_blank">Revisiting Locality Sensitive Hashing for Vocabulary Selection in Fast Neural Machine Translation</a>.</li>
<li>On pruning: <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.116.pdf" target="_blank">models</a> or <a href="https://aclanthology.org/2020.emnlp-main.211/" target="_blank">attention</a> and again  <a href="https://aclanthology.org/P19-1580/" target="_blank">attention</a>.</li>
<li>On architectures: <a href="https://arxiv.org/abs/2006.10369" target="_blank">Deep encoder, shallow decoder</a>.</li>
<li>On <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.68.pdf" target="_blank">human evaluation of student models</a> and general advice regarding <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.57.pdf" target="_blank">automatic and human</a> evaluation.</li>
<li>On <a href="https://aclanthology.org/P16-1009/" target="_blank">backtranslation</a>, <a href="https://aclanthology.org/W19-5206/" target="_blank">tagged backtranslation</a>, backtranslation at <a href="https://aclanthology.org/D18-1045/" target="_blank">scale</a> and <a href="https://arxiv.org/pdf/1903.06059.pdf" target="_blank">output sampling</a> for backatranslation.</li>
<li>The <a href="https://marian-nmt.github.io/examples/mtm2019" target="_blank">previous</a> efficient MT tutorial from the 2019 MT marathon.</li>
</ul>
<p>Thank you everybody for participating in this tutorial! I hope it was helpful! Special thanks to <a href="https://scholar.google.com/citations?user=82Uy1aAAAAAJ" target="_blank">Roman Grundkiewicz</a> for proofreading and adding suggestions to the tutorial.</p>
<p>Nick</p>
<div class="myrow">
  <div class="mycolumn">
   <a href="https://unbabel.com" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/unbabel.png" alt="Efficient machine translation" style="width:60%">
    </a>
  </div>
  <div class="mycolumn">
    <a href="https://www.tilde.com" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/tilde.png" alt="Efficient machine translation" style="width:75%">
    </a>
  </div>
  <div class="mycolumn">
    <a href="https://edinburghnlp.inf.ed.ac.uk" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/edinburgh_colour.png" alt="Efficient machine translation" style="width:120%">
    </a>
  </div>
  <div class="mycolumn">
    <a href="https://marian-project.eu" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/marian-cef.png" alt="Efficient machine translation" style="width:100%">
    </a>
  </div>
  <div class="mycolumn">
    <a href="https://ec.europa.eu/inea/en/connecting-europe-facility" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/eu.png" alt="Efficient machine translation" style="width:120%">
    </a>
  </div>
</div>
<!--kg-card-end: markdown--><figure class="kg-card kg-embed-card"><iframe width="356" height="200" src="https://www.youtube.com/embed/mrC3YOW-NQs?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure>]]></content:encoded></item><item><title><![CDATA[Not all parameters are born equal! Attention is mostly what you need!]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Greetings fellow NLP-kind. I got a paper in <a href="https://blackboxnlp.github.io" target="_blank">BlackboxNLP 2021</a>, the awesome workshop that aims to shed light on how, what, and why exactly happens inside deep neural networks... So I am going to blog about it.</p>
<h2 id="thepremise">The premise</h2>
<p>The Deep Neural Network is an universal function approximator, that achieves</p>]]></description><link>https://nbogoychev.com/not-all-parameters-are-born-equal-attention-is-mostly-what-you-need/</link><guid isPermaLink="false">6151e2293a50c0063d3350da</guid><category><![CDATA[research]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Thu, 30 Sep 2021 11:40:29 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2021/09/cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2021/09/cover.jpg" alt="Not all parameters are born equal! Attention is mostly what you need!"><p>Greetings fellow NLP-kind. I got a paper in <a href="https://blackboxnlp.github.io" target="_blank">BlackboxNLP 2021</a>, the awesome workshop that aims to shed light on how, what, and why exactly happens inside deep neural networks... So I am going to blog about it.</p>
<h2 id="thepremise">The premise</h2>
<p>The Deep Neural Network is a universal function approximator that achieves surprisingly good results on a variety of tasks, thanks to its staggeringly large number of parameters and the infinite power and wisdom of backpropagation.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2021/09/servers.jpg" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">Also subject to the availability of copious amounts of GPUs.</font></center>
<p>But is it really necessary to train all of those parameters? Could we just get away with training a small subset of those parameters and achieve similar performance? If we can, indeed, are some parameters more important than others?</p>
<h2 id="thestudy">The study</h2>
<p>We study the value of training neural network parameters, as opposed to initialising them at random and freezing them. We perform a case study using neural machine translation and neural language modeling tasks, and transformers of various sizes and shapes as the architecture.</p>
<p>We isolate three separate groups of parameters in a transformer: The embedding layer, the attention layer, and the FFN layer.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/transformer.png" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">A simplified overview of a transformer.</font></center>
<p>We perform an ablation study where one or two of the three components are initialised at random, frozen, and never trained afterwards until the neural network converges, and note how much quality has been affected.</p>
<h2 id="thefindings">The findings</h2>
<p>We studied three different transformer presets for neural machine translation: big, base and tiny. In general, we found that bigger transformers have more built-in redundancy and cope better with parts of their parameters being frozen compared to smaller transformers.</p>
<h3 id="attentionismostlywhatyouneed">Attention is mostly what you need.</h3>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/top_half.png" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">transformer-big on Turkish-English, one frozen component.</font></center>
<p>We found that when freezing one component of a transformer, the time to convergence increases slightly in terms of number of epochs, and the BLEU scores drop slightly (4%-7%). Preset <strong>(3)</strong>, where we have a frozen and random FFN layer, and trainable embeddings and attention, performs the best, despite only 48% of the parameters being trainable.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/lower_half.png" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">transformer-big on Turkish-English, multiple frozen components.</font></center>
<p>When freezing multiple components at once, we found that the best results are achieved by having just the attention be trainable, although just having the FFN trainable produces surprisingly good results as well. The only time where the model completely fails to learn is if we just have the embeddings trainable.</p>
<h3 id="notallparametersarebornequalbuttheyareneverthelessnecessary">Not all parameters are born equal, but they are nevertheless necessary</h3>
<p>Despite the attention and the FFN layer being more or less self-sufficient and much more important than the embedding layer, this doesn't mean we can just remove the embedding layer, or reduce its size. We perform a number of experiments and note that when reducing the size of frozen and random components, the model's performance suffers, even if the trainable components are left untouched. This suggests that the trainable components make use of the randomly initialised transformations and the <strong>sheer number of parameters is more important than whether they are trainable or not</strong>.</p>
<p>In our  <a href="https://arxiv.org/abs/2010.11859" target="_blank">paper</a> we show detailed results in a variety of different combinations of neural network configurations, but the overall trend holds true across all of them.</p>
<h3 id="languagemodelsbehavedifferently">Language models behave differently</h3>
<p>Language models find trainable embeddings much more important for achieving lower perplexity than trainable FFN or attention layer. The drop in perplexity is also a lot more dramatic than the drop in BLEU for translation models.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/lm.png" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">Perplexity on an English transformer language model.</font></center>
<p>This suggests that our findings are likely to be task specific.</p>
<h2 id="implications">Implications</h2>
<ul>
<li>We question the vast majority of the transfer learning work that relies on pretrained <em>choose-your-sesame-street-character</em> embeddings for use in downstream tasks. We believe that one should always attempt to solve the downstream task with randomly initialised embeddings before using an off-the-shelf solution, in order to truly show the value (or lack thereof) of pretraining.</li>
<li>Could this mean that we could potentially use an RNG to generate the less important components on the fly during inference, enabling memory-efficient networks to be used on embedded devices?</li>
<li>Neural networks are still a blackbox. This particular work is one in a long line of research in <a href="https://en.wikipedia.org/wiki/Echo_state_network" target="_blank">Echo State networks</a>.</li>
</ul>
<p>We have a lot more details, experiments and analysis in the <a href="https://arxiv.org/abs/2010.11859" target="_blank">paper</a>. If interested, please check it out, and come talk to us at the <a href="https://blackboxnlp.github.io" target="_blank">BlackboxNLP 2021</a> poster session!!</p>
<p>Thank you for your time!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://www.pexels.com/photo/smiling-ethnic-woman-with-blank-poster-in-empty-flat-3758104" target="_blank">pexels</a> <a href="
https://pixabay.com/illustrations/banner-header-attention-caution-1165979/" target="_blank">pixabay</a> <a href="
https://pixabay.com/photos/server-space-the-server-room-dark-2160321/" target="_blank">pixabay</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Failing to do simple domain adaptation for Neural Machine Translation]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>So... The ever increasing number of papers accepted in the field of NLP and Machine Translation makes it virtually impossible for people to keep track of current ongoing research. Therefore, I have taken to writing blog posts about my failings as a researcher...</p>
<p>And, of course the best place to</p>]]></description><link>https://nbogoychev.com/failing-to-do-simple-domain-adaptation-for-neural-machine-translation/</link><guid isPermaLink="false">6149f9a93a50c0063d334ecf</guid><category><![CDATA[research]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Mon, 27 Sep 2021 15:00:27 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2021/09/translation_logo.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2021/09/translation_logo.jpg" alt="Failing to do simple domain adaptation for Neural Machine Translation"><p>So... The ever increasing number of papers accepted in the field of NLP and Machine Translation makes it virtually impossible for people to keep track of current ongoing research. Therefore, I have taken to writing blog posts about my failings as a researcher...</p>
<p>And, of course the best place to start with is my and <a href="https://pinzhenchen.github.io/" target="_blank">Pinzhen Chen's</a> <a href="https://arxiv.org/abs/2101.00421" target="_blank">paper</a> accepted at the 2021 edition of the <a href="https://insights-workshop.github.io" target="_blank">Workshop on Insights from Negative Results in NLP</a>.</p>
<h2 id="thepremise">The premise</h2>
<p>Machine translation suffers badly from domain mismatch issues. That is to say, when we train our model on text from news articles, we can't reasonably expect that same model to translate well a medicine textbook.</p>
<p>The textbook would have a completely different distribution of the most commonly used words. Some terminology that is not at all common in news articles would be very common in those textbooks.</p>
<p>Due to exposure bias during training, it would be difficult for the <em>news trained</em> model to produce those rare words during decoding. Instead, the model often prefers to hallucinate some common phrase seen in training such as <em>yesterday's football game...</em>.</p>
<h2 id="thebestsolution">The best solution</h2>
<p>Typically, the way to solve this problem reliably is to fine tune your translation model on some high quality parallel in-domain data before putting it out in the wild.</p>
<p>Unfortunately high quality parallel in-domain data is seldom available for the rare domain that you might need. Also, fine tuning has high computational cost.</p>
<h2 id="thefancysolution">The fancy solution</h2>
<p>The fancy solutions to this problem involve training in such a way that you diminish the model's bias towards its training data, so that it performs better on out-of-domain datasets.</p>
<p>Unfortunately such methods (minimum risk training, for example) are quite expensive to use in terms of computational cost and significantly complicate the training pipeline. Also, the results are not as good as <strong>The best solution</strong>.</p>
<h2 id="thestupidshortlistoursolution">The stupid shortlist (our) solution</h2>
<p>We decided that the way to solve this problem is to limit the output vocabulary of our translation model to a pre-selected vocabulary that better matches the domain in question. We do that by using an IBM word alignment model that would hopefully act as a regulariser.</p>
<p>Once we compute the word alignments we can extract a translation <em>shortlist</em> of words. This works sort of like a dictionary translation: When we read the source sentence, we limit the neural network to only produce words for the target sentence that are direct translations, according to the IBM models. In this way we hoped to prevent the neural network from exhibiting strong exposure bias when confused by out of domain text.</p>
<p>The advantage of our method is that it is much cheaper computationally than existing work, and it doesn't require in-domain parallel data.</p>
<h2 id="thestupidnbestreorderingoursolution">The stupid n-best reordering (our) solution</h2>
<p>We also decided to approach the problem from a different perspective. Even if the model hallucinates high scoring translations, maybe if we increase the beam size, some of the translations down the line will be more adequate. But how do we define the notion of adequacy?</p>
<p>Well, we made the assumption that all translations in a big <em>n-best</em> list would share some similarity with each other, and that the most adequate one would be the one that is most similar to every other translation. Therefore, we produced a big <em>n-best</em> list and scored every candidate translation against every other translation.</p>
<p>The advantage of this method is that it requires no in-domain data, but is somewhat slow during translation.</p>
<h2 id="results">Results</h2>
<p>The <strong>shortlist</strong> solution showed some promise, delivering an increase of several BLEU points in a constrained low-resource setting; however, it turned out that when the domain mismatch was too great, or the setting wasn't low-resource, our method showed no improvement.</p>
<p>The <strong>reordering</strong> method showed consistent improvement in BLEU score, but upon closer examination it turned out that it was preying on BLEU's length penalty: our method accidentally acted as a regulariser that favoured shorter translations, not more adequate ones.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2021/09/results.jpg" alt="Failing to do simple domain adaptation for Neural Machine Translation">
<center><font size="-2">Yeah, we don't have any results here, sorry just excuses.</font></center>
<h2 id="whydidwefail">Why did we fail?</h2>
<p>The main reason why our methods fail is, in short, vocabulary mismatch.</p>
<p>In machine translation we normally split rare words into subwords (commonly known as byte pair encoding). In a domain-mismatched scenario, we split words very often, to the point that individual tokens become nonsensical, as shown in this table.</p>
<table>
<thead>
<tr>
<th style="text-align:center">German</th>
<th style="text-align:center">English</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">sein Pilot hat nicht die volle Kontrolle .</td>
<td style="text-align:center">its p@@ il@@ ot is@@ n’t in control .</td>
</tr>
<tr>
<td style="text-align:center">und Z@@ eth@@ rid ? nur einen Strei@@ f@@ sch@@ uss .</td>
<td style="text-align:center">and , Z@@ eth@@ rid , just gr@@ aze it .</td>
</tr>
</tbody>
</table>
<p>When dealing with out-of-domain data, we are virtually guaranteed to encounter words that the model has either seen very few times or not at all, so a lot of subword splitting is warranted. In our experiments with a large domain mismatch, the average sentence length after applying subword splitting nearly <strong>doubled</strong>, meaning that our lexical shortlist obtained using IBM models could not offer much meaningful information.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/pexels-cottonbro-7703661.jpg" alt="Failing to do simple domain adaptation for Neural Machine Translation">
<center><font size="-2">The idea was worth it, but it really doesn't work.</font></center>
<p>If you want to learn more about it, go read our Insights from Negative Results in NLP 2021 <a href="https://arxiv.org/abs/2101.00421" target="_blank">paper</a>.</p>
<p>See ya at the workshop!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/photos/key-old-flower-nostalgic-vintage-5105878/2" target="_blank">pixabay</a> <a href="
https://pixabay.com/illustrations/result-excuse-me-fail-inability-to-3236280/" target="_blank">pixabay</a> <a href="
https://www.pexels.com/photo/dawn-fashion-people-woman-7703661/" target="_blank">pexels</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>