<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[XapaJIaMnu]]></title><description><![CDATA[Languages, Characters and Speed]]></description><link>https://nbogoychev.com/</link><image><url>https://nbogoychev.com/favicon.png</url><title>XapaJIaMnu</title><link>https://nbogoychev.com/</link></image><generator>Ghost 3.40</generator><lastBuildDate>Tue, 14 Apr 2026 12:42:23 GMT</lastBuildDate><atom:link href="https://nbogoychev.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Hanzi of the day! 盲]]></title><description><![CDATA[In the land of the dead eyes, the one eyed man is king!]]></description><link>https://nbogoychev.com/hanzi-of-the-day-17/</link><guid isPermaLink="false">69da452b8ce6010590c752d6</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 12 Apr 2026 08:43:00 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2026/04/cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2026/04/cover.jpg" alt="Hanzi of the day! 盲"><p>Hi everyone!</p>
<p>As my Chinese improves (or at least my <strong>confidence</strong> in it, if not my actual skills), I chance upon lots of new characters, and I try to create mnemonics for myself to learn them.</p>
<p>Today's character <a href="https://en.wiktionary.org/wiki/盲" target="_blank">盲</a> <em>máng</em> consists of the character <a href="https://en.wiktionary.org/wiki/亡" target="_blank">亡</a> (meaning lose, flee, <strong>die</strong> and <strong>deceased</strong>) on top of an eye <a href="https://en.wiktionary.org/wiki/目" target="_blank">目</a>, making up the lovely combination of <em>death of the eye</em>!</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2026/04/wiktionary_small.png" alt="Hanzi of the day! 盲">
<center><font size="-3">The ancient forms of the glyph show the same basic structure, except sometimes death appears on the side, as opposed to on top.</font></center>
<p>There are actually a number of characters which use this particular formula, notably:</p>
<ul>
<li><a href="https://en.wiktionary.org/wiki/忘" target="_blank">忘</a> <strong>to forget</strong>, the <a href="https://en.wiktionary.org/wiki/亡" target="_blank">death</a> of the <a href="https://en.wiktionary.org/wiki/心" target="_blank">heart</a>. The astute reader will note that I have written about it on this very <a href="https://nbogoychev.com/hanzi-of-the-day-6/" target="_blank">blog</a>!</li>
<li><a href="https://en.wiktionary.org/wiki/妄" target="_blank">妄</a> <strong>preposterous</strong>, the <a href="https://en.wiktionary.org/wiki/亡" target="_blank">death</a> of a <a href="https://en.wiktionary.org/wiki/女" target="_blank">woman</a>. That one is funny, albeit it doesn't make any sense semantically. Here 亡 serves as a phonetic component, because its pronunciation <em>wàng</em> is almost the same as the pronunciation of 妄 <em>wáng</em>.</li>
</ul>
<p>Incredibly, native Chinese speakers are seldom aware of these connections (unless they are Chinese scholars) and the reason is that first and second language Chinese learners acquire character knowledge in a very different way:</p>
<p>Native Chinese speakers learn to write almost exclusively as children. They are in no rush to learn the 5000 or so characters required; they take their time over 12 years of study:</p>
<ul>
<li>Children repeatedly write characters many times over, in a specific stroke order. This primes them for a strong <a href="https://en.wikipedia.org/wiki/Enactment_effect" target="_blank">enactment effect</a>, where using the correct writing sequence helps memorise the characters. There are <a href="https://www.sciencedirect.com/science/article/abs/pii/S0346251X24000770" target="_blank">papers</a> <a href="https://www.sciencedirect.com/science/article/abs/pii/S0346251X17310849" target="_blank">about</a> it, albeit mostly studying second language learners.</li>
<li>First language speakers do not need to understand the glyph origin in order to memorise it. Logogram history and logic, while very interesting (to me), are not necessary for achieving literacy.</li>
</ul>
<img align="middle" width="40%" src="https://nbogoychev.com/content/images/2026/04/children.png" alt="Hanzi of the day! 盲">
<center><font size="-3">Example child hanzi exercise book from Taiwan. After completing thousands of those, memorisation is inevitable...</font></center>
<p>Furthermore, there's an inherent difference between acquiring a logographic writing system as an adult and as a child:</p>
<p>Adults usually don't have the same free time to allocate to the task as children. Pesky things such as jobs, chores, <em>raising their own kids</em>, etc, do tend to get in the way of the pursuit of knowledge. If I had the time, and 12 years to spare, I could learn to write Chinese in the same way as school children. I might even learn a tad faster, since adults in general have better pattern recognition skills.</p>
<p>Most adults, especially nowadays, have limited <strong>scope</strong> when it comes to learning to read and write Chinese as a second language. The digital age has ensured that we mostly need to be able to read: we can use phonetic input such as <em>pinyin</em> or <em>zhuyin</em> to write Chinese. Strictly speaking, recognising and reading Chinese characters is much more important than the ability to handwrite them. At least that is true for hobbyists like me; I am sure second language Chinese scholars have impeccable writing skills.</p>
<p>Second language Chinese learners are thus incentivised to efficiently allocate their time in a way that maximises their reading skills, not their writing skills. This inevitably means poorer writing and heavier use of mnemonics, glyph origins trivia and phonetic information as tools that aid memorisation.</p>
<p>I personally have found that writing is incredibly helpful when it comes to acquiring reading proficiency, but inevitably I have to balance the time I spend writing characters against the time I spend memorising new vocabulary.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2026/04/ignorance.jpg" alt="Hanzi of the day! 盲">
<center><font size="-1">Death on top of a book... Surely that means ignorance. I should submit that one.</font></center>
<p>That's it from me! I will <strong>see</strong> you all next time!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/illustrations/justice-equity-fairness-law-court-8845222/" target="_blank">pixabay</a> <a href="https://en.wiktionary.org/wiki/%E7%9B%B2" target="_blank">wiktionary</a> <a href="https://eword.ntpc.edu.tw/" target="_blank">eword.ntpc.edu.tw</a> <a href="https://www.pexels.com/photo/mysterious-skull-and-ancient-book-in-dark-setting-36232045/" target="_blank">pexels</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 聞]]></title><description><![CDATA[The smells of spring are all over!]]></description><link>https://nbogoychev.com/hanzi-of-the-day-16/</link><guid isPermaLink="false">699ed81c8ce6010590c751ce</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Wed, 25 Feb 2026 13:37:19 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2026/02/cover.JPG" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2026/02/cover.JPG" alt="Hanzi of the day! 聞"><p>Hello Everybody!</p>
<p>Today I chanced upon a very curious character <a href="https://en.wiktionary.org/wiki/聞" target="_blank">聞</a> and I would like you all to <em>hear</em> about it!</p>
<p><a href="https://en.wiktionary.org/wiki/聞" target="_blank">聞</a> <em>wén</em> consists of an ear <a href="https://en.wiktionary.org/wiki/耳" target="_blank">耳</a> inside of a door <a href="https://en.wiktionary.org/wiki/門" target="_blank">門</a>. Surely the meaning of this character has something to do with using the ears. I know a corresponding character that is a mouth inside a door <a href="https://en.wiktionary.org/wiki/問" target="_blank">問</a>, which means to ask a question. Using the same logic, <a href="https://en.wiktionary.org/wiki/聞" target="_blank">聞</a> should mean to hear. Yes! Well, almost. It means <strong>to smell</strong>.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2026/02/wtf.jpeg" alt="Hanzi of the day! 聞">
<center><font size="-2">My reaction when learning this hanzi</font></center>
<p>I immediately started investigating! There must be a reason for this:</p>
<ul>
<li>The meaning in classical Chinese was indeed <strong>to hear</strong>, so it is one of those hanzi that changed its meaning through the ages.</li>
<li>In modern Chinese it still maintains the <strong>hear</strong> meaning in some words, such as:
<ul>
<li>新聞 <strong>news</strong>, literally translating as new hearing (I guess new smells nowadays).</li>
<li>聞名 <strong>famous</strong>, literally, a name that is heard (I guess smelled)</li>
<li>The idiom 聞所未聞 <strong>unheard of</strong>.</li>
<li>Another idiom 聞風喪膽 <strong>to be terror-stricken at the sound of the news</strong>. Lol, we all feel that one.</li>
<li>Many, many other idioms.</li>
</ul>
</li>
<li>The meaning in modern Japanese is indeed still <strong>to hear</strong>.</li>
<li>The oracle bone script and bronze scripts also depict it as hearing:</li>
</ul>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2026/02/oracle_bone.png" alt="Hanzi of the day! 聞">
<center><font size="-2">A man who is covering his mouth and craning their ears to listen.</font></center>
<p>So where does the meaning of <strong>to smell</strong> come from?</p>
<p>There are some vague notions that <strong>smelling</strong> and <strong>hearing</strong> are close to each other, as they are both <strong>senses</strong>, so the meaning could jump around. Some languages group senses together as words that share the same root, or as polysemous words. This is indeed the case with some Austronesian or Tibeto-Burman languages, and maybe that's what happened in Chinese?</p>
<p>Or it could be due to phonetic changes, or influence from Mon-Khmer languages?</p>
<p>At the end of the day, nobody really knows.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2026/02/small_rock.gif" alt="Hanzi of the day! 聞">
<center><font size="-1">Or maybe they were predicting what news The Rock would bring up. That's my favourite explanation at the very least.</font></center>
<p>Have a nice day!</p>
<p>Nick(y)</p>
<p><font size="-4"><center>Image sources: personal uknonwn <a href="http://www.renlu.net/html/jiaguwenzidian_2324.html" target="_blank">renlu</a> <a href="https://tenor.com/view/the-rock-smell-wwe-gif-1699101244068475737" target="_blank">tenor</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Will concussions ever stop sucking? 3 year update.]]></title><description><![CDATA[More than three years have passed, and still my brain is not right.]]></description><link>https://nbogoychev.com/will-concussions-ever-stop-sucking/</link><guid isPermaLink="false">68f4584c8ce6010590c74bf5</guid><category><![CDATA[life]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 19 Oct 2025 06:24:00 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2025/10/cyber-brain.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2025/10/cyber-brain.jpg" alt="Will concussions ever stop sucking? 3 year update."><p>It's time for me to give a brief overdue update at the 3.7 year mark of my concussion accident. For those interested, here are the <a href="https://nbogoychev.com/concussions-suck-my-story/" target="_blank">4 month update</a> and the <a href="https://nbogoychev.com/concussions-still-suck-2-year-update/" target="_blank">2 year update</a>.</p>
<p>It's important to document how my recovery is going, both to serve me as a reminder that things are getting better, and to give hope to others: Things will get better!</p>
<h1 id="iamalmostnormaluntiliamnot">I am almost normal... Until I am not</h1>
<p>What is remarkable about this stage of my recovery is that I can now feel like my old self at times. I can wake up without the numbness in my face, I can look at my coworker's 60 Hz monitor, I can go to a concert, I can read a bit...  Well, maybe not reading...</p>
<p>And that's the part where it gets precarious. Just because <em>momentarily</em> I feel like my old self, doesn't mean that I am my old self. Inevitably, I will try to do something my old self would have enjoyed and then suffer for it:</p>
<ul>
<li>I used to love roller coasters and lunaparks. Back in December, I got on one for the first time in years, and while it wasn't a particularly intense one, my head didn't feel right for hours afterwards.</li>
<li>I used to headbang, just like any other metal fan. Well, I tried to do it again at a karaoke, and, well, it didn't go well. It set me back more than a full year, with a constant headache that, no joke, lasted months. I will never do that again.</li>
<li>I play video games. Well, most of them. Every now and then an artistic entry such as <a href="https://en.wikipedia.org/wiki/Return_of_the_Obra_Dinn" target="_blank">Return of the Obra Dinn</a> will give me hours-long motion sickness.</li>
<li>I watch animation again. Flashy new ones are fine; what trips me up are the old ones, where the animation is only 12 frames per second. No <a href="https://www.imdb.com/title/tt0417299/" target="_blank">Avatar</a>/<a href="https://en.wikipedia.org/wiki/Naruto" target="_blank">Naruto</a>/<a href="https://www.imdb.com/title/tt1219827/" target="_blank">Ghost in the Shell</a> for me: my brain can't process the moving pictures into a smooth movie and I get incredibly dizzy after even a few minutes of watching.</li>
<li>I watch movies. Most of them are fine. But some are not. Notably, some of the slow zoom corridor scenes in <a href="https://www.imdb.com/title/tt17526714/" target="_blank">The Substance</a> made me feel unwell. But it's OK, since modern TVs have artificial smoothing (also known as the dreaded soap opera effect). It's a life saver.</li>
</ul>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/10/alicia.png" alt="Will concussions ever stop sucking? 3 year update.">
<center><font size="-2">Alicia sunk into a dream world to escape her broken self. I understand why she did that and it's a constant struggle not to.</font></center>
<p>At the end of the day, I am not back to normal, I am still different, but the focus is always on what is getting better. Looking to the past is the path to depression.</p>
<h1 id="readingthebaneofmyexistence">Reading. The bane of my existence.</h1>
<p>I used to love reading, and it sucks that I can't do much of it. My eyes are still not good at the left-right movement (thanks, horizontal <a href="https://en.wikipedia.org/wiki/Nystagmus" target="_blank">nystagmus</a>), which makes reading books incredibly challenging. The hardest thing for me is long lines with single spacing. As long as the text window is narrow enough (or the line short), and there's plenty of spacing, my eyes can focus on the text with relative ease and it doesn't bother me.</p>
<p>Thanks to these little concussion life hacks, I am able to code and just about cope with my normal work duties, although reading academic papers can get quite tiring. Normal paperback books are out of the question almost entirely...</p>
<p>I should treat reading as physio exercises for injury recovery. A page a day to keep the doctor away... And wait for that neuroplasticity to rewire my brain in just the right way.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/10/wiring.jpeg" alt="Will concussions ever stop sucking? 3 year update.">
<center><font size="-2">An actual image of my neuron connections, some of them hanging on pure optimism, some of them just hanging...</font></center>
<p>But it is hard to do this, because if I overdo the reading I will suffer for it: I won't be able to work for the next few days, and it's scary. It's hard to justify doing this for the recreational benefit of reading.</p>
<h1 id="othervictories">Other victories</h1>
<p>It's not all gloom and doom! I am getting back to my hobbies and I am more confident at work.</p>
<ul>
<li>I am able to read Chinese again. I resumed language classes! Recalling characters just a year ago left me with debilitating headaches even one sentence in. Now, it seems to be fine.</li>
<li>I feel I can do complicated things again; it feels like I finally have enough concentration to go through a complicated codebase and hold it in my head. I can't work as much as I used to, but I hope that will come back.</li>
<li>I am slowly looking at sports again. I've avoided them due to the potential of hitting my head, but with enough precaution, maybe it can be fine? I am tentatively looking at acrobatics, and perhaps I can even play football again, with a <a href="https://en.wikipedia.org/wiki/Petr_%C4%8Cech" target="_blank">Petr Čech</a> style helmet.</li>
</ul>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/10/petr_cech.jpg" alt="Will concussions ever stop sucking? 3 year update.">
<center><font size="-2">Dude had a hole in his head and recovered, surely so can I.</font></center>
<p>Things are getting better. Maybe not as fast as I hope, but they are. I am better than 1 year ago. And 1 year ago, I was better than 2 years ago. The important thing is to not lose hope and to not overdo it. Listen to my body, relax when it tells me to relax and let it tell me which activities it can do for now and which it can't. And of course, persevere. It's not <em>I can't do this</em>, it's <em>I can't yet</em>.</p>
<p>Take care,<br>
Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/photos/cyber-brain-computer-brain-7633488/" target="_blank">pixabay</a> <a href="https://www.expedition33.com/" target="_blank">Expedition 33</a> <a href="https://commons.wikimedia.org" target="_blank">wikipedia</a> <a href="https://en.wikipedia.org/wiki/Petr_%C4%8Cech" target="_blank">wikipedia</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 后]]></title><description><![CDATA[The one who stays behind the king...]]></description><link>https://nbogoychev.com/hanzi-of-the-day-yong/</link><guid isPermaLink="false">683322018ce6010590c74a7e</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 08 Jun 2025 22:34:01 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2025/06/cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2025/06/cover.jpg" alt="Hanzi of the day! 后"><p>A long time ago, when the eternal Elizabeth II ruled the domain, I was writing my Chinese homework and looked up the word <strong>queen</strong> in the dictionary. What I found, <a href="https://en.wiktionary.org/wiki/王后" target="_blank">王后</a> <em>wánghòu</em>, was particularly hilarious, because:</p>
<ul>
<li><a href="https://en.wiktionary.org/wiki/王" target="_blank">王</a> <em>wáng</em>, aside from one of most common Chinese sirnames, means <strong>king</strong></li>
<li><a href="https://en.wiktionary.org/wiki/后" target="_blank">后</a> <em>hòu</em> means <strong>behind</strong></li>
</ul>
<p>Technically, this word means &quot;Queen Consort&quot; and not an actual ruler, but still, very nice, very sexist, China; so the queen is literally defined as the one who is behind the king. The truth, however, is much more... Interesting? Worse? I'll let you decide.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/06/king.jpg" alt="Hanzi of the day! 后">
<center><font size="-2">A picture of a queen. She's not visible as she's behind the king...</font></center>
<p>As some of you may know, Chinese characters in mainland China went through a process called simplification, which attempted to improve the literacy rate of the population. Whether it succeeded, failed or had literally no effect is a matter of scholarly debate, but the end result is that some hanzi now have two forms: traditional and simplified.</p>
<p>One of the ways in which simplification was performed was by <a href="https://en.wikipedia.org/wiki/Simplified_Chinese_characters#Structural_simplification_2:~:text=Readopting%20abandoned%20phonetic%2Dloan%20characters%3A" target="_blank">adopting a phonetic loan</a>, and this is precisely what happened to the original character for <strong>behind</strong>. Originally spelled <a href="https://en.wiktionary.org/wiki/後" target="_blank">後</a>, it was simplified to <a href="https://en.wiktionary.org/wiki/后" target="_blank">后</a>, as they both share the same pronunciation <em>hòu</em>, but the latter is much easier to write.</p>
<p>The original meaning of <a href="https://en.wiktionary.org/wiki/后" target="_blank">后</a> is <strong>queen</strong> and has <s>nothing</s> less to do with behinds. The glyph depicts a woman giving birth to the heir of the throne:</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/06/glyph_origin_combined.png" alt="Hanzi of the day! 后">
<center><font size="-2">Ancient forms of the hanzi. Giving birth while sitting on the toilet? A curious depiction.</font></center>
<p>Now this is so much better. A queen is not defined by her position relative to the king, but by her 3D printing capabilities.</p>
<p>Looking at how the glyph developed through the centuries, I can't shake off the feeling that it looks like the woman is sitting on the toilet. And indeed, some scholars argue that the glyph depicts a person and a hole, meaning <strong>rear</strong>, <strong>behind</strong> or <strong>anus</strong>. The character was sometimes used to represent this precise meaning in the <a href="https://en.wikipedia.org/wiki/Oracle_bone" target="_blank">oracle bone era</a>.</p>
<p>If this is true though, why does it also mean <strong>queen</strong>? Does it indeed depict a childbirth? Were ancient sages unfamiliar with female anatomy? Was it up to debate where the baby came from? Had they even seen a woman? We will never know.</p>
<p>Have a good day!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/photos/creative-theater-queen-georgia-1041597/" target="_blank">pixabay</a> <a href="https://commons.wikimedia.org" target="_blank">wikipedia</a> <a href="https://www.pexels.com/photo/close-up-photo-of-playing-cards-1796794/" target="_blank">pexels</a> </center></font></p>
<p>EDIT:</p>
<p>An astute reader has pointed out that this character means a queen consort, and not an actual ruler. A queen that rules is 女王 or &quot;Female King&quot;, which is not as bad as &quot;Birther of Princes&quot;.</p>
<!--kg-card-end: markdown--><p></p>]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 舞 and 無]]></title><description><![CDATA[Dancing for the rain...]]></description><link>https://nbogoychev.com/hanzi-of-the-day-13/</link><guid isPermaLink="false">5fba6adb497471067eb029b9</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 18 May 2025 21:45:43 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2025/05/dance.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2025/05/dance.jpg" alt="Hanzi of the day! 舞 and 無"><p>Hello everyone! After some hiatus, I decided I should be more serious about blogging. After all, every blog written is one character learned! Which means that I just need to write about 4986 more blog posts before I can finally read a bloody book in Chinese without consulting a dictionary on every sentence, but I digress... Today we are going to talk about Dancing. And nothingness.</p>
<p><a href="https://en.wiktionary.org/wiki/舞" target="_blank">舞</a> <em>wǔ</em> means <strong>to dance</strong> and is a rather complicated looking hanzi that is taught in beginner level Chinese. The character has a ritualistic origin: A person holding ox tails (or feathers), performing a rain dance! (A bit more obvious in the historical forms of the character shown below.)</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/05/Screenshot_20250518_213015.png" alt="Hanzi of the day! 舞 and 無">
<center><font size="-2">Which makes sense, dances are often an important part of rituals.</font></center>
<p>Now, nothing surprising so far, until I came across the phrase 無情 as I was reading some beginner books. I thought: I know 舞 is dance and 情 is feeling, so this must be someone really happy or gracious. I checked it in the dictionary just in case, and I was very surprised to see it meant <strong>heartless</strong>. WHAT?!?</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/05/shock.jpg" alt="Hanzi of the day! 舞 and 無">
<center><font size="-2">An actual image of me checking the dictionary.</font></center>
<p>Well, it turns out that <a href="https://en.wiktionary.org/wiki/無" target="_blank">無</a> <em>wú</em>, while visually very similar to <a href="https://en.wiktionary.org/wiki/舞" target="_blank">舞</a>, is actually a completely different character with a distinct meaning: the absence of something, nothingness, sort of similar to the English prefix <strong>un-</strong>.</p>
<p>Historically <a href="https://en.wiktionary.org/wiki/無" target="_blank">無</a> did indeed mean <strong>to dance</strong>, but that form was <s>borrowed</s> stolen to mean negation, as this word is more common. And thus, there was no character left to dance!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2025/05/empty_dancefloor.jpg" alt="Hanzi of the day! 舞 and 無">
<center><font size="-2">A sad, empty dancefloor...</font></center>
<p>What's the solution? Slap a pair of legs <a href="https://en.wiktionary.org/wiki/舛" target="_blank">舛</a> under <a href="https://en.wiktionary.org/wiki/無" target="_blank">無</a> to form <a href="https://en.wiktionary.org/wiki/舞" target="_blank">舞</a> and call it a day!</p>
<img align="middle" width="50%" src="https://nbogoychev.com/content/images/2025/05/voltron_small.gif" alt="Hanzi of the day! 舞 and 無">
<center><font size="-2">Just like in Voltron!</font></center>
<p>That's it from me, have a nice week!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://www.pexels.com/photo/5-women-in-white-dress-dancing-under-gray-sky-during-sunset-175658/" target="_blank">pexels</a> <a href="https://commons.wikimedia.org" target="_blank">wikipedia</a> <a href="https://www.pexels.com/photo/man-in-brown-leather-jacket-using-binoculars-3811807/" target="_blank">pexels</a> <a href="https://www.flickr.com/photos/mccaffry/5170431771" target="_blank">flickr</a> <a href="https://www.imdb.com/title/tt0086824/" target="_blank">voltron</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪]]></title><description><![CDATA[Today's edition of Hanzi of the day explores the power of logography when it comes to... profanities!]]></description><link>https://nbogoychev.com/hanzi-of-the-day-15/</link><guid isPermaLink="false">66f07a838ce6010590c74732</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Mon, 23 Sep 2024 00:14:14 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2024/09/cursing.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2024/09/cursing.jpg" alt="Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪"><p>Hello hanzi lovers!<br>
It has been a while since I last wrote about my beloved Chinese characters. Sadly, life and work got in the way. Incidentally, it is my latest work that brought me to these characters. Don't ask...</p>
<p>Chinese characters in their purest form represent ideas about concepts, which is why they are called ideograms. Let us figure out the meanings of 𨳒, 𨳊, 𨳍 and 閪 just by looking at the images.</p>
<p>The first thing that comes to mind is that there is a common element, <strong>門</strong>. The meaning of this element is a door.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/09/saloon.jpeg" alt="Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪">
<center><font size="-2">It resembles a classic cowboy style saloon door.</font></center>
<p>Once characters are created, they can change in meaning, or assume a new meaning in a certain context. This mix-n-match strategy is more tractable than inventing a new character for every single concept. In this case, the meaning of <strong>門</strong> refers strictly to a body part (or rather a section of the body). I am sure everyone will eventually figure out that it represents the <em>hips</em> that, according to Shakira, do not lie.</p>
<p>What about the remaining components? 小, 九, 七 and 西 have their own individual meanings, but the important part here is that their pronunciations in Cantonese (<em>siu2 gau2 cat1 sai1</em>) rhyme with <em>diu2 gau1 cat1 hai1</em>. When put together with 門 they represent what is located <em>between</em> the hips, and that's how you get 𨳒, 𨳊, 𨳍 and 閪. Incredibly, even though the primary purpose of 小, 九, 七 and 西 is to serve as a phonetic component, they also make sense from a purely ideographic perspective!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/09/idea.jpg" alt="Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪">
<center><font size="-2">The sheer ingenuity!</font></center>
<ul>
<li>
<p><a href="https://en.wiktionary.org/wiki/%F0%A8%B3%92" target="_blank">𨳒</a> <em>diu2</em> originally referred to the male genitalia (obviously), but the meaning later evolved to the verb <em>fuck</em>, the bread-and-butter of insults. It is used in 𨳒你老母 <em>diu2 nei5 lou5 mou2</em>, probably the most well known Cantonese phrase among the western population. I apologise on the behalf of white people.</p>
</li>
<li>
<p><a href="https://en.wiktionary.org/wiki/%F0%A8%B3%8A" target="_blank">𨳊</a> <em>gau1</em> represents the erect penis meaning <em>cocky</em> or just plain <em>stupid</em>. A friend of mine pointed out that it is specifically <em>an erect penis when it's not supposed to be</em> which also conveys the meaning of inexperience or immaturity. Pay attention to the upward pointing bit of 九 and consider how it contrasts with 七, used in 𨳍.</p>
</li>
<li>
<p><a href="https://en.wiktionary.org/wiki/%F0%A8%B3%8D" target="_blank">𨳍</a> <em>cat1</em> on the other hand corresponds to the impotent <em>penis</em>, because as we all know there is no bigger insult you can hurl towards a man. It means something that <em>ugly</em> or <em>shameful</em>, but appears in variety of other insults.</p>
</li>
<li>
<p><a href="https://en.wiktionary.org/wiki/%E9%96%AA" target="_blank">閪</a> <em>hai1</em> is dedicated to the female genitalia, and is used in a context similar to <em>cunt</em>.</p>
</li>
</ul>
<p>I am honestly, genuinely impressed that <em>working</em> and <em>not working</em> penis deserve two separate characters. Evidently distinguishing between the two is important enough to deserve its own vocabulary item. Also, three whole characters for male genitalia, but only one for female? Sexism much?</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/09/equality2.jpg" alt="Hanzi of the day! 𨳒, 𨳊, 𨳍, 閪">
<center><font size="-2">Equality for all! More female genitalia characters!</font></center>
<p>These are relatively new characters, used exclusively in Hong Kong and Macau. They don't really work in Mandarin because the phonetic elements don't make sense, and you would not find all of them in Mandarin-centric dictionaries.</p>
<p>In Mandarin, the construction of those words has occurred in a somewhat similar fashion, using the <em>body</em> 尸 radical plus a phonetic element:</p>
<ul>
<li>
<p><a href="https://en.wiktionary.org/wiki/%E5%B1%8C" target="_blank">屌</a> <em>diǎo</em> literally something <em>hanging</em> 吊 from the <em>body</em> 尸. Quite descriptive. It is likely derived from 鳥 <em>niǎo</em>, as in Chinese words for birds also mean penis. Go figure.</p>
</li>
<li>
<p><a href="https://en.wiktionary.org/wiki/%E5%B1%84" target="_blank">屄</a> <em>bī</em> literally a <em>body</em> 尸 + a <em>hole</em> 穴.</p>
</li>
</ul>
<p>This is by no means an exhaustive list of penis and vagina characters, as the list in both Mandarin and Cantonese could go on and on. Nevertheless, I hope you enjoyed learning how logophonetic characters are constructed in a way that also makes sense when considered purely as an image!</p>
<p>Until next time!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://www.flickr.com/photos/christopherdale/23860378" target="_blank">flickr</a> <a href="https://commons.wikimedia.org/wiki/File:Red_Dog_Saloon_31.JPG" target="_blank">wikipedia</a> <a href="https://pixabay.com/illustrations/ai-generated-lightbulb-idea-8593862/" target="_blank">pixabay</a> <a href="https://www.pexels.com/photo/man-in-blue-denim-jacket-holding-brown-cardboard-with-equality-text-5935746/" target="_blank">pexels</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Concussions still suck: 2 year update]]></title><description><![CDATA[2 years later, my concussion still plagues me. I am not fully recovered yet, but I am better.]]></description><link>https://nbogoychev.com/concussions-still-suck-2-year-update/</link><guid isPermaLink="false">66abe2c8369afdcac26f7d5c</guid><category><![CDATA[life]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 18 Aug 2024 21:23:03 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2024/08/brain.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2024/08/brain.png" alt="Concussions still suck: 2 year update"><p>It has been two years since my previous <a href="https://nbogoychev.com/concussions-suck-my-story/" target="_blank">concussion post</a>, and I should give an update. Why you ask? It's not (only) because of narcissism but because I actually received emails from readers of my blog asking me whether I got better.</p>
<p>The thing is, bad concussions, especially the ones that come with <a href="https://en.wikipedia.org/wiki/Post-concussion_syndrome" target="_blank">post concussion syndrome</a>, are extremely tough on one's mental health. Since doctors can't give us headbangers a silver bullet solution to our predicament, we inevitably scour the Internet for information. Most importantly though, it is not really information we are all looking for. It's <strong>hope</strong>. I know I was.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/hope.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">Light in the end of the tunnel, while not guaranteed, is probable.</font></center>
<p>It's important to update the story and tell people the good and the bad, because things do get better and there is always hope.</p>
<h2 id="theneurologist">The neurologist</h2>
<p>In July 2022, since I was nowhere near getting better, I visited a neurologist and had an MRI scan done. The MRI apparently showed that I had a small gliotic focus, a physical manifestation of banging up one's brain. According to the <a href="https://en.wikipedia.org/wiki/Gliosis" target="_blank">Internet</a>, it's akin to <a href="https://www.ncbi.nlm.nih.gov/books/NBK326735/" target="_blank">scar tissue</a>, but in your brain, with a protective, not fully understood function?</p>
<p>I asked the neurologist about it and he told me not to worry, take some rest and a chill pill. I mean, I know rest is important but surely there must be something else I can do, right? Well, he did actually prescribe me something.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/chill.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">It seems that, according to doctors 99% of your problems will just go away if you simply just ignore them. I wonder if concussions aren't in the other 1% though...</font></center>
<h2 id="thedrugs">The drugs</h2>
<p>I was prescribed <a href="https://en.wikipedia.org/wiki/Piracetam" target="_blank">Piracetam</a> to take twice a day. Again, reading up on the Internet, its mechanism of action is to increase blood flow to the brain and supposedly helps older people in cognitive decline, which, technically, I was. My brain couldn't cope with basic functions such as reading and looking at a screen.</p>
<p>I was worried about taking it for prolonged periods of time, especially given the lengthy list of side effects including but not limited to tremors, anxiety, insomnia, hypersexuality... But then I found reassurance in the most unlikely of places: tech bros.</p>
<p>Piracetam is considered a nootropic drug that Silicon Valley biohackers take 3-5 times a day in order to get that little extra edge over their peers. That is like 300% of my prescribed dose, and apparently those people are just fine?! God, that place is a dystopian nightmare, but I digress.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/sillicone_valley.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">This series is way too accurate.</font></center>
<h2 id="thelight">The light</h2>
<p>I started taking Piracetam and I immediately got the tremors and the jitteriness. It feels a bit like having too much coffee, except your heart is not affected. But boy did it help with EVERYTHING. I spent a full day coding. Yeah, it was extremely exhausting, much more so than I was used to, but I managed to do work. No crippling dizziness, no noise in my head, just a short spell of... Normalcy.</p>
<h2 id="thedarkness">The darkness</h2>
<p>Even one pill was too much for me, in terms of side effects, so I decided I'd take only one a day. Even so, the insomnia was horrible. I could not fall asleep at all. Maybe 10-15 minutes here and there, and the rest of the night was spent staring at the ceiling contemplating the universe. I read online that the side effects gradually disappear as your body adjusts, and this was true, but sadly so did the positive effects...</p>
<p>I started playing video games again in August 2022, 5 months after my accident, and I had to stop again in October because my brain wouldn't allow for it anymore. I did watch a few animes, but by October that also became unbearable. I was despairing again about my predicament, feeling like I would never be normal again.</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/drowning.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">My feeling at the time</font></center>
<p>I was definitely getting better, but not as fast or as much as I wanted. There were always good days and bad days. At best I was enjoying a challenging but functional day at work and then collapsing from exhaustion once I came home. At worst, I would be unable to understand what other people were saying to me, somehow make it home and wake up a few hours later completely unable to recognise where I was, with no memory of how I got there. I was in my bedroom, in my bed, in my home of 5 years. That shit still haunts me.</p>
<h2 id="thebreakthrough">The breakthrough</h2>
<p>Long haul flights were a bit of a challenge as I couldn't really make use of the in-flight entertainment. On one particularly boring flight I noticed a curiosity. I was in an aisle seat on the left side of the plane, and if I looked to my right I could watch the movies other passengers were enjoying (almost) without any ill effect, but if I looked to my left, it was very obviously much worse. Also, the further away the screen was from me (2 rows in front of me, even 3), the better. Apparently the damage was on the right side of my brain (actually I had that information from my MRI but hadn't connected the dots earlier). I tried covering up my left eye and indeed I felt a momentary relief.</p>
<p>What a remarkable discovery!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/eureka.jpg" alt="Concussions still suck: 2 year update">
<center><font size="-2">Eureka!</font></center>
<p>Once I got back from my brief stint abroad, I immediately put my monitor on the right side of the desk and moved my computer sitting spot about 2 meters away from my screen. And then! WoW! It just, <em>worked</em>. I was able to spend more and more time on the computer doing more visually intense things.</p>
<p>Over the next few weeks I slowly started to reduce my distance from the computer in an effort to teach myself how to be normal again. I watched my first movie since the accident (<em>Everything Everywhere All at Once</em>), sitting sideways on my couch.</p>
<p>I went to the cinema for the first time since the calamity, sat sideways and enjoyed a movie (<em>Dungeons and Dragons</em>). I looked weird, people looked at me, but I didn't care... I WAS AT THE MOVIES AGAIN! I was living again. (Technically, I was living beforehand as well, but... ups and downs, depression and all.)</p>
<h2 id="itsstillnotover">It's still not over</h2>
<p>Right now, I am fully functional. I can't work as much as I did before, but I can do a job and be considered a productive member of society. I still can't enjoy reading though...</p>
<p>Books are really hard for me. My eyes just refuse to do that left-right movement and I have to go about it very slowly. One week I decided to push through the discomfort in order to read <em>The Three-Body Problem</em> and... It was bad. I relapsed completely, with brain fog, memory gaps, extreme tiredness, inability to work, inability to even look at a screen.</p>
<p>This episode lasted about a month and I am extremely wary of reading now. I need to slowly introduce it to my delicate brain. It's even worse for learning languages. Reading in a foreign language triggers numbness on the left side of my head even after just a few minutes. I live knowing full well that if I am not careful I will relapse and go back months with my progress.</p>
<p>Last December I decided to go on a carousel to see if I could potentially go to a lunapark. Spoiler alert: I can't. I have vivid memories of that night lying on my bed and wondering when the ceiling would stop spinning so I could finally bloody fall asleep...</p>
<h2 id="disclaimer">Disclaimer</h2>
<p>Every concussion is individual. This is what worked for me, but it's not necessarily what would work for another person in a similar situation. First, listen to your doctor and to your body. If your body is telling you something is bad for you, don't do it.</p>
<p>But most importantly don't despair! There is <strong>hope</strong>. Do the things that you can do, and look for new hobbies. It's not about getting back to 100% your old self right away. It's about enjoying life to the best of your ability one day and one activity at a time.</p>
<p>And finally, talk to your friends and the people who love you. This was certainly the darkest time in my life (so far) and I wouldn't be able to make it without them! Thank you everyone!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/08/homelander.gif" alt="Concussions still suck: 2 year update">
<center><font size="-2">I really do mean it!</font></center>
<p><b><a href="https://nbogoychev.com/will-concussions-ever-stop-sucking/" target="_blank">One year later update</a></b></p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/illustrations/futuristic-brain-cyborg-technology-8789975/c" target="_blank">pixabay</a> <a href="https://pixabay.com/photos/hands-open-candle-candlelight-1926414/" target="_blank">pixabay</a> <a href="https://pixabay.com/photos/tree-hammocks-trees-grass-summer-413714/" target="_blank">pixabay</a> <a href="https://www.imdb.com/title/tt2575988/" target="_blank">Silicon Valley</a> <a href="https://pixabay.com/photos/person-drowning-water-hand-drown-5708301/" target="_blank">pixabay</a> <a href="https://pixabay.com/illustrations/brain-eureka-think-thinking-8573309/" target="_blank">pixabay</a> <a href="https://www.imdb.com/title/tt1190634/" target="_blank">The Boys</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics]]></title><description><![CDATA[Can we boost LLM inference speed by applying this one machine translation trick they don't want you to know..?]]></description><link>https://nbogoychev.com/the-ups-and-downs-of-large-language-model-inference-with-vocabulary-trimming-by-language-heuristics/</link><guid isPermaLink="false">6686e8be369afdcac26f7b79</guid><category><![CDATA[research]]></category><category><![CDATA[large language models]]></category><category><![CDATA[LLM]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 07 Jul 2024 20:08:38 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2024/07/trimming.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2024/07/trimming.jpg" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics"><p>One of the main barriers to Large Language model deployment is the cost of inference. Lowering the computational footprint without hurting the quality of the model is an extremely hot topic in research due to the <a href="https://www.firstpost.com/tech/news-analysis/openai-may-go-bankrupt-by-2024-chatgpt-costs-company-700000-dollars-every-day-12986012.html" target="_blank">staggering</a> costs that serving large language models entail.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2024/07/burning.jpg" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics">
<center><font size="-2">Footage from OpenAI's headquarters when someone asks ChatGPT what's the time of the day...</font></center>
<h3 id="theidea">The idea</h3>
<p>As <s>Machine Translation</s> Large Language Model researchers, we turned our attention to the obvious culprit: the output layer, which represents the vocabulary of a Large Language Model. It has dimensions <em>H</em>x<em>V</em>, where <strong>H</strong> is the hidden layer dimension of the model and <strong>V</strong> is the vocabulary size. <strong>V</strong> is massive. For monolingual models such as LLaMa it's around 30,000 tokens, but for multilingual models such as Bloom, it is more than 250,000! The output layer is the single largest matrix in the model, consumes a lot of memory, and its multiplication is quite costly.</p>
<p>In practice, however, we <strong>never</strong> make use of the full vocabulary at the same time. Most large language model queries and generations only contain a few dozen tokens. If we could only somehow know in advance which tokens would be used during a generation, we could dynamically trim the vocabulary to a fraction of its full size.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2024/07/cuuuut.jpg" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics">
<center><font size="-2">When LLM performance gives you lemons...</font></center>
<h3 id="theimplementation">The implementation</h3>
<p>Dynamically reducing the size of the output layer is commonly used to speed up machine translation, as we can easily predict which words are going to be used in a translation, but the output of LLMs can be unbounded. How do we filter their vocabulary?</p>
<p>We can make the assumption that if a question is asked in English, the reply would also need to be in English. So we could reduce the vocabulary to only the tokens necessary for producing that language. We came up with not one but two separate ideas about how to achieve this!</p>
<h4 id="unicodebasedtrimming">Unicode based trimming</h4>
<p>Use the alphabet. LLMs, especially multilingual ones, contain vocabulary for multiple languages, which are written in different scripts. We can remove all vocabulary items that are written in a different script from the one our target language uses. We call this the <strong>unicode</strong> method, as we filter the vocabulary based on Unicode ranges.</p>
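<p>For the curious, here is a minimal sketch of what such a filter might look like in Python. The model name, the allowed script set and the bucketing trick are illustrative rather than the exact setup from the paper; the idea is simply to keep every vocabulary entry whose characters belong to the target language's script.</p>
<pre><code>import unicodedata
from transformers import AutoTokenizer

# Illustrative model choice; any tokenizer exposing get_vocab() works the same way.
tok = AutoTokenizer.from_pretrained('bigscience/bloom-560m')

def script_of(ch):
    # Crude script bucketing via the Unicode character name,
    # e.g. 'CYRILLIC SMALL LETTER A' -&gt; 'CYRILLIC'.
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return 'UNKNOWN'

def unicode_trim(tokenizer, allowed_scripts=frozenset({'LATIN', 'DIGIT'})):
    keep = set()
    for token, idx in tokenizer.get_vocab().items():
        text = tokenizer.convert_tokens_to_string([token])
        scripts = {script_of(c) for c in text if not c.isascii()}
        # Pure ASCII/punctuation tokens are always kept; everything else must
        # be written entirely in one of the allowed scripts.
        if scripts.issubset(allowed_scripts):
            keep.add(idx)
    return keep

allowed_ids = unicode_trim(tok)
print(f'kept {len(allowed_ids)} of {len(tok.get_vocab())} tokens')
</code></pre>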
<h4 id="corpusbasedtrimming">Corpus based trimming</h4>
<p>Use a small representative corpus to build a dictionary. For example, build a dictionary from newspaper articles in the language you are interested in. This method has the advantage of letting through words that could be spelled in a different script (e.g. named entities from foreign countries may be spelled in a foreign script). We call this the <strong>corpus</strong> method.</p>
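<p>A sketch of the <strong>corpus</strong> method, reusing the tokenizer from the snippet above. The corpus file name is a placeholder: any small representative sample of target-language text works.</p>
<pre><code>def corpus_trim(tokenizer, corpus_path):
    """Build an allow-list from a small representative corpus,
    e.g. newspaper articles in the target language."""
    keep = set(tokenizer.all_special_ids)  # never drop BOS/EOS/PAD etc.
    with open(corpus_path, encoding='utf-8') as f:
        for line in f:
            keep.update(tokenizer(line.rstrip('\n'))['input_ids'])
    return keep

allowed_ids = corpus_trim(tok, 'news_sample.txt')  # placeholder corpus file
</code></pre>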
<p>Those heuristics are quite rough and we were sure we could come up with something better to further narrow the vocabulary, but we also wanted to produce an upper bound on how much performance we could hope to gain. In order to do that we performed an <strong>oracle</strong> experiment, where we ran a decoding pass over 50 examples, took note of the vocabulary they used, and then limited the model to only those vocabulary tokens. This results in a vocabulary of only a few thousand tokens, which would be difficult to achieve in practice.</p>
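<p>Whichever heuristic produces the allowed set of token ids, applying it boils down to slicing the output projection down to those rows and mapping the reduced ids back to the original vocabulary afterwards. A minimal PyTorch sketch, assuming a Hugging Face style model with an untied <code>lm_head</code>; a full implementation would also have to remap ids before they are fed back through the (untrimmed) input embedding during autoregressive decoding.</p>
<pre><code>import torch

def trim_output_layer(model, allowed_ids):
    """Replace the full H x V output projection with an H x V' slice.
    Returns the index tensor that maps reduced ids back to original ids."""
    keep = torch.tensor(sorted(allowed_ids))
    lm_head = model.get_output_embeddings()        # nn.Linear(H, V, bias=False)
    small = torch.nn.Linear(lm_head.in_features, len(keep), bias=False)
    with torch.no_grad():
        small.weight.copy_(lm_head.weight[keep])   # select the V' kept rows
    model.set_output_embeddings(small)
    return keep

# keep[reduced_id] gives back the original vocabulary id, which is what the
# tokenizer needs for detokenisation after generation.
</code></pre>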
<h3 id="theups">The ups</h3>
<p>We did get some ups!</p>
<ul>
<li>We managed to reduce decoding time by up to 20% (25% if we consider the <strong>oracle</strong> experiment), but only for (comparatively) small 560M parameter models. Bigger models see only a modest 5-10% reduction in decoding time.</li>
<li>A smaller vocabulary means less memory! We reduce the memory usage of the output layer by a factor of 10 in most cases!</li>
</ul>
<h3 id="thedowns">The downs</h3>
<ul>
<li>The speed increase is much lower in larger models, because the vocabulary plays a lesser role in their computational footprint than it does in small models (which are already quite fast on modern hardware).</li>
<li>Memory reductions are insignificant for large models, as the vocabulary represents just a tiny fraction of the total number of parameters...</li>
<li>The quality of the generation drops. We expected the reduced vocabulary to produce generations identical to the full vocabulary, but it turns out that things such as URLs and code samples always require Latin characters, which were not available to our Chinese and Cyrillic models, resulting in more mismatches (labelled as <em>misses</em> in the table) and poorer generation quality. Chinese seems to suffer a lot more in this regard.</li>
<li>The methods perform inconsistently across languages. Latin script languages have a harder time getting their vocabulary reduced by Unicode matching; likewise, it might be difficult to find a representative corpus for lower resource languages:</li>
</ul>
<img align="middle" width="100%" src="https://nbogoychev.com/content/images/2024/07/CPU_BLOOM.png" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics">
<center><font size="-2">For cases where we didn't get an exact match to the genration using the full vocabulary, we computed BLEU and ChrF to estimate the quality.</font></center>
<h3 id="conclusion">Conclusion</h3>
<p>We did achieve what we hoped for! We did improve generation speed.</p>
<p>However, the methods we developed have limited practical use: it's very difficult to guarantee quality with the reduced vocabulary, which is a major showstopper. Furthermore, for large models the size of GPT-4, the computational cost of the output layer is a tiny fraction of the cost of attention.</p>
<p>Oh well, it's not all bad. Our methods could be useful for small models in memory constrained scenarios. We also saved other researchers time by <a href="https://aclanthology.org/2024.insights-1.17/" target="_blank">publishing</a> in the <a href="https://insights-workshop.github.io/" target="_blank">Workshop on insights from negative results in NLP</a>!</p>
<p>Oh, and we got the best paper award!</p>
<img align="middle" width="80%" src="https://nbogoychev.com/content/images/2024/07/best_ppr.png" alt="The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics">
<center><font size="-2">Our small vanity corner.</font></center>
<p>If you are interested in the details, read the <a href="https://aclanthology.org/2024.insights-1.17/" target="_blank">paper</a> and cite us!</p>
<pre><code>@inproceedings{bogoychev-etal-2024-ups,
    title = &quot;The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics&quot;,
    author = &quot;Bogoychev, Nikolay  and
      Chen, Pinzhen  and
      Haddow, Barry  and
      Birch, Alexandra&quot;,
    editor = &quot;Tafreshi, Shabnam  and
      Akula, Arjun  and
      Sedoc, Jo{\~a}o  and
      Drozd, Aleksandr  and
      Rogers, Anna  and
      Rumshisky, Anna&quot;,
    booktitle = &quot;Proceedings of the Fifth Workshop on Insights from Negative Results in NLP&quot;,
    month = jun,
    year = &quot;2024&quot;,
    address = &quot;Mexico City, Mexico&quot;,
    publisher = &quot;Association for Computational Linguistics&quot;,
    url = &quot;https://aclanthology.org/2024.insights-1.17&quot;,
    pages = &quot;148--153&quot;,
}
</code></pre>
<p>Cya,</p>
<p>Nick, Patrick, Lexi, Barry</p>
<p><font size="-4"><center>Image sources: <a href="https://www.pexels.com/photo/gardener-cutting-branches-of-tree-in-garden-5231048/
" target="_blank">pexels</a> <a href="https://www.pexels.com/photo/person-holding-burning-money-7230878/" target="_blank">pexels</a> <a href=" https://www.pexels.com/photo/photo-of-person-slicing-lemon-952368/" target="_blank">pexels</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting]]></title><description><![CDATA[How do we get the best terminology machine translation system? Well, I personally don't know but asking ChatGPT nicely about it doesn't hurt...]]></description><link>https://nbogoychev.com/terminology-machine-translatioton-with/</link><guid isPermaLink="false">6561078a369afdcac26f7910</guid><category><![CDATA[code]]></category><category><![CDATA[machine translation]]></category><category><![CDATA[research]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Thu, 30 Nov 2023 14:22:51 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2023/11/logo.svg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2023/11/logo.svg" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting"><p>Language is a source of great misunderstandings. Translation, even more so. Machine translation... Well we all know how that goes:</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/smaller.jpg" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting">
<center><font size="-2">It should be <i>dried</i> vegetables, but nameless translation service provider knows better...</font></center>
<p>Now, jokes aside, here is the problem. Words have different meanings in different domains and contexts. It's impossible for translators to know all possible domains and contexts, so they make use of <em>terminology dictionaries</em>.</p>
<p>The <a href="https://wmt-terminology-task.github.io/" target="_blank">WMT 2023 terminology shared task</a> challenges researchers to apply those <em>terminology dictionaries</em> to Machine translation, and we answered the call with several distinct systems:</p>
<h3 id="terminologyawareneuralmachinetranslation">Terminology-aware neural machine translation</h3>
<p>The main idea is that we want to teach the neural network model to accept <em>hints</em> from the user about how to translate certain phrases. For example, if given the input:</p>
<blockquote>
<p>Was ist Ihr Herkunftsland?</p>
</blockquote>
<p>The model would produce:</p>
<blockquote>
<p>What is your country of origin?</p>
</blockquote>
<p>Which is correct, but we may want to influence the model to produce a less formal translation:</p>
<blockquote>
<p>Was ist Ihr Herkunftsland __target__ homeland __done__?</p>
</blockquote>
<p>So that the translation changes to:</p>
<blockquote>
<p>What is your homeland?</p>
</blockquote>
<p>The neural network requires a large number of <code>GERMAN_WORD __target__ ENGLISH_WORD __done__</code> examples from a terminology dictionary during training, so that it can learn this behaviour. Unfortunately, we often don't have access to a good terminology dictionary, so we build one from our training data!</p>
<h4 id="wordalignmentbasedterminologydictionary">Word Alignment based terminology dictionary</h4>
<p>We use an IBM model to compute word alignments of our parallel training set, and then we take all the words with bijective mappings (that is to say each source word corresponds to exactly one target word) and use them as our pseudo terminology dictionary. Then, during training we randomly expose our model to 7% of those source-target terminology pairs using the <s>subliminal message</s> control sequences <code>SRC __target__ TRG __done__</code> on the source side. We do this using our awesome <a href="https://github.com/hplt-project/OpusTrainer" target="_blank">OpusTrainer</a> <a href="https://nbogoychev.com/opuscleaner-and-opustrainer-machine-translation-training-made-easy/" target="_blank">[blog]</a> <a href="https://arxiv.org/abs/2311.14838" target="_blank">[paper]</a>.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/yuri_transparent.png" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting">
<center><font size="-2">A real time image of me <i>trying</i> to give subliminal messages to my neural network.</font></center>
<p>Now, this is all good, and it works quite well in practice: we get up to a 75% terminology success rate using this approach, but we are not <em>guaranteed</em> to follow the terminology constraints: the model is free to ignore the suggestion. This is why we built two refinement approaches on top:</p>
<h3 id="negativelyconstrainedtranslation">Negatively constrained translation</h3>
<p>Since at inference time we have access to a terminology dictionary, we can figure out when a terminology constraint has not been followed, as it would not appear in the translation. We then use <a href="https://github.com/neulab/awesome-align" target="_blank">awesome-align</a> to figure out which word was used instead of our desired terminology word. We then perform translation again, but this time we <strong>forbid</strong> that problematic word from being produced.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/erase.jpg" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting">
<center><font size="-2">If we imagine our machine translation system's vocabulary as a dictionary, negatively constrained decoding <i>amputates</i> select words.</font></center>
<h3 id="askchatgptnicely">Ask-chatGPT-nicely</h3>
<p>The previous approach was quite complicated and convoluted. Since we are already in the era of LLMs, we can instead just use the <strong>ask-chatGPT-nicely</strong> method to refine a given translation with the desired terminology constraints. In fact, while we are at it, we decided to try and completely ditch the neural machine translation system and perform both translation and refinement using chatGPT.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/chatGPT.jpg" alt="Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting">
<center><font size="-2">State of NLP in 2023: All pray for solutions to our lord and saviour ChatGPT!</font></center>
<p>The process goes like this:</p>
<ul>
<li>Produce translation (either through our terminology-aware system, with terminology constraints, or through asking chatGPT)</li>
<li>Ask chatGPT to refine the translation, incorporating terminology constraints.</li>
</ul>
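<p>As an illustration only (the exact prompts used in the paper differ), the refinement step of this two-step process boils down to a prompt along these lines:</p>
<pre><code class="language-python">def refinement_prompt(source, draft, terminology):
    # terminology: dict mapping source terms to required target terms (hypothetical example).
    constraints = '; '.join(f'translate {s} as {t}' for s, t in terminology.items())
    return ('Improve the following translation so that it follows these '
            f'terminology constraints: {constraints}.\n'
            f'Source: {source}\n'
            f'Draft translation: {draft}\n'
            f'Improved translation:')
</code></pre>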
<h3 id="diditwork">Did it work?</h3>
<p>Sort of. We (UEDIN) submitted 3 separate systems: a terminology-aware base system by itself, one enhanced with ChatGPT, and another enhanced with negative constraints. Our terminology-aware system, by itself or with chatGPT refinement, produced the best tradeoff between following terminology and translation quality compared to competing systems, at least according to <a href="https://github.com/Unbabel/COMET" target="_blank">comet</a> automatic evaluation:</p>
<table>
<thead>
<tr>
<th></th>
<th>De-&gt;En</th>
<th>En-&gt;Cs</th>
<th>Zh-&gt;En</th>
</tr>
</thead>
<tbody>
<tr>
<td>UEDIN_LLM</td>
<td><strong>0.813</strong></td>
<td><strong>0.869</strong></td>
<td><strong>0.757</strong></td>
</tr>
<tr>
<td>UEDIN_TERM</td>
<td>0.809</td>
<td><strong>0.868</strong></td>
<td>0.757</td>
</tr>
<tr>
<td>OPUS-CAT</td>
<td>0.790</td>
<td><strong>0.869</strong></td>
<td>0.521</td>
</tr>
<tr>
<td>AdaptTerm</td>
<td>0.801</td>
<td>0.841</td>
<td>0.688</td>
</tr>
<tr>
<td>UEDIN_CONSTRAINT</td>
<td>0.792</td>
<td>0.835</td>
<td>0.650</td>
</tr>
<tr>
<td>LinguaCustodia</td>
<td>0.735</td>
<td>0.834</td>
<td>0.609</td>
</tr>
<tr>
<td>VARCO-MTTSSNMT</td>
<td></td>
<td></td>
<td>0.755</td>
</tr>
<tr>
<td>BJTU-LB</td>
<td></td>
<td></td>
<td>0.751</td>
</tr>
<tr>
<td>VARCO-MTForceGen</td>
<td></td>
<td></td>
<td>0.715</td>
</tr>
</tbody>
</table>
<center><font size="-1">COMET-DA22 scores for all systems participating in the shared task, illustrating tradeoff between terminology faithfulness and translation quality. Higher is better.</font></center>
<p>Our negatively constrained translation performed rather poorly: just because we prevent the model from making one mistake, this doesn't mean it wouldn't make another one. Using ChatGPT to translate and then to perform terminology refinement produced the best translation quality/terminology tradeoff, but this is not a surprise, since it's an unconstrained system. Our terminology-aware translation system did well, losing only to ChatGPT.</p>
<p>We have a lot more details in our <a href="https://arxiv.org/abs/2310.05824" target="_blank">paper</a>, so please check it out! We worked really hard for it 🥹! You should also check out the <a href="https://wmt-terminology-task.github.io/wmt_terminology_2023.pdf" target="_blank">shared task overview paper</a>. Don't forget to cite us!</p>
<pre><code class="language-bibtex">@inproceedings{bogoychev2023terminologyaware,
      title={Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting}, 
      author={Nikolay Bogoychev and Pinzhen Chen},
      booktitle = &quot;Proceedings of the Eighth Conference on Machine Translation (WMT)&quot;,
    month = dec,
    year = &quot;2023&quot;,
    publisher = &quot;Association for Computational Linguistics&quot;,
}
</code></pre>
<p>Thanks,<br>
Nick and Pinzhen</p>
<p><font size="-4"><center>Image sources: <a href="https://google.com" target="_blank">Google</a> <a href="https://www.deviantart.com/ludoxei/art/Red-Alert-2-Yuri-Prime-879483231" target="_blank">Deviant Art</a> <a href="https://pixabay.com/photos/dictionary-words-grammar-abc-390055/" target="_blank">pixabay</a> <a href="https://www.pexels.com/photo/the-adiyogi-statue-in-coimbatore-india-13041184/" target="_blank">pexels</a>  </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[OpusCleaner and OpusTrainer: Machine translation training made easy]]></title><description><![CDATA[From ancient times, parallel text and (human) neural networks have been at the heart of translation. Let's see how we make machine translation easy and intuitive...]]></description><link>https://nbogoychev.com/opuscleaner-and-opustrainer-machine-translation-training-made-easy/</link><guid isPermaLink="false">655f9e0b369afdcac26f758d</guid><category><![CDATA[code]]></category><category><![CDATA[machine translation]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Tue, 28 Nov 2023 12:50:44 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2023/11/rosetta_stone-1.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2023/11/rosetta_stone-1.jpg" alt="OpusCleaner and OpusTrainer: Machine translation training made easy"><p>One of the big challenges that I have had to tackle as a <em>senior</em> (at least in theory) machine translation researcher is how to explain to novices in the field what makes a good machine translation system.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2023/11/confused.jpg" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<center><font size="-2">It's quite terrifying to think that I am supposed to actually know how to do this.</font></center>
<p>It's quite difficult to answer this question because there's no one single correct answer. The process is very long with lots of <em>if-then-else</em> decisions that need to be taken which makes it unnecessarily confusing to newcomers. Let's illustrate the basic process:</p>
<ol>
<li>Get parallel data<br>
...</li>
<li>Train Neural Network<br>
...</li>
<li>Profit</li>
</ol>
<h2 id="paralleldata">Parallel data</h2>
<p>Unfortunately parallel data comes in many different shapes and forms. Every publicly available corpus has its own idiosyncrasies and requires a personalised cleaning approach. To give some examples:</p>
<ul>
<li>A lot of UN corpora have a comma at the line ending, as opposed to a full stop.</li>
<li>Some corpora don't use French quotes (« ») when translating from English.</li>
<li>Some Chinese corpora come tokenized, some don't.</li>
<li>Some corpora have the translation direction reversed.</li>
<li>... Probably something else we have forgotten...</li>
</ul>
<p>In order to do parallel data preprocessing right, we need to manually open and inspect each parallel corpus, see what is wrong with it, write a small script to fix it and move to the next one...</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2023/11/whack-a-mole.jpg" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<center><font size="-2">A data engineer furiously cleaning data.</font></center>
<p><em>Tedious</em> is one word that comes to mind when describing the process, especially considering that there are always dozens of distinct data sources for each language pair. And nobody wants to do something <em>tedious</em>.</p>
<h2 id="training">Training</h2>
<p>Assuming we somehow survived the data cleaning process, we are now faced with the equally daunting task of training a neural network. Easy, right?</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2023/11/wrong.jpg" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<center><font size="-2">If it were only so simple...</font></center>
<p>Neural networks are notoriously fickle and break if you so much as sneeze on them. Not fun. The problem is that all those data sources we talked about previously will differ in quality, no matter how hard we try to clean them. And it just so happens that neural networks like to see really simple data when they start training, and move on to more challenging data (or perhaps even noisy web data) later in the training process. We may also want to:</p>
<ul>
<li>Incorporate backtranslation pretraining, before we start training on the whole data.</li>
<li>Perform domain adaptation using in-domain corpora towards the end of the training.</li>
<li>Balance the mix of web and human curated data as the former exceeds the latter by a factor of 10 at least.</li>
</ul>
<p>All of those issues require stopping-and-starting the training process and swapping around the training data sources; merging different data sources together, with different sample ratios (and if you get those wrong, then you have to re-sample, re-balance and redo everything). What's the word? <em>Tedium</em>.</p>
<h2 id="humans">Humans</h2>
<p>Finally, we built our nice little machine translation system, and it is time for it to face the ultimate challenge: <strong>End users</strong>. You build your lovely machine translation system and you give it to your user and what do they do with it? THEY TRY TO TRANSLATE ALL CAPS TEXT. Why is this a problem? Well, we don't have that much ALL CAPS text in our training data. Our neural network wouldn't know how to process it, so it will produce crap.</p>
<p>Another common use case is translating text that contains untranslatable characters. Those can be emoji 😉, or <a href="https://en.wikipedia.org/wiki/Quran" target="_blank">Wikipedia articles:</a></p>
<blockquote>
<p>The Quran (/kʊrˈɑːn/, kuurr-AHN;[i] vocalized Arabic: اَلْقُرْآنُ‎, Quranic Arabic: ٱلۡقُرۡءَانُ‎ al-Qurʾān [alqurˈʔaːn],</p>
</blockquote>
<center><font size="-2">The translation system's worst nightmare.</font></center>
<p>The reason why emoji or text in foreign script often breaks translation systems is that it has not been seen during training. Sentences with large amounts of foreign text are filtered away from the training data as noisy. So how are we going to get the model to reproduce them?</p>
<p>The best way to do this is to ensure our training data contains lots of examples of this sort, so that the neural network easily learns how to reproduce them. But how exactly do we do that? We can sprinkle emoji at random, but that's not really a good solution if the data is fixed and the same emoji always appear in the same sentences. The neural network will just learn to anticipate the sentences containing emoji and not really learn to properly copy them... Ideally we want every iteration of our data to have some sentences that include emoji, but <em>different ones every time</em>. If we use a static data source, we need to replicate it many times just so we can have our neural network see different sets of noise at each iteration... <em>Tedious</em></p>
<h2 id="thesolution">The solution:</h2>
<p>In order to solve all of those issues we created a set of open source tools <a href="https://github.com/hplt-project/OpusCleaner" target="_blank">OpusCleaner</a> and <a href="https://github.com/hplt-project/OpusTrainer" target="_blank">OpusTrainer</a>, which aim to streamline the process, remove the <em>tedium</em> and solve the aforementioned (and many many other) issues.</p>
<h3 id="opuscleaner">OpusCleaner</h3>
<p>OpusCleaner is a streamlined data-processing and visualisation toolkit that performs all the data cleaning tasks through a visual interface, minimising the number of clicks the user has to perform. First, we provide a one-stop-one-click data downloader so we can easily fetch all training data for a given language pair.</p>
<img align="middle" width="90%" src="https://nbogoychev.com/content/images/2023/11/dataset_search.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>Then, for each dataset, we can visualise a random sample from it and perform drag-and-drop chaining of various different filters that will fix any issues we notice. Such as wrong language used:</p>
<img align="middle" width="90%" src="https://nbogoychev.com/content/images/2023/11/data_cleaning_screen.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>We can Apply <strong>fastText langID</strong> and voilà, suddenly a lot of lines from our sample are filtered out:</p>
<img align="middle" width="90%" src="https://nbogoychev.com/content/images/2023/11/post_fasttext.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>Presumably the ones that needed filtering, I hope. We can chain many different filters (on the right-hand side) and see how the sample changes after applying each of them:</p>
<img align="middle" width="90%" src="https://nbogoychev.com/content/images/2023/11/filter_view.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>Finally, you can assign a human-readable label to each dataset, and apply the filtering pipeline you have just defined to the full dataset:</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2023/11/data_tayloring.png" alt="OpusCleaner and OpusTrainer: Machine translation training made easy">
<p>Tadaaa! OpusCleaner is designed to save time and to turn data exploration and visualisation into an implicit step of the data cleaning process. This teaches new users important practical data skills and saves everyone tons of time!</p>
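<p>Under the hood, a filter like the langID one is conceptually just a line-by-line predicate over the parallel corpus. Here is a minimal sketch of that idea, assuming the <code>fasttext</code> Python package and a downloaded language-identification model such as <code>lid.176.bin</code>; OpusCleaner's actual filters are standalone scripts chained through the UI:</p>
<pre><code class="language-python">import fasttext

model = fasttext.load_model('lid.176.bin')

def keep(src_line, trg_line, src_lang='de', trg_lang='en'):
    # Keep the pair only if both sides are detected as the expected language.
    src_label = model.predict(src_line.strip())[0][0]
    trg_label = model.predict(trg_line.strip())[0][0]
    return src_label == f'__label__{src_lang}' and trg_label == f'__label__{trg_lang}'
</code></pre>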
<h3 id="opustrainer">OpusTrainer</h3>
<p>OpusTrainer is a training set generator/augmenter. It is designed to provide a fast and easy way to define and produce different data mixes and augmentation using a <code>yaml</code> configuration file. We can define a rigorous training schedule, including pretraining on backtranslation, data mix ratios, and then have OpusTrainer generate that data and feed it to <em>stdin</em> of a neural network toolkit, or alternatively write it to a file:</p>
<pre><code class="language-yml">datasets:
  bt: bt.gz # 12.4 GB
  clean: clean.gz # 2.4 GB
  medium: medium.gz # 1.8 GB
  dirty: dirty.gz # 33 GB

stages:
  - start
  - mid
  - end

start:
  - bt 0.9
  - clean 0.1
  - medium 0
  - dirty 0
  - until bt 1 # Until 1 epoch of bt

mid:
  - clean 0.45
  - medium 0.25
  - bt 0.1
  - dirty 0.2
  - until clean 1

end:
  - clean 0.25
  - medium 0.25
  - bt 0.1
  - dirty 0.4
  - until dirty inf

seed: 1111
</code></pre>
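<p>Conceptually, the mixing ratios in each stage simply mean that every training line is drawn from one of the datasets with the configured probability. Below is a toy sketch of that sampling loop; it is not the actual OpusTrainer code, which also handles shuffling, epochs, restarts and the <code>until</code> condition that ends a stage:</p>
<pre><code class="language-python">import gzip
import random

def sample_stage(ratios, seed=1111):
    # ratios, e.g.: {'clean': 0.45, 'medium': 0.25, 'bt': 0.1, 'dirty': 0.2}
    rng = random.Random(seed)
    handles = {name: gzip.open(f'{name}.gz', 'rt') for name in ratios}
    names = list(ratios)
    weights = [ratios[name] for name in names]
    while True:
        line = handles[rng.choices(names, weights=weights)[0]].readline()
        if line:
            yield line.rstrip('\n')
        # A real trainer would reshuffle and reopen exhausted datasets here.
</code></pre>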
<p>More importantly, we can define <em>data modifiers</em> that augment the training set on the fly with UpperCase/TitleCase text, typos, Unicode noise (such as emoji/Greek/Chinese/any-random-script text), and more. And the best part is that we only need to write a few lines of yaml to achieve this:</p>
<pre><code class="language-yml">modifiers:
  - UpperCase: 0.05 # Apply uppercase randomly to 5% of sentences. See below
  - TitleCase: 0.05
  - Typos: 0.05
  - Tags: 0.005 # This introduces emoji/foreign text tokens
    augment: 1
  - Noise: 0.0005 # This introduces foreign text full sentences
    min_word_length: 2 # Minimum length of each fake word, default is 2 if left blank
    max_word_length: 5 # Maximum length of each fake word, default is 5 if left blank
    max_words: 4       # Maximum number of fake words, default is 4 if left blank
</code></pre>
<p>An example French-English sentence pair from our data augmenter looks like this:</p>
<blockquote>
<p>On a connu 🙁 😬 😭 la suite ! ↔ We know 🙁 😬 😭 the rest!</p>
</blockquote>
<p>It looks a bit silly, but it gives our neural network an important signal that when it sees something silly, it should just copy it to the output and not think about it too hard :D.</p>
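<p>For illustration, a toy version of this kind of on-the-fly augmentation might look like the sketch below. The actual Tags/Noise modifiers in OpusTrainer are more general; this only shows the idea of injecting the same random noise into both sides, differently on every pass over the data:</p>
<pre><code class="language-python">import random

EMOJI = ['🙁', '😬', '😭', '😉', '🕌']

def insert_at(sentence, noise, position):
    words = sentence.split()
    return ' '.join(words[:position] + [noise] + words[position:])

def augment_pair(src, trg, probability=0.005):
    # With a small probability, inject identical emoji into both sides so the
    # model learns to copy unknown symbols to the output.
    if random.random() &lt; probability:
        noise = ' '.join(random.sample(EMOJI, k=3))
        src = insert_at(src, noise, random.randint(0, len(src.split())))
        trg = insert_at(trg, noise, random.randint(0, len(trg.split())))
    return src, trg
</code></pre>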
<ul>
<li>OpusTrainer augmentation allows us to get up to <strong>92%</strong> accuracy on copying noisy foreign text in our translation systems, up from just <strong>55%</strong> in the baseline, as described in my <a href="https://mtm23.cs.ut.ee/wp-content/uploads/2023/09/Nikolay_Bogoychev_Robustness.pdf" target="_blank">talk</a> on Robust Machine Translation, at the <a href="https://mtm23.cs.ut.ee/index.php/programme/" target="_blank">2023 MT Marathon</a>.</li>
</ul>
<p>A good example is attempting to translate the first sentence of the French <a href="https://fr.wikipedia.org/wiki/Coran" target="_blank">Wikipedia</a> article about the Quran. The sentence:</p>
<blockquote>
<p>Le Coran (en arabe : القُرْآن, al-Qurʾān?, « la récitation ») est le texte sacré de 🕌 l'islam.</p>
</blockquote>
<p>receives a somewhat lackluster translation due to the model's inability to cope with out-of-vocabulary characters.</p>
<blockquote>
<p>The Qur'an (in Arabic: ااااااااااااااااااااااااااااااااااا, al-Qurاān?, &quot;recitation&quot;) is the sacred text of u Islam.</p>
</blockquote>
<p>But after applying OpusTrainer's UTF-8 noise augmentation we get a significant improvement:</p>
<blockquote>
<p>The Qur’an (Arabic: القُرْآن, al-Qurēn?, “recitation”) is the sacred text of 🕌 Islam.</p>
</blockquote>
<ul>
<li>OpusTrainer augmentation allows for producing high-quality terminology-aware systems, such as the one described in <a href="https://arxiv.org/abs/2310.05824" target="_blank">Bogoychev and Chen (2023)</a>. Terminology-aware systems allow us to force certain words to be translated in a particular way, overriding what the system thought would be best. For example, translating this German sentence into English yields a reasonable translation:</li>
</ul>
<blockquote>
<p>Was ist Ihr Herkunftsland?<br>
What is your country of origin?</p>
</blockquote>
<p>However, using a terminology aware system, we can apply a terminology constraint and force the word <strong>homeland</strong> to appear in the translation.</p>
<blockquote>
<p>Was ist Ihr Herkunftsland __target__ homeland __done__?<br>
What is your homeland?</p>
</blockquote>
<p>A training configuration example for a terminology-aware system is available on <a href="https://github.com/hplt-project/OpusTrainer/blob/main/contrib/test_enzh_config.yml" target="_blank">GitHub</a>.</p>
<!--kg-card-end: markdown--><hr><!--kg-card-begin: markdown--><p>This project has been the result of a large collaboration by the consortium of the <a href="https://hplt-project.org/" target="_blank">HPLT</a> project. Our goal is to make it easier for anyone to build high-quality machine translation systems by creating a robust and mature data cleaner and data scheduler. Come and try it out! For questions specific to either <a href="https://github.com/hplt-project/OpusCleaner" target="_blank">OpusCleaner</a> or <a href="https://github.com/hplt-project/OpusTrainer" target="_blank">OpusTrainer</a>, open a GitHub issue!</p>
<p>For more details, please refer to the <a href="https://arxiv.org/abs/2311.14838" target="_blank">paper</a>. Also, don't forget to cite us:</p>
<pre><code class="language-bibtex">@misc{bogoychev2023opuscleaner,
      title={OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models}, 
      author={Nikolay Bogoychev and Jelmer van der Linde and Graeme Nail and Barry Haddow and Jaume Zaragoza-Bernabeu and Gema Ramírez-Sánchez and Lukas Weymann and Tudor Nicolae Mateiu and Jindřich Helcl and Mikko Aulamo},
      year={2023},
      eprint={2311.14838},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
</code></pre>
<p>Thanks,<br>
Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/photos/rosetta-stone-egypt-hieroglyphs-8298040/" target="_blank">pixabay</a> <a href="https://www.pexels.com/photo/man-wearing-black-and-white-stripe-shirt-looking-at-white-printer-papers-on-the-wall-212286/" target="_blank">pexels</a> <a href=" https://www.flickr.com/photos/tpapi/2765541278/" target="_blank">flickr</a> <a href="https://pix4free.org/photo/16605/wrong.html" target="_blank">pix4free</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hanzi of the day! 血 and 皿]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Thanks to me unexpectedly* contracting Covid on my trip abroad and having my holiday reduced to a confinement in a very expensive hotel on the equator, <em>Hanzi of the day</em> is back with a new edition! Today's topic is a bit biblical in nature as it includes <em>chalices</em> filled with</p>]]></description><link>https://nbogoychev.com/hanzi-of-the-day-9/</link><guid isPermaLink="false">5c98c956574e0507ac3c4cf4</guid><category><![CDATA[hanzi]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Tue, 22 Nov 2022 11:35:51 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2022/11/cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2022/11/cover.jpg" alt="Hanzi of the day! 血 and 皿"><p>Thanks to me unexpectedly* contracting Covid on my trip abroad and having my holiday reduced to a confinement in a very expensive hotel on the equator, <em>Hanzi of the day</em> is back with a new edition! Today's topic is a bit biblical in nature as it includes <em>chalices</em> filled with <em>blood</em>.</p>
<p>
<font size="-1">* Not really unexpectedly, as my astronomically bad luck has recently raised a dispute among senior statistitians regarding the notion of <i>Independence</i> in probability theory.</font></p>
<img align="middle" width="20%" src="https://nbogoychev.com/content/images/2022/11/blood_chalice.png" alt="Hanzi of the day! 血 and 皿">
<center><font size="-2">“This cup is the new covenant in my blood, which is poured out for you” (Luke 22:20)"</font></center>
<p>Today's character <strong>blood</strong> <a href="https://en.wiktionary.org/wiki/血" target="_blank">血</a> <em>xuè</em> is actually just a <strong>chalice</strong> <a href="https://en.wiktionary.org/wiki/皿" target="_blank">皿</a> <em>mǐn</em> with a single drop of animal blood in it. Originally the logogram 血 depicted a bronze container with animal blood used in ritual sacrifice. The two characters have never lost their connection through the centuries and have evolved in parallel:</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/11/evolution.png" alt="Hanzi of the day! 血 and 皿">
<center><font size="-2">I wish I could say no animals were harmed in the making of these characters, but the leftmost script is <a href="https://en.wikipedia.org/wiki/Oracle_bone_script" target="_blank">oracle bone script</a>, and no, it's not called like that because of oracles using their own bones.</font></center>
<p>The concept of animal sacrifice is pretty much universal across cultures. Evidence of it exists in:</p>
<ul>
<li>Prehistoric <a href="https://en.wikipedia.org/wiki/Animal_sacrifice#Prehistory" target="_blank">Ancient Egypt</a>.</li>
<li>It features prominently in all of the <a href="https://en.wikipedia.org/wiki/Animal_sacrifice#Abrahamic_traditions" target="_blank">Abrahamic religions</a>.</li>
<li>It's common across the vast majority of pagan religions (For example the <a href="https://en.wikipedia.org/wiki/Animal_sacrifice#Celtic_peoples" target="_blank">Celtic people</a>).</li>
<li>And of course, it appears in <a href="https://en.wikipedia.org/wiki/Animal_sacrifice#Han_Chinese" target="_blank">Ancient China</a>, where the value of each animal sacrifice was formalised in a strict hierarchical structure. Thanks, Confucius, really not sure what we would've done without it.</li>
</ul>
<p>Now, the best part of it all is that nowadays we do have a mass produced 3D version of the 血 character, which conveniently serves as a mnemonic for Chinese learners across the globe:</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/11/diva_cup.jpg" alt="Hanzi of the day! 血 and 皿">
<center><font size="-2">That's what those are for, right?</font></center>
<p>Don't catch Covid on your holidays... Or at all.</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://www.pexels.com/photo/woman-in-gray-dress-holding-drinking-glass-5554406/" target="_blank">pexels</a> <a href=" https://pixabay.com/illustrations/chalice-goblet-grail-cup-wine-4862960/" target="_blank">pixabay</a> <a href="https://en.wiktionary.org/wiki/%E7%9A%BF" target="_blank">wiktionary</a> <a href="https://en.wiktionary.org/wiki/%E8%A1%80" target="_blank">wiktionary</a> <a href="https://www.pexels.com/photo/a-hand-holding-a-menstrual-cup-7692105/" target="_blank">pexels</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Concussions suck: My story]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>15 weeks ago I went for a small ski holiday in my home country that replaced my boring workaholic life with a vivid fever dream. This is my concussion story.</p>
<h2 id="thepledge">The pledge</h2>
<p>I went for a ski holiday in Bulgaria with a few buddies back in the beginning of March.</p>]]></description><link>https://nbogoychev.com/concussions-suck-my-story/</link><guid isPermaLink="false">62ae021633c67f066d8b12b4</guid><category><![CDATA[life]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Mon, 20 Jun 2022 13:56:16 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2022/06/brain_small.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2022/06/brain_small.jpg" alt="Concussions suck: My story"><p>15 weeks ago I went for a small ski holiday in my home country that replaced my boring workaholic life with a vivid fever dream. This is my concussion story.</p>
<h2 id="thepledge">The pledge</h2>
<p>I went for a ski holiday in Bulgaria with a few buddies back in the beginning of March. It was a long weekend which meant that the ski slope was overly saturated with eager skiers and not particularly well maintained.</p>
<p>To avoid overcrowding, we took to the slopes early in the morning and had an hour or so of quite nice, relaxing skiing. Inevitably though, the clock hit 10:30 AM and, after arriving at the lifts with my friend, we both groaned as we saw the massive queue. We were about 20 meters away from the queue and slowing down. I turned to my friend in exasperation, got distracted and lost control of my skis: my left ski ran into a snow pile, while my right ski continued on, which resulted in me turning around 180 degrees.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/ski-injury.com_phantom_foot_photo.gif" alt="Concussions suck: My story">
<center><font size="-2">Oh, no, no, no...</font></center>
<p>And then disaster struck: both of my skis dug into the ground, which meant that I came to an abrupt stop, but my body was still carrying forward momentum. That momentum had to go somewhere, which meant that my skis lifted up a bit and then got wedged quite firmly in the ground. At that point, the momentum of my body was firmly pointing downwards, which meant that I was slammed, back first, into the ground, whiplashing my head into the snow.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/beginners.jpg" alt="Concussions suck: My story">
<center><font size="-2">Moments before disaster struck.</font></center>
<p>Ouch. I looked at the sky briefly and I thanked my past self for having the foresight to wear a helmet. I immediately developed a slight headache and my mouth started salivating a bit. <em>A micro concussion</em> I thought, as I slowly stood up. My friend came to a stop and asked me the typical concussion questions, such as, do I know who I am, where I am, etc. Since my memory was intact, and I felt clear-headed, I continued skiing for the rest of the day, albeit taking it easy.</p>
<p>We finished the day at about 3 PM, I went back to the hotel, took a painkiller for my back (unrelated injury), and went about my day. Had dinner, then went to the karaoke bar. The headache disappeared and I was feeling quite good; I even sang <em>Let it Go!</em> and <em>Dancing Queen</em>. I was doing great and I thought the accident would just be a bad memory in the morning.</p>
<h2 id="theturn">The turn</h2>
<p>Late at night, I went back to my room, pulled my laptop out to do some work, just to reinforce the stereotype that academics never go on holiday and... Oh my god, I just <strong>couldn't look at the screen</strong>. It's like looking at the computer screen gave me a strong buzzing in my head and I started going dizzy. I closed the laptop and went to bed, now slightly worried. <em>Ok, I definitely have a concussion</em>. Oh boy, little did I know...</p>
<p>I woke up next morning, and I took my phone to look at the time. The buzzing in my head returned as soon as I looked at the small screen. <em>I guess I am not skiing today</em>. A friend of mine drove me home. My condition deteriorated noticeably in the next two days, just as steadily as my terror grew.</p>
<h4 id="24hoursafterimpact">24 hours after impact</h4>
<ul>
<li>Light started bothering me, like, a lot. I had to be in a dark room all the time.</li>
<li>Things were... Blurry. Fast moving objects, especially at night, looked a bit weird.</li>
</ul>
<h4 id="48hoursafterimpact">48 hours after impact</h4>
<ul>
<li>Reading became difficult and quite tiring.</li>
<li>The left part of my face went numb.</li>
<li>Taking walks made me dizzy.</li>
<li>Talking about anything complicated made me dizzy.</li>
<li>I needed a nap after literally any simple task, such as chopping vegetables for a meal.</li>
<li>I couldn't follow a conversation with more than one person.</li>
</ul>
<h4 id="72hoursafterimpact">72 hours after impact</h4>
<ul>
<li>I noticed that my world had started shaking. With every heartbeat, my whole field of view would <em>jump</em>, just a little, but quite noticeably.</li>
<li>Closing my eyes didn't stop the flickering. It persisted until I fell asleep.</li>
</ul>
<p>At this point I was seriously panicking. I couldn't do any work, hell, I couldn't do almost anything. Essentially anything that moved made me dizzy. I couldn't look at people's faces when they spoke to me, as it made me dizzy. At this point, I went to see the doctor, I told her about my symptoms and she did the finger test. <strong>I could not follow the fuckin finger</strong>. I asked my doctor <em>Aren't you moving the finger a bit too fast?</em> No, she said, with a slight tremble in her voice. <em>Go and see a neurologist, now!</em></p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/examination_small.jpg" alt="Concussions suck: My story">
<center><font size="-2">I never knew how hard it is to pass this examination.</font></center>
<p>My father helped me book a neurologist appointment for the next morning and I went to sleep at about 9 PM.</p>
<p><em>I turn and toss in my dreams, having violent nightmares. I wake up sweating and think that maybe the whole thing was just a dream. I look at my phone to check the time and the buzzing in my head returns. No, this fever dream of mine is real. I fall asleep again...</em></p>
<h2 id="theprestige">The prestige</h2>
<p>The next morning, I headed for my first ever CT scan. Admittedly, all the other people there looked in a much worse shape than me and surely my injury wasn't that bad. The doctor says I'm lucky and that there's no bleeding, or brain swelling, it's just a &quot;minor concussion&quot;. Only later did I find out that they refer to any injury in which your brain is not sticking out of your skull as a &quot;minor concussion&quot;.</p>
<p>Afterwards, I went to see a neurologist, who somewhat dismissively said it's <em>just a concussion</em> and prescribed me generic over-the-counter brain supplements and a week of intravenous <a href="https://en.wikipedia.org/wiki/Mannitol" target="_blank">Mannitol</a>.</p>
<p>Then I went to see an ophthalmologist for my visual symptoms. She formally diagnosed me with <a href="https://en.wikipedia.org/wiki/Nystagmus" target="_blank">Horizontal Nystagmus</a>, which is the inability of the eyes to focus on objects. This is essentially what prevents me from reading, or doing pretty much anything. She prescribed me a month's worth of travel sickness pills and told me to rest well and stay in a dark room until I recover. I was banned from any form of exercise that moves my head or raises my blood pressure, which basically left the stationary bike as the only choice.</p>
<p>Well done me. Took a 3 day holiday once a year and gave myself a concussion.</p>
<h1 id="aftermath">Aftermath</h1>
<p>At that point, I was somewhat relieved. After all, no brain swelling, no brain bleeding. I knew that most concussions resolve by themselves within two weeks, and after all at that point, I had no reason to think it would go any longer than that. Ah, such good are the times filled with hope.</p>
<p>I discovered several coping mechanisms that made my daily life easier.</p>
<ul>
<li>Take off my glasses. This myopia-enabled lifehack made sure that I walked around in a sea of blurriness and gave my brain a rest from any overly acute vision. This helped a ton in preventing dizziness.</li>
<li>Eliminate blue light from all my devices. Less pretty, but much easier on my vision.</li>
<li>If going for a walk with someone, close my eyes and ask them to lead me like a blind person. Much easier for me to go out that way rather than on my own.</li>
<li>Do not use any lights after dark, and stay in a dark room.</li>
</ul>
<h2 id="thedullness">The dullness</h2>
<p>Oh my god, I never realised how much boredom a person can feel. Things that I could do during the day were limited to:</p>
<ul>
<li>A short walk, accompanied by someone.</li>
<li>A phone call or two with a friend, but not too long.</li>
<li>A conversation with one person, but mostly in a dark room.</li>
<li>Cooking, albeit with naps in between steps.</li>
<li>An audiobook, albeit not more than 20 minutes or so.</li>
</ul>
<p>Everything left me utterly exhausted. Small tasks such as translating my 4-line sick note into English and sending it to HR required an hour-long nap afterwards to recover.</p>
<h2 id="thefriendsandfamily">The friends and family</h2>
<p>I was grateful to be surrounded by so much love from everywhere. As I was unable to read my messages, everyone promptly switched to sending me voice messages. People made sure to check on me regularly and to try to keep me entertained with stories from their lives. Some did medical research for me, others volunteered their own experience with concussions. Some visited me at my home and sat with me in darkness, while others accompanied me as a chaperone to important events that I could not put off, or to doctor's visits.</p>
<p>My supervisor went to HR and did all the annoying bureaucracy navigation so that I was promptly put on a sick leave and not bothered by anything. My PhD advisor came to see me. My coworkers and collaborators promptly took away all my duties so that I could focus on my recovery.</p>
<p>My family was available at all times to chat to me, or to drive me to my treatment, as necessary.</p>
<p>And finally, my online gaming buddies noticed my abrupt departure.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/itsgettinglate.jpg" alt="Concussions suck: My story">
<center><font size="-2">This scene does to this day give me the chills and brings tears into my eyes.</font></center>
<p>A few of them inquired about me and managed to obtain my phone number, so they could ask about my recovery. Whoever says that gamers are shallow and do not care about anything is so utterly wrong.</p>
<h2 id="thelongtwistedanduncertainroadtorecovery">The long, twisted and uncertain road to recovery</h2>
<p>In the first week or so, things got a bit better, most notably I was able to listen to audiobooks for hours at a time. I was convinced that I would be able to return to work after the two week period, but that turned out to not be the case.</p>
<h4 id="erroranderror">Error and error</h4>
<p>In the second week my recovery slowed down. I would wake up and attempt to do something and inevitably suffer for it.</p>
<p>The left side of my face was numb all the time, and it got worse if I strained myself, almost like a litmus test for what I was supposed to do and not do. Gradually I mapped for myself lists of activities that made things worse.</p>
<p>Rustling leaves, water reflections, busy streets or the sight of boiling water all made me incredibly dizzy. My eyes could not cope with multiple moving objects at once. I walked almost everywhere without my glasses.</p>
<p>Sudden head movements, such as tossing in my sleep, skipping a step on the stairs, or riding a bus that takes a sharp turn, resulted in days of headache and a noticeable worsening of my nystagmus.</p>
<p>Any amount of screen time felt bad. The small phone screen felt a lot more comfortable, but reading text with lines wider than 3 cm was impossible.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/cant_win.jpg" alt="Concussions suck: My story">
<center><font size="-2">Pretty much the only thing that felt good was lie in bed, in utter darkness and silence. Solitary confinement on the other hand is not exactly good for anyone's mental health.</font></center>
<p>This led me to scour the Internet for answers and tips, hoping to speed up my recovery.</p>
<h4 id="surelyicoulddosomething">Surely I could do something</h4>
<p>I mean, modern medicine is great, they can cure almost anything. Well, it's true, but concussions are very tricky for several reasons:</p>
<ol>
<li>It's difficult to study them. One could easily find unvaccinated volunteers to <a href="https://www.bbc.co.uk/news/health-60229388" target="_blank">infect with covid</a> in the name of science, but I can hardly imagine a bunch of people volunteering to be given concussions in a predictable and controlled environment.</li>
<li>Every person has slightly different brain chemistry, susceptibility, and reaction to head injury. Recovery timelines and treatments vary a lot from person to person and seem to be only loosely correlated with the severity of the blow.</li>
<li>80% of concussion cases in young, healthy individuals take <a href="https://www.ncbi.nlm.nih.gov/books/NBK185342/#:~:text=The%20committee%20offers%20the%20following,%2C%20months%2C%20or%20even%20years" target="_blank">under 2 weeks to recover</a>, leaving the other 20% not only unlucky, but also looking for help from doctors who aren't sure what to make of their symptoms. <em>It should have gone away by now, do you just want to try some more paracetamol?</em></li>
<li>Most concussion research, specialists and treatment centres are located in the USA. No idea why.</li>
</ol>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/football.jpg" alt="Concussions suck: My story">
<center><font size="-2">Not because of American Football.</font> <font size="-3"><i>Jared Wickerham/Getty Images</i></font></center>
<p>Surprisingly, the knowledge that most people recover a lot faster than me didn't make me feel better. Go figure.</p>
<p>Next, I decided to go on reddit and read about people's experience with concussion recovery, and that was what I would call an exercise in depression. The only people who would go on reddit to discuss their concussion are the ones who do not recover. Essentially I ended up obsessively reading about others' suffering and hopelessness, while worsening my symptoms. Not fun.</p>
<p>My doctor kept saying to just wait and wait until I felt better. No timeline for when that &quot;better&quot; was going to happen.</p>
<h4 id="thingsthathelp">Things that help</h4>
<p>Just as I stumbled on things that don't help, I also stumbled upon things that do help... Me. They might not help other people as concussions are very individual, but this is what helps me.</p>
<ul>
<li>Eliminate blue light from all your screens. It makes every single light-emitting device infinitely easier to use.</li>
<li>High refresh rate screens. I discovered this randomly by visiting a gaming buddy and noticing how his screens didn't bother me nearly as much as mine did; it turned out they were high refresh rate. I replaced my phone, laptop and desktop monitors and this allowed for significantly more screen time, although it still bothered me.</li>
<li>Eliminate screens altogether, as difficult as that might be. I seldom managed.</li>
<li>Eliminate stress. Easier said than done, but the first time when my face numbness abated was when I booked a sensory deprivation chamber and managed to let go of my thoughts.</li>
</ul>
<p><em>I stay up at night, wondering if I would ever be whole again. I fall asleep in the small hours of the day and wake up in panic. The left side of my face is ever so numb, something always feels wrong.</em></p>
<h2 id="thefollowingmonths">The following months</h2>
<p>The worst thing about concussion recovery is how non-linear it is. Some days are fine, some days are not. Some days, I can work for a few hours with constant, although not-getting-worse headache. Other days, I can not even look at a screen for 5 minutes.</p>
<p>I started getting gently into exercise and that was fine until it suddenly wasn't. It's very hard to know when you have overdone something, as you don't feel the effects immediately but in the next few hours, and you end up spending a week in bed. You panic and look for treatments online, but they all seem to be US based, have 3-4 month waiting times and come with a hefty price tag. You wonder whether it is worth booking one now, or whether you would get well on your own.</p>
<p>Your life essentially goes on a pause. Professional, personal, everything. It just stops. Your work output is close to zero and your social interactions are limited as you need to avoid crowds and noisy places.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2022/06/pause.jpg" alt="Concussions suck: My story">
<center><font size="-2">Except you don't know if there is a resume button or you have just <b>broken</b> everything.</font></center>
<h2 id="drugs">Drugs</h2>
<p>As a final bit, I will provide a list of supplements that I have been taking, hoping some of them are helping to improve my concussion. <strong>This is all research done by me, and not recommended by a doctor. I DO NOT RECOMMEND that you take any of those if you have a concussion, do so at your own risk. Consult a doctor, preferably a concussion specialist.</strong></p>
<ul>
<li>Turmeric &amp; Boswellia Serrata -&gt; Anti inflammatory, no brainer.</li>
<li>Creatine and Taurine -&gt; supposedly, they are necessary for neuron building</li>
<li>Omega 3, alpha-lipoic-acid -&gt; Generic brain supplements.</li>
<li>NAC (N-Acetyl Cysteine) -&gt; Supposedly has neuro protective effect, especially if taken right after concussion. Extensive <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3989181/" target="_blank">research</a> is done, with some human trials, but still very early stage.</li>
<li>Magnesium L-Threonate -&gt; An alternative form of magnesium, that passes the blood brain barrier. Helps some people with cluster headaches and is allegedly good for your brain.</li>
</ul>
<h2 id="thecurrentstate">The current state</h2>
<p>As of today, I'm entering week 15 of my concussion. I can mostly do non demanding computer work (hence this blog post), but coding is hit-or-miss. I struggle a lot with reading A4 length text or side-to-side text on wide screens, but it is getting better. My world flickers a bit every once in a while, especially if I overexert myself by reading, but it is also getting better.</p>
<p>I cannot watch any videos or TV series that have abrupt changes in the colour scheme (e.g., trailers jumping from one scene to another). Some sports, such as snooker or football, are OK, as they maintain a mostly static colour scheme.</p>
<p>Airports and noisy restaurants make me dizzy. I am the weird guy who goes to play pool in the bar with noise-cancelling headphones. Going on walks seems to be mostly fine, but on some days it's not.</p>
<p>I never realised before how much I love what I do. I am full of ideas, plans and projects that I will work on when I get better. I keep a list of cool coding projects to do, movies to watch, books to read and sports to play when I get better... And this is where the fear kicks in.</p>
<p>I am afraid to do almost anything. If I decide to jump, would I spend the next week in bed? Can I drive? Can I go swimming? Can I travel? Can I see a movie? I keep falling behind on work and missing opportunities. Will I ever go back to work? Will I have to worry about financial security? Will I ever play a video game again?</p>
<p>I <strong>hate</strong> living in fear all the time... But I am alive. Treatments for people that do not get better on their own exist and I can afford them should I need them. Concussions are a horrible thing to happen to anyone, but mine is not a severe case and I <strong>am</strong> getting better, albeit slowly. I have friends and family that support me and I am surrounded by care and affection.</p>
<p><b><a href="https://nbogoychev.com/concussions-still-suck-2-year-update/" target="_blank">Two years later update</a></b> <b><a href="https://nbogoychev.com/will-concussions-ever-stop-sucking/" target="_blank">Three years later update</a></b></p>
<p><font size="-4"><center>Image sources: <a href="https://www.istockphoto.com/photo/mri-brain-with-headache-gm938046810-256531635" target="_blank">imageStock</a> <a href="https://www.e4s.co.uk/docs/top-skiing-tips.htm" target="_blank">e4s</a> <a href="https://tangotribe.com/learning-how-to-fall/ski-injury-com_phantom_foot_photo/" target="_blank">tangotribe</a> <a href="https://www.istockphoto.com/photo/doctor-diagnosing-injured-woman-gm512217774-87057247" target="_blank">imageStock</a> <a href="https://www.reddit.com/r/gaming/comments/odwetf/see_you_soon/" target="_blank">reddit</a> <a href="https://pixabay.com/photos/right-left-a-notice-traffic-signs-2620946/" target="_blank">pixabay</a> <a href="https://www.npr.org/sections/health-shots/2013/01/31/170764982/are-nfl-football-hits-getting-harder-and-more-dangerous?t=1655653220169" target="_blank">npr</a> <a href="https://www.flickr.com/photos/armydre2008/3949049470" target="_blank">flickr</a></center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Efficient machine translation]]></title><description><![CDATA[When your hardware resources are deficient, you have no choice but to go efficient! In this tutorial we will learn how to produce efficient machine translation models, with Marian as our NMT engine of choice, but the methods can easily be transferred to other toolkits.]]></description><link>https://nbogoychev.com/efficient-machine-translation/</link><guid isPermaLink="false">6194d760552d66065a7def43</guid><category><![CDATA[code]]></category><category><![CDATA[tutorial]]></category><category><![CDATA[machine translation]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Sun, 21 Nov 2021 22:39:43 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2021/11/MT-Marathon.svg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><aside class="toc"></aside>
<!-- here's where it's at: https://ghost.org/docs/tutorials/adding-a-table-of-contents/ --><!--kg-card-end: html--><!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2021/11/MT-Marathon.svg" alt="Efficient machine translation"><p>The tutorial ran on 22.11.2021. A recording of the live session is available on <a href="https://youtu.be/mrC3YOW-NQs" target="_blank">Youtube</a>. Also, if you have any questions, don't hesitate to email me.</p>
<h1 id="prerequisites">Prerequisites</h1>
<p>Download and install <code>sacrebleu</code>:</p>
<pre><code class="language-bash">pip3 install sacrebleu
</code></pre>
<h2 id="gettingmarian">Getting marian</h2>
<p>Download and install <a href="https://marian-nmt.github.io" target="_blank">Marian</a>:</p>
<pre><code class="language-bash">git clone https://github.com/marian-nmt/marian-dev.git
cd marian-dev
mkdir build
cd build
cmake .. -DUSE_FBGEMM=ON -DCOMPILE_CUDA=OFF -DCOMPILE_CPU=ON
make -j4
</code></pre>
<p>Note that marian requires <code>intel-mkl</code> for CPU decoding and <code>CUDA</code> for GPU decoding. Please make sure that you have <code>MKL</code> installed on your local machine before compiling. The package name could differ between distros. For Ubuntu 20.04 please install <a href="https://packages.ubuntu.com/focal/intel-mkl" target="_blank">this</a>. Alternatively, you can do this to install it directly on Ubuntu 16.04 or newer:</p>
<pre><code class="language-bash">wget -qO- 'https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB' | sudo apt-key add -
sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main &gt; /etc/apt/sources.list.d/intel-mkl.list'
sudo apt-get update
sudo apt-get install intel-mkl-64bit-2020.0-088
</code></pre>
<p>More details on the whole Marian and MKL issue are listed on <a href="https://marian-nmt.github.io/docs/#cpu-version" target="_blank">Marian's website</a>.<br>
If you have nvidia GPU and <code>CUDA</code> installed on your local system, you can switch the <code>COMPILE_CUDA</code> flag to ON.</p>
<p>Marian should compile cleanly on Linux, WSL and Mac. For Mac users, you want to set up a dev environment and use the built in Apple accelerate framework by providing the cmake flag <code>-DUSE_APPLE_ACCELERATE=ON</code>.</p>
<h2 id="gettingthedata">Getting the data</h2>
<p>Download and extract the test models tarball.</p>
<pre><code class="language-bash">wget http://data.statmt.org/nbogoych/mt_marathon.tar.gz
tar -xvf mt_marathon.tar.gz
cd mt_marathon
</code></pre>
<p>Note that <code>data.statmt.org</code> might be slow to respond due to the number of concurrent users, so you could try instead this mirror link:</p>
<pre><code class="language-bash">wget https://nbogoychev.com/files/mt_marathon.tar.gz
</code></pre>
<p>Now your directory structure should look like this:</p>
<pre><code class="language-bash">$ tree
.
├── enes.student.tiny11
│   ├── basic.translation.sh
│   ├── batched.shortlisted.8bit.translation.sh
│   ├── batched.shortlisted.translation.sh
│   ├── batched.translation.sh
│   ├── config.batched.shortlisted.8bit.yml
│   ├── config.batched.shortlisted.yml
│   ├── config.batched.yml
│   ├── config.yml
│   ├── lex.s2t.bin
│   ├── lex.s2t.gz
│   ├── model.intgemm8.bin
│   ├── model.npz
│   └── vocab.esen.spm
└── enes.teacher.bigx2
    ├── basic.translation.sh
    ├── batched.translation.sh
    ├── config.batched.yml
    ├── config.yml
    ├── model1.npz
    ├── model2.npz
    └── vocab.esen.spm
</code></pre>
<h1 id="mtdecoding">MT decoding</h1>
<p>In this section we will gradually try different marian settings and models, starting from the slowest and progressing to the fastest.</p>
<h2 id="theteachermodel">The teacher model</h2>
<p>The teacher model refers to the highest quality system that you train for any translation task. This is the system trained with &quot;all bells and whistles&quot;, but unfortunately it is also quite slow. We will also talk about training it later on.</p>
<p>In our system, the teacher model is an ensemble of 2x transformer-big for English-Spanish.</p>
<p>This is a fairly big model and I do not recommend that you try to run it now during the tutorial. You should definitely run it later on, on a cluster just so that you see how much time it takes to translate. I will publish the results here for test runs on my machine. The basic configuration of the teacher model is found by looking through <code>config.yml</code> file and the <code>basic.translation.sh</code> script. I will summarize and explain it here.</p>
<pre><code class="language-bash">$cat config.yml 
relative-paths: true
models: # Selects models for ensembling
  - model1.npz
  - model2.npz
vocabs: # Selects vocabulary for each model
  - vocab.esen.spm
  - vocab.esen.spm
beam-size: 4 # beam search size
normalize: 1 # length normalisation
word-penalty: 0 # length penalty
</code></pre>
<p>And the script:</p>
<pre><code class="language-bash">cat basic.translation.sh 
#!/usr/bin/env bash

MARIAN=../../marian-dev/build # Path to your marian installation

SRC=en
TRG=es

mkdir -p basic

sacrebleu -t wmt13 -l $SRC-$TRG --echo src &gt; basic/newstest2013.$SRC # get the test set using sacrebleu


# Call marian
echo &quot;### Translating wmt13 $SRC-$TRG on CPU. Extra flags $@&quot;
$MARIAN/marian-decoder -c config.yml $@ \
    --quiet --quiet-translation --log basic/gpu.newstest2013.log \
    -i basic/newstest2013.$SRC -o basic/basic.newstest2013.$TRG

# Print the time it took for translation, and the BLEU scores
tail -n1 basic/gpu.newstest2013.log
sacrebleu -t wmt13 -l $SRC-$TRG &lt; basic/basic.newstest2013.$TRG | tee basic/basic.newstest2013.$TRG.bleu
</code></pre>
<h3 id="baselinesystem">Baseline system</h3>
<p>Run the baseline system, replacing <code>N</code> with the number of real cores your CPU has.</p>
<pre><code class="language-bash">$ cd enes.teacher.bigx2
$ ./basic.translation.sh --cpu-threads N
### Translating wmt13 en-es on CPU. Extra flags --cpu-threads 16
[2021-11-19 10:45:40] Total time: 4597.81777s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 36.5,
 ...
}
</code></pre>
<p>This script and all following scripts will output the translation time and the BLEU score on the test set. I have truncated the output in the interest of readability.</p>
<h3 id="minibatchsize">Mini batch size</h3>
<p>The problem with the baseline system is that we are only ever translating one sentence at a time, which makes our matrices tall and skinny, which is not something that modern hardware likes. Instead, we're now going to group sentences to be translated together. We do that by augmenting the configuration with the following options:</p>
<pre><code class="language-bash">mini-batch: 16 # Sentences to be translated together
maxi-batch: 100 # Look at the next 100 sentences to sort sentences with similar length
maxi-batch-sort: src # Sort by source sentence length
workspace: 4000 # Memory budget per worker.
</code></pre>
<p>You could also add those options directly to the Marian command in the script as I have done. Run it with:</p>
<pre><code class="language-bash">$ ./batched.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 11:24:43] Total time: 652.14863s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 36.5,
 ...
}
</code></pre>
<p>As you can see, specifying batching dramatically increases the translation speed.</p>
<h2 id="thestudentmodel">The student model</h2>
<p>The student model is what machine translation providers typically run in their production services. These models are highly optimised for speed, with architectures that typically replace the heavyweight attention in the decoder with something faster, such as the Simplified Simple Recurrent Unit (SSRU).</p>
<p>The student model is trained on the output of the teacher model (or models, in the case of ensembling). More details about the training and the exact specifics of the architecture will be given later. There are a couple of crucial things we need to know about decoding with student models:</p>
<ul>
<li>Beam search is unnecessary! Quality is the same regardless of the beam size.</li>
<li>No need for ensembling either.</li>
<li>The student model is tiny compared to the teacher: it has less than a tenth of the number of parameters!</li>
<li>For more information, check <a href="https://aclanthology.org/D16-1139/" target="_blank">this paper</a>.</li>
</ul>
<p>All of that allows the student to produce translations much faster, at a small cost in BLEU.</p>
<p>The config of a basic student model is similar to that of the teacher, except that the beam size is set to one. Furthermore, since the beam size is one, we don't actually need to compute the expensive softmax; we can instead just take an argmax:</p>
<pre><code class="language-bash">$ cd enes.student.tiny11/
$ cat config.yml 
relative-paths: true
models:
  - model.npz
vocabs:
  - vocab.esen.spm
  - vocab.esen.spm
beam-size: 1 # Beam size to one
normalize: 1.0
word-penalty: 0
max-length-factor: 2.0 # The target sentence shouldn't be longer than 2x the source sentence
skip-cost: true # Do not compute softmax but instead take argmax
</code></pre>
<h3 id="baseline">Baseline</h3>
<p>To run the baseline system, just do:</p>
<pre><code class="language-bash">$ cd ../enes.student.tiny11
$ ./basic.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:03:11] Total time: 84.66556s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ...
}
</code></pre>
<h3 id="minibatchsize">Mini batch size</h3>
<p>Just like in the teacher case, we can significantly improve translation time by specifying a larger mini-batch size.</p>
<pre><code class="language-bash">$ ./batched.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:03:11] Total time: 11.66556s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ...
}
</code></pre>
<h3 id="shortlisting">Shortlisting</h3>
<p>Further improvements in translation speed can be achieved by avoiding the largest multiplication at the output layer, by supplying the model with a lexical shortlist. The lexical shortlist filters the output layer so that it only contains words that are deemed likely translations of the input sentence, potentially reducing its size from 30k to about 1k entries. We will walk through the construction of a lexical shortlist later, but you can already see the speed improvement from the example script:</p>
<pre><code class="language-bash">$ ./batched.shortlisted.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:05:03] Total time: 8.41856s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ...
}
</code></pre>
<p>The only change we made is the inclusion of the shortlist parameter inside <code>config.batched.shortlisted.yml</code>:</p>
<pre><code class="language-bash">shortlist:
    - lex.s2t.bin
    - false
</code></pre>
<p>The lexical shortlist is similar to a dictionary containing frequently associated source and target words. We will look at how it's trained <a href="#producingwordalignments">later</a> in this tutorial.</p>
<h3 id="quantisation">Quantisation</h3>
<p>To further improve performance, we can perform the neural network inference in a lower-precision numerical format, such as 8-bit integers, which runs much faster on CPUs than plain old floats. To do so, we must first convert our model to the 8-bit integer format:</p>
<pre><code class="language-bash">$MARIAN/marian-conv -f model.npz -t model.intgemm8.bin -g intgemm8
</code></pre>
<p>and then decode with it:</p>
<pre><code class="language-bash">$ ./batched.shortlisted.8bit.translation.sh --cpu-threads N
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 16
[2021-11-19 12:05:58] Total time: 7.12242s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.0,
 ...
}

</code></pre>
<p>That's all exciting, right? Now let's dive deep down and see how we can get to those student models.</p>
<h2 id="resultssummary">Results summary</h2>
<p>Here is a summary of the results that I ran:</p>
<table>
<thead>
<tr>
<th>System, 16 threads, 3000 sentences</th>
<th>time</th>
<th>BLEU</th>
</tr>
</thead>
<tbody>
<tr>
<td>Teacher basic</td>
<td>4597s</td>
<td>36.5</td>
</tr>
<tr>
<td>Teacher batched</td>
<td>652s</td>
<td>36.5</td>
</tr>
<tr>
<td>Student basic</td>
<td>84s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched</td>
<td>11s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched shortlisted</td>
<td>8s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched quantised shortlisted</td>
<td>7.1s</td>
<td>35.0</td>
</tr>
</tbody>
</table>
<p>Unfortunately, as the systems get faster, the runtime differences between systems get muddled by the initialisation overhead. In order to better show the effect of all our options on the runtime, I will repeat all student experiments using 1 thread:</p>
<table>
<thead>
<tr>
<th>System, 1 thread, 3000 sentences</th>
<th>time</th>
<th>BLEU</th>
</tr>
</thead>
<tbody>
<tr>
<td>Student basic</td>
<td>189s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched</td>
<td>38s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched shortlisted</td>
<td>27s</td>
<td>35.2</td>
</tr>
<tr>
<td>Student batched quantised shortlisted</td>
<td>21s</td>
<td>35.0</td>
</tr>
</tbody>
</table>
<p>We get a huge gain when going from unbatched to batched translation, regardless of the model size. We gain about 30% in efficiency from shortlisting and then another 23% from quantising to 8-bit integers. We do sustain some drop in BLEU along the way, though.</p>
<h1 id="howtotrainyourownefficientmodel">How to train your own efficient model</h1>
<p>In this part of the tutorial we will go through the steps necessary for preparing an efficient machine translation system. Due to time and resource constraints (AKA training taking for-fuckin-ever), we will not actually train any models live; instead we will walk through the recipes and configurations you need.</p>
<h2 id="trainingagoodteacher">Training a good teacher</h2>
<p>A good student cannot possibly hope to learn without the help of a marvelous teacher.</p>
<h3 id="cleanyourdata">Clean your data!</h3>
<p>Before starting training, do your customary data cleaning. Most people use custom scripts tailored to the specific language pair, but the bare minimum can be achieved using the good old Moses <a href="https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl" target="_blank">clean-corpus-n.perl</a>:</p>
<pre><code class="language-bash">$ mosesdecoder/scripts/training/clean-corpus-n.perl data/corpus.uncleaned $SRC $TRG data/corpus.tok 1 100
</code></pre>
<p>In this particular example, sentences shorter than 1 token or longer than 100 tokens are excluded. Adjust these thresholds accordingly.</p>
<h3 id="trainareversemodelforbacktranslation">Train a reverse model for back-translation</h3>
<p>You can skip this step if you do not have any monolingual data.</p>
<p>In order to train a high quality system, we usually take advantage of the available monolingual resources in the target language. This is done by training a translation model in the reverse direction and then translating the monolingual corpora with it. For more details, please check this <a href="https://aclanthology.org/P16-1009/" target="_blank">paper</a>. A typical Marian configuration for this purpose would look like this:</p>
<pre><code class="language-bash">devices: 0 1 2 3
workspace: 12000
log: model/train.log
model: model/model.npz
train-sets:
  - train.clean.de
  - train.clean.en
seed: 1111
vocabs:
  - model/vocab.deen.spm
  - model/vocab.deen.spm
task: transformer-base
dim-vocabs:
  - 32000
  - 32000
shuffle-in-ram: true
# Validation set options
valid-sets:
  - dev.de
  - dev.en
valid-freq: 5000
valid-metrics:
  - ce-mean-words
  - perplexity
  - bleu-detok
disp-freq: 1000
early-stopping: 10
beam-size: 6
normalize: 0.6
max-length-factor: 3
maxi-batch-sort: src
mini-batch-fit: true
valid-mini-batch: 8
valid-max-length: 100
valid-translation-output: model/valid.bpe.en.output
keep-best: true
valid-log: model/valid.log
</code></pre>
<p>This configuration assumes a 16 GB GPU. If you have a smaller GPU, please adjust down the <code>workspace</code> accordingly.</p>
<p>After the model is trained, translate your monolingual data using the <code>output-sampling</code> option, which has been shown to produce better results. Furthermore, monolingual data might be dirty, so make sure you set the <code>max-length</code> and <code>max-length-crop</code> options. Append these to the configuration file you use for translation:</p>
<pre><code class="language-bash">output-sampling: true
max-length: 100
max-length-crop: true
</code></pre>
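<p>A minimal sketch of the back-translation step itself might look like the one below, assuming <code>translate.yml</code> is the reverse model's decoding configuration with the three options above appended, and <code>mono.trg</code> / <code>synthetic.src</code> are placeholder names for the monolingual target-language corpus and its back-translation:</p>
<pre><code class="language-bash"># Hypothetical sketch: translate the monolingual target-language corpus with the
# reverse model to obtain a synthetic source side. File names are placeholders.
# $MARIAN points to your marian build directory, as in the earlier scripts.
$MARIAN/marian-decoder -c translate.yml -d 0 1 2 3 \
    --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src -w 8000 \
    --quiet-translation --log backtranslate.log \
    -i mono.trg -o synthetic.src
</code></pre>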
<p>Some works suggest that synthetic data should be <a href="https://aclanthology.org/W19-5206/" target="_blank">tagged</a> when appending it to the true data. How much backtranslated data to use is an open question. Generally, the more the better, although you may want to balance/upweight/upsample the true data if there is too little of it compared to the backtranslated data. Refer to <a href="https://aclanthology.org/D18-1045/" target="_blank">Facebook's</a> wisdom.</p>
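<p>If you do tag, a hedged sketch of the step is shown below: prepend a tag token to the <em>synthetic</em> source sentences before mixing them with the genuine parallel data. The <code>&lt;BT&gt;</code> token and the file names are illustrative conventions, not something Marian mandates:</p>
<pre><code class="language-bash"># Tag the back-translated (synthetic) source side, then concatenate with true data.
sed 's/^/&lt;BT&gt; /' synthetic.src | gzip &gt; synthetic.tagged.src.gz
gzip -c mono.trg &gt; mono.trg.gz
cat corpus.src.gz synthetic.tagged.src.gz &gt; train.src.gz
cat corpus.trg.gz mono.trg.gz &gt; train.trg.gz
</code></pre>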
<h3 id="traintheteacher">Train the teacher</h3>
<p>Once you have your synthetic backtranslated data (optional) and your parallel corpora, you can proceed to training your teacher model.</p>
<h4 id="configurationchoice">Configuration choice</h4>
<p>You can train your teacher with one of two configuration presets: either <code>task: transformer-base</code> or <code>task: transformer-big</code>. As a rule of thumb, if you have a high-resource language pair (&gt;5M sentence pairs), you will likely see gains from using <code>transformer-big</code>.</p>
<ul>
<li>For <code>transformer-base</code> you can reuse the configuration setting posted earlier.</li>
<li>For <code>transformer-big</code>, adjust down the workspace to 10000 and add the option <code>optimizer-delay: 2</code> to the configuration.</li>
</ul>
<p>No need to adjust any other configuration settings, as the <a href="https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L73" target="_blank">task alias</a> takes care of assigning the rest of the model hyperparameters.</p>
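<p>Concretely, relative to the <code>transformer-base</code> configuration posted earlier, the <code>transformer-big</code> variant only needs these lines changed:</p>
<pre><code class="language-bash">task: transformer-big # instead of transformer-base
workspace: 10000      # adjusted down from 12000 to fit the larger model
optimizer-delay: 2    # accumulate gradients over 2 batches
</code></pre>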
<h4 id="ensembling">Ensembling</h4>
<p>One very easy way to improve the translation quality of the teacher is to produce an ensemble of systems that translate together. This is done by training identical systems, initialising them with different random seeds. The more systems, the better, although the returns are diminishing.</p>
<p>For example, if we want an ensemble of two systems, we need two separate configuration files for training, in which the <code>seed</code> parameter differs. Configuration one would have <code>seed: 1111</code>, whereas configuration two would have <code>seed: 2222</code>. At decoding time, don't forget to load all produced models as shown <a href="#theteachermodel">earlier in the tutorial</a>.</p>
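<p>As a sketch, the two training configurations can be identical copies of the earlier one, differing only in the lines below (the file names are placeholders I picked for illustration):</p>
<pre><code class="language-bash"># config.seed1111.yml
model: model1/model.npz
seed: 1111

# config.seed2222.yml
model: model2/model.npz
seed: 2222
</code></pre>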
<h2 id="trainingthestudent">Training the student</h2>
<p>The student model is trained to approximate the teacher distribution. In this manner we can achieve translation quality approximating that of the teacher model at just a fraction of the computational cost. For more information, check <a href="https://aclanthology.org/D16-1139/" target="_blank">this paper</a>. An up-to-date guide with code and scripts for training student models can be found on <a href="https://github.com/browsermt/students/tree/master/train-student" target="_blank">github</a>.</p>
<h3 id="producingthedistilledtrainingdata">Producing the distilled training data</h3>
<p>The training data for the student is produced by translating the <strong>complete</strong> training set with your teacher ensemble. This is a cumbersome task, because the teacher model is heavyweight. I recommend that you use the same settings as the ones discussed <a href="#minibatchsize">earlier in this tutorial</a>.</p>
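<p>A hedged sketch of that decoding run, reusing the batched teacher setup from earlier (the training-set file names are placeholders, and <code>--fp16</code> only applies if you decode on a GPU, as discussed below):</p>
<pre><code class="language-bash"># Hypothetical sketch: translate the full training source with the teacher
# ensemble to produce the distilled target side of the student training data.
# $MARIAN points to your marian build directory, as in the earlier scripts.
cd enes.teacher.bigx2
zcat ../data/train.en.gz &gt; train.en
$MARIAN/marian-decoder -c config.batched.yml -d 0 1 2 3 --fp16 \
    --quiet --quiet-translation --log distill.log \
    -i train.en -o train.es.translated
gzip train.es.translated
</code></pre>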
<p>It is possible to quantise the teacher model(s) before translating, but depending on the language pair and configuration, this might lead to a substantial drop in the quality of the translated data. More details later.</p>
<p>If you are decoding on a fairly recent NVIDIA GPU, feel free to add the <code>fp16: true</code> to the decoder configuration, in order to use 16 bit float decoding.</p>
<p>More details about these will be given <a href="#advancedtopics">later in the tutorial</a>.</p>
<h3 id="producingwordalignments">Producing word alignments</h3>
<p>Before we can proceed to training the student model, we need to produce IBM-model word alignments using <a href="https://github.com/clab/fast_align" target="_blank">fast-align</a>. The alignments are needed both for the guided alignment training baked into our students and for building the lexical shortlist.</p>
<p>The script that takes care of everything is located on <a href="https://github.com/browsermt/students/tree/master/train-student/alignment" target="_blank">github</a>. You need to edit <code>generate-alignment-and-shortlist.sh</code> to point it to the locations of the corpora and the trained SentencePiece (SPM) vocabulary.</p>
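<p>Under the hood, the alignment part of that script boils down to something like the sketch below, run on the SentencePiece-segmented corpus. The file names are placeholders and the exact invocations may differ from the script; the subsequent lexical shortlist extraction is handled by the same script:</p>
<pre><code class="language-bash"># Hypothetical sketch of producing symmetrised word alignments with fast_align.
# corpus.spm.en / corpus.spm.es are the subword-segmented training files.
paste corpus.spm.en corpus.spm.es | sed 's/\t/ ||| /' &gt; corpus.en-es
fast_align -i corpus.en-es -d -o -v    &gt; forward.align
fast_align -i corpus.en-es -d -o -v -r &gt; reverse.align
atools -i forward.align -j reverse.align -c grow-diag-final-and &gt; corpus.aln
gzip corpus.aln
</code></pre>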
<h3 id="trainingthestudent">Training the student</h3>
<p>Now that the teacher model(s) have translated the full training set, we can use that as the input to the student model. The student model is trained on the original source text and the <em>synthetic</em>, translated target text. In our experiments so far, the student models were trained using this configuration, dubbed <code>tiny</code>:</p>
<pre><code class="language-bash">$ cat config.yml
dec-cell: ssru
dec-cell-base-depth: 2
dec-cell-high-depth: 1
dec-depth: 2
dim-emb: 256
enc-cell: gru
enc-cell-depth: 1
enc-depth: 6
enc-type: bidirectional
tied-embeddings-all: true
transformer-decoder-autoreg: rnn
transformer-dim-ffn: 1536
transformer-ffn-activation: relu
transformer-ffn-depth: 2
transformer-guided-alignment-layer: last
transformer-heads: 8
transformer-no-projection: false
transformer-postprocess: dan
transformer-postprocess-emb: d
transformer-preprocess: &quot;&quot;
transformer-tied-layers:
  []
transformer-train-position-embeddings: false
type: transformer
</code></pre>
<p>and the following training script:</p>
<pre><code class="language-bash">#!/bin/bash -v

# Set GPUs.
GPUS=&quot;0 1 2 3&quot;
MARIAN=../../marian-dev/build

SRC=en
TRG=es

# Add symbolic links to the training files.
test -e corpus.$SRC.gz || exit 1    # e.g. ../../data/train.en.gz
test -e corpus.$TRG.gz || exit 1    # e.g. ../../data/train.es.translated.gz
test -e corpus.aln.gz  || exit 1    # e.g. ../../alignment/corpus.aln.gz
test -e lex.s2t.gz     || exit 1    # e.g. ../../alignment/lex.s2t.pruned.gz
test -e vocab.spm      || exit 1    # e.g. ../../data/vocab.spm

# Validation set with original source and target sentences (not distilled).
test -e devset.$SRC || exit 1
test -e devset.$TRG || exit 1

mkdir -p tmp

$MARIAN/marian \
    --model model.npz -c student.tiny11.yml \
    --train-sets corpus.{$SRC,$TRG}.gz -T ./tmp --shuffle-in-ram \
    --guided-alignment corpus.aln.gz \
    --vocabs vocab.spm vocab.spm --dim-vocabs 32000 32000 \
    --max-length 200 \
    --exponential-smoothing \
    --mini-batch-fit -w 9000 --mini-batch 1000 --maxi-batch 1000 --devices $GPUS --sync-sgd --optimizer-delay 2 \
    --learn-rate 0.0003 --lr-report --lr-warmup 16000 --lr-decay-inv-sqrt 32000 \
    --cost-type ce-mean-words \
    --optimizer-params 0.9 0.98 1e-09 --clip-norm 0 \
    --valid-freq 5000 --save-freq 5000 --disp-freq 1000 --disp-first 10 \
    --valid-metrics bleu-detok ce-mean-words \
    --valid-sets devset.{$SRC,$TRG} --valid-translation-output devset.out --quiet-translation \
    --valid-mini-batch 64 --beam-size 1 --normalize 1 \
    --early-stopping 20 \
    --overwrite --keep-best \
    --log train.log --valid-log valid.log
</code></pre>
<p>The script takes care of checking for all the necessary files and will fail if they are missing.</p>
<p>There are other possible student configurations. We have a <code>base</code> configuration prefix <a href="https://github.com/browsermt/students/tree/master/train-student/models" target="_blank">on github</a>, which is slower than the <code>tiny</code> shown above but delivers better translation quality. Experiment with different configurations until you find something acceptable.</p>
<p>Note that the student model will take forever to train. You really want to overfit to the outputs of the teacher, so going over the data for many epochs is necessary. You may see the BLEU score stalling for many consecutive validation steps before improving again.</p>
<h3 id="quantisationfinetuning">Quantisation fine-tuning</h3>
<p>Finally, we will talk about quantisation fine-tuning. When we quantise the model to a lower precision, we damage it. The model might not cope well with that damage right out of the box, so we are going to fine-tune it by training very briefly with a GEMM that mimics the damage from quantisation. To do this, add the following to the configuration file:</p>
<pre><code class="language-bash">quantize-bits: 8
</code></pre>
<p>Fine-tuning is really fast. The model's quality will start going down after a few thousand mini-batches, so make sure you validate frequently in order not to miss the sweet spot (<code>valid-freq: 200</code> would be good).</p>
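<p>A hedged sketch of kicking off the fine-tuning run, assuming <code>finetune.yml</code> is a copy of the student training configuration with <code>quantize-bits: 8</code> and <code>valid-freq: 200</code> added (the checkpoint name produced by <code>--keep-best</code> may differ on your setup):</p>
<pre><code class="language-bash"># Continue training the converged student briefly with the quantisation-aware GEMM.
# $MARIAN points to your marian build directory, as in the earlier scripts.
cp model.npz.best-bleu-detok.npz model.finetuned.npz
$MARIAN/marian -c finetune.yml --model model.finetuned.npz \
    --save-freq 200 --disp-freq 100
</code></pre>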
<h3 id="results">Results</h3>
<p>A significant amount of compute time is required to train an efficient student model, so we can't do that for the duration of this tutorial. However, we can show you what you can achieve in practice! Take a look at our blazing fast, privacy focused, cross-platform translation app <a href="https://translatelocally.com" target="_blank">translateLocally</a>, which is only made possible when utilising all of the techniques outlined above:</p>
<img align="middle" width="85%" src="https://nbogoychev.com/content/images/2021/11/translatelocally.gif" alt="Efficient machine translation">
<h2 id="caveats">Caveats</h2>
<p>Optimising for speed doesn't come without caveats. Translation quality drops to a certain extent. The drop is not uniform across models, so test before you deploy!</p>
<ul>
<li>Quantisation affects different models differently. As a rule of thumb, the smaller the student, the more it loses from quantisation, but very large teacher models have at times been shown to work quite poorly with quantisation too. Always test before you deploy!</li>
<li>Lexical shortlisting is known to cause quality issues when used with a very small mini-batch size. Proceed with caution when translating single sentences with a lexical shortlist. This can be somewhat ameliorated by making the shortlist more conservative, i.e. increasing the number of vocabulary items it lets through during construction: <code>$MARIAN/marian-conv --shortlist lex.s2t.gz 100 100 0 --vocabs vocab.esen.spm vocab.esen.spm -d lex.s2t.bin</code>. <code>100 100</code> means take the top 100 words from the vocabulary, and the top 100 translations of each word according to the shortlist. Increasing these values further will slow down translation, but improve quality.</li>
<li>Humans do <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.68.pdf" target="_blank">notice the difference</a> between teacher and student models. METEOR scores, which have been shown to <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.57.pdf" target="_blank">correlate better with human judgements</a>, also favour teacher models. There is no free performance gain.</li>
</ul>
<h1 id="advancedtopics">Advanced topics</h1>
<p>In this section we will talk about advanced topics that you may be interested in if you are in the business of providing commercial machine translation systems or want to do a PhD in the subfield of machine translation.</p>
<h2 id="hyperparametertuning">Hyperparameter tuning</h2>
<p>Carefully tune your hyperparameters when decoding! Different combinations of models and hardware behave differently. More specifically, <code>mini-batch: 16</code> is not a hard and fast setting. Depending on your CPU, the amount of cache it has and the amount of system memory (RAM) you have, other settings may be optimal. Experiment with the <code>mini-batch</code>, <code>maxi-batch</code> and <code>workspace</code> parameters until you arrive at an optimal configuration for your specific setup.</p>
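<p>A small brute-force sweep over the example scripts could look like the sketch below; the value ranges are illustrative, so adapt them to your hardware:</p>
<pre><code class="language-bash"># Hypothetical sweep over mini-batch and workspace for the batched student script.
for mb in 8 16 32 64; do
  for ws in 2000 4000 8000; do
    echo &quot;mini-batch=$mb workspace=$ws&quot;
    ./batched.translation.sh --cpu-threads 16 --mini-batch $mb --workspace $ws
  done
done
</code></pre>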
<h2 id="efficientgpudecoding">Efficient GPU decoding</h2>
<p>Efficient GPU decoding differs from efficient CPU decoding in several key aspects:</p>
<ul>
<li>GPUs need a much larger mini-batch size to reach their full potential. While CPU decoding performance stops scaling around a mini-batch of 16-24, GPU decoding practically scales until you run out of memory. In order to optimise for speed on the GPU, you need to push the <code>workspace</code> to the limits of the device memory, as well as the <code>mini-batch</code> size. For 24 GB GPUs like the 3090, <code>mini-batch: 768</code> and <code>workspace: 18000</code> are a good place to start your binary search.</li>
<li>Shortlists don't improve translation speed on the GPU, which is convenient: just skip the shortlist.</li>
<li>We have experimented with 8-bit integer decoding on the GPU, but we failed to get any performance gains compared to just using <code>float16</code> decoding. In order to use that mode, just set <code>fp16: true</code> in the decoder configuration. You should get about a 20% speed improvement compared to fp32 decoding, as well as the ability to use a much larger mini-batch size.</li>
</ul>
<p>GPUs are in general really fast, even when compared to decoding on multiple CPUs. Running against the <a href="#minibatchsize">batched</a> student from the previous section:</p>
<pre><code class="language-bash">$ ./batched.translation.sh --cpu-threads 12
### Translating wmt13 en-es on the CPU. Extra flags --cpu-threads 12
[2021-11-22 17:15:25] Total time: 14.46366s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ..
</code></pre>
<p>Running on 1 GPU (the <code>-d 0</code> flag specifies that we should run on GPU 0):</p>
<pre><code class="language-bash">$ ./batched.translation.sh -d 0
### Translating wmt13 en-es on the CPU. Extra flags -d 0
[2021-11-22 17:17:12] Total time: 3.61907s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.2,
 ...
</code></pre>
<p>Providing better hyperparameters for the GPU:</p>
<pre><code class="language-bash">$ ./batched.translation.sh -d 0 --workspace 18000 --mini-batch 768 --fp16 --maxi-batch 3000
### Translating wmt13 en-es on the CPU. Extra flags -d 0 --workspace 18000 --mini-batch 768 --fp16 --maxi-batch 3000
[2021-11-22 17:17:51] Total time: 1.44297s wall
{
 &quot;name&quot;: &quot;BLEU&quot;,
 &quot;score&quot;: 35.3,
</code></pre>
<p>A fully optimised GPU run is more than 10X faster on this very small example. If we increase the size of the dataset, the GPU will easily be 100X+ faster. In our case the GPU is an A100 and the CPU is a Zen 2 EPYC processor.</p>
<h2 id="advancedquantisationoptions">Advanced quantisation options</h2>
<p>Marian supports two integer backends, <a href="https://github.com/pytorch/FBGEMM" target="_blank">fbgemm</a> and <a href="https://github.com/kpu/intgemm" target="_blank">intgemm</a>, which deliver different performance depending on the model type and architecture.</p>
<p>Intgemm is hardware agnostic and has dedicated codepaths for both very old devices (SSSE3) and very new devices (AVX512VNNI). Fbgemm, on the other hand, only supports AVX2 and AVX512. Intgemm's model format is also hardware agnostic, whereas in order to use fbgemm one needs to know in advance what hardware the model is going to run on. <code>marian-conv --help</code> will give you more details.</p>
<p>Finally, both intgemm and fbgemm have an int16 format, which is not as fast as the int8 ones, but could potentially work better in cases where 8-bit quantisation damages translation quality too much.</p>
<h2 id="usingspeedorientedforkofmarian">Using speed oriented fork of Marian</h2>
<p>This tutorial describes what can be achieved with marian-dev master alone. There exists however an experimental version of marian focused on speed as part of the <a href="https://browser.mt" target="_blank">bergamot</a> project. How to use it, together with tutorial for creating models can be found on <a href="https://github.com/browsermt/students" target="_blank">github</a>. If you are interested in running a GPU fork with experimenta nvidia patches, you can also find it on <a href="https://github.com/XapaJIaMnu/marian-dev/tree/8bitgpu" target="_blank">github</a>.</p>
<h1 id="researchdirections">Research directions</h1>
<p>In this section we will briefly go over current research directions for efficient MT. This is all bleeding edge stuff that I have seen in papers, but not in practice.</p>
<h2 id="deepencodershallowdecoder">Deep Encoder/Shallow decoder</h2>
<p>The most computationally heavy part of machine translation inference is the decoder, because this is where the autoregressive part of the computation happens, whereas the computation in the encoder happens only once. Based on that, it has been <a href="https://arxiv.org/abs/2006.10369" target="_blank">suggested</a> that one can change the standard 6-6 architecture to a 12-1 one without a loss of accuracy, while significantly increasing translation speed. You should experiment to discover better student and teacher architectures!</p>
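<p>In Marian terms, this amounts to nothing more than changing the depth options in the training configuration; a hedged sketch of the relevant lines:</p>
<pre><code class="language-bash">enc-depth: 12 # deep encoder
dec-depth: 1  # shallow (single-layer) decoder
</code></pre>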
<h2 id="widermodelsnotdeeper">Wider models, not deeper</h2>
<p>Once you get into the domain of very high-resource language pairs (50M+ sentences), increasing the number of parameters of your neural network architecture once again becomes relevant. Experience has shown that increasing the <em>width</em> of the model (meaning the dimension of the embeddings/RNN/hidden layers/attention) is more stable than increasing the depth of the model. Very deep neural networks sometimes fail to train at all, but very wide neural networks don't seem to suffer from the same shortcoming. If you have the data and the compute, you should go wide, not deep!</p>
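<p>As a hedged sketch, widening a transformer in Marian means increasing the dimension options rather than the depth ones; the exact values below are illustrative only:</p>
<pre><code class="language-bash">dim-emb: 2048             # wider embeddings / hidden states
transformer-dim-ffn: 8192 # wider feed-forward layer
transformer-heads: 16     # more attention heads
</code></pre>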
<h2 id="knnbasedshortlisting">KNN based shortlisting</h2>
<p>As we have shown, lexical shortlisting provides a noticeable gain in inference speed, but it may lead to quality issues. IBM models are bad at capturing excessive subword segmentation or idiomatic expressions. As a result, lexical shortlists produced with IBM models favour more literal translations and struggle with cases where there is heavy subword segmentation. In order to alleviate this issue, the community is exploring KNN-based shortlisting (refer to <a href="https://arxiv.org/abs/1903.03129" target="_blank">this</a> and <a href="https://arxiv.org/abs/1806.00588" target="_blank">this</a>).</p>
<p>Marian already supports this via the option <code>--output-approx-knn</code>, although the feature is still considered experimental. For starters, in order to use it, the model must NOT have a bias at the output layer, so the configuration option <code>--output-omit-bias</code> must be specified at training time.</p>
<h2 id="pruning">Pruning</h2>
<p>Training a teacher model, translating the training set, and then training a student model is very demanding in terms of computational resources. An alternative approach is to prune the parameters of the teacher model as it is trained, reducing the model size to something of a similar magnitude to a student. Unfortunately, so far student models achieve better Pareto speed/quality trade-offs than pruned models, but research is ongoing. Check out existing work on pruning whole <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.116.pdf" target="_blank">models</a> or just <a href="https://aclanthology.org/2020.emnlp-main.211/" target="_blank">attention</a>.</p>
<h1 id="helpfulreading">Helpful reading</h1>
<p>Congratulations on getting this far in the tutorial! That means you are really interested in making efficient machine translation work. Here is a list of papers that might be useful starting points for anyone wanting to go deeper into efficient MT work.</p>
<ul>
<li>On knowledge distillation: <a href="https://aclanthology.org/D16-1139/" target="_blank">Sequence-Level Knowledge Distillation</a>.</li>
<li>On training efficient MT systems with marian for the fast machine translation competition, years <a href="https://aclanthology.org/D19-5632/" target="_blank">2019</a>, <a href="https://aclanthology.org/2020.ngt-1.26/" target="_blank">2020</a>, <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.74.pdf" target="_blank">2021</a></li>
<li>On KNN output layer shortlist <a href="https://arxiv.org/abs/1806.00588" target="_blank"> Fast Locality Sensitive Hashing for Beam Search on GPU</a> and <a href="https://aclanthology.org/2022.wmt-1.79/" target="_blank">Revisiting Locality Sensitive Hashing for Vocabulary Selection in Fast Neural Machine Translation</a>.</li>
<li>On pruning: <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.116.pdf" target="_blank">models</a> or <a href="https://aclanthology.org/2020.emnlp-main.211/" target="_blank">attention</a> and again  <a href="https://aclanthology.org/P19-1580/" target="_blank">attention</a>.</li>
<li>On architectures: <a href="https://arxiv.org/abs/2006.10369" target="_blank">Deep encoder, shallow decoder</a>.</li>
<li>On <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.68.pdf" target="_blank">human evaluation of student models</a> and general advice regarding <a href="http://www.statmt.org/wmt21/pdf/2021.wmt-1.57.pdf" target="_blank">automatic and human</a> evaluation.</li>
<li>On <a href="https://aclanthology.org/P16-1009/" target="_blank">backtranslation</a>, <a href="https://aclanthology.org/W19-5206/" target="_blank">tagged backtranslation</a>, backtranslation at <a href="https://aclanthology.org/D18-1045/" target="_blank">scale</a> and <a href="https://arxiv.org/pdf/1903.06059.pdf" target="_blank">output sampling</a> for backatranslation.</li>
<li>The <a href="https://marian-nmt.github.io/examples/mtm2019" target="_blank">previous</a> efficient MT tutorial from the 2019 MT marathon.</li>
</ul>
<p>Thank you everybody for participating in this tutorial! I hope it was helpful! Special thanks to <a href="https://scholar.google.com/citations?user=82Uy1aAAAAAJ" target="_blank">Roman Grundkiewicz</a> for proofreading and adding suggestions to the tutorial.</p>
<p>Nick</p>
<div class="myrow">
  <div class="mycolumn">
   <a href="https://unbabel.com" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/unbabel.png" alt="Efficient machine translation" style="width:60%">
    </a>
  </div>
  <div class="mycolumn">
    <a href="https://www.tilde.com" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/tilde.png" alt="Efficient machine translation" style="width:75%">
    </a>
  </div>
  <div class="mycolumn">
    <a href="https://edinburghnlp.inf.ed.ac.uk" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/edinburgh_colour.png" alt="Efficient machine translation" style="width:120%">
    </a>
  </div>
  <div class="mycolumn">
    <a href="https://marian-project.eu" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/marian-cef.png" alt="Efficient machine translation" style="width:100%">
    </a>
  </div>
  <div class="mycolumn">
    <a href="https://ec.europa.eu/inea/en/connecting-europe-facility" target="_blank">
     <img src="https://nbogoychev.com/content/images/2021/11/eu.png" alt="Efficient machine translation" style="width:120%">
    </a>
  </div>
</div>
<!--kg-card-end: markdown--><figure class="kg-card kg-embed-card"><iframe width="356" height="200" src="https://www.youtube.com/embed/mrC3YOW-NQs?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure>]]></content:encoded></item><item><title><![CDATA[Not all parameters are born equal! Attention is mostly what you need!]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Greetings fellow NLP-kind. I got a paper in <a href="https://blackboxnlp.github.io" target="_blank">BlackboxNLP 2021</a>, the awesome workshop that aims to shed light on how, what, and why exactly happens inside deep neural networks... So I am going to blog about it.</p>
<h2 id="thepremise">The premise</h2>
<p>The Deep Neural Network is an universal function approximator, that achieves</p>]]></description><link>https://nbogoychev.com/not-all-parameters-are-born-equal-attention-is-mostly-what-you-need/</link><guid isPermaLink="false">6151e2293a50c0063d3350da</guid><category><![CDATA[research]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Thu, 30 Sep 2021 11:40:29 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2021/09/cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2021/09/cover.jpg" alt="Not all parameters are born equal! Attention is mostly what you need!"><p>Greetings fellow NLP-kind. I got a paper in <a href="https://blackboxnlp.github.io" target="_blank">BlackboxNLP 2021</a>, the awesome workshop that aims to shed light on how, what, and why exactly happens inside deep neural networks... So I am going to blog about it.</p>
<h2 id="thepremise">The premise</h2>
<p>The Deep Neural Network is a universal function approximator that achieves surprisingly good results on a variety of tasks, thanks to its staggeringly large number of parameters and the infinite power and wisdom of backpropagation.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2021/09/servers.jpg" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">Also subject to the availability of copious amounts of GPUs.</font></center>
<p>But is it really necessary to train all of those parameters? Could we just get away with training a small subset of those parameters and achieve similar performance? If we can, indeed, are some parameters more important than others?</p>
<h2 id="thestudy">The study</h2>
<p>We study the value of training neural network parameters, as opposed to initialising them at random and freezing them. We perform a case study using neural machine translation and neural language modeling tasks, and transformers of various sizes and shapes as the architecture.</p>
<p>We isolate three separate groups of parameters in a transformer: The embedding layer, the attention layer, and the FFN layer.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/transformer.png" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">A simplified overview of a transformer.</font></center>
<p>We perform an ablation study where one or two of the three components are initialised at random, frozen, and never trained afterwards until the neural network converges, and note how much quality has been affected.</p>
<h2 id="thefindings">The findings</h2>
<p>We studied three different transformer presets for neural machine translation: big, base and tiny. In general, we found that bigger transformers have more built-in redundancy and cope better with parts of their parameters being frozen compared to smaller transformers.</p>
<h3 id="attentionismostlywhatyouneed">Attention is mostly what you need.</h3>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/top_half.png" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">transformer-big on Turkish-English, one frozen component.</font></center>
<p>We found that when freezing one component of a transformer, the time to convergence increases slightly in terms of number of epochs, and the BLEU scores drop slightly (4%-7%). Preset <strong>(3)</strong>, where we have a frozen and random FFN layer, and trainable embeddings and attention, performs the best, despite only 48% of the parameters being trainable.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/lower_half.png" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">transformer-big on Turkish-English, multiple frozen components.</font></center>
<p>When freezing multiple components at once, we found that the best results are achieved by having just the attention be trainable, although just having the FFN trainable produces surprisingly good results as well. The only time where the model completely fails to learn is if we just have the embeddings trainable.</p>
<h3 id="notallparametersarebornequalbuttheyareneverthelessnecessary">Not all parameters are born equal, but they are nevertheless necessary</h3>
<p>Despite the attention and the FFN layer being more or less self-sufficient and much more important than the embedding layer, this doesn't mean we can just remove the embedding layer, or reduce its size. We perform a number of experiments and note that when reducing the size of frozen and random components, the model's performance suffers, even if the trainable components are left untouched. This suggests that the trainable components make use of the randomly initialised transformations and the <strong>sheer number of parameters is more important than whether they are trainable or not</strong>.</p>
<p>In our  <a href="https://arxiv.org/abs/2010.11859" target="_blank">paper</a> we show detailed results in a variety of different combinations of neural network configurations, but the overall trend holds true across all of them.</p>
<h3 id="languagemodelsbehavedifferently">Language models behave differently</h3>
<p>Language models find trainable embeddings much more important for achieving lower perplexity than trainable FFN or attention layer. The drop in perplexity is also a lot more dramatic than the drop in BLEU for translation models.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/lm.png" alt="Not all parameters are born equal! Attention is mostly what you need!">
<center><font size="-2">Perplexity on an English transformer language model.</font></center>
<p>This suggests that our findings are likely to be task specific.</p>
<h2 id="implications">Implications</h2>
<ul>
<li>We question the vast majority of the transfer learning work that relies on pretrained <em>choose-your-sesame-street-character</em> embeddings for use in downstream tasks. We believe that one should always attempt to solve the downstream task with randomly initialised embeddings before using an off-the-shelf solution, in order to truly show the value (or lack thereof) of pretraining.</li>
<li>Could this mean that we could potentially use an RNG to generate the less important components on the fly during inference, enabling memory-efficient networks to be used on embedded devices?</li>
<li>Neural networks are still a blackbox. This particular work is one in a long line of research in <a href="https://en.wikipedia.org/wiki/Echo_state_network" target="_blank">Echo State networks</a>.</li>
</ul>
<p>We have a lot more details, experiments and analysis in the <a href="https://arxiv.org/abs/2010.11859" target="_blank">paper</a>. If interested, please check it out, and come talk to us at the <a href="https://blackboxnlp.github.io" target="_blank">BlackboxNLP 2021</a> poster session!!</p>
<p>Thank you for your time!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://www.pexels.com/photo/smiling-ethnic-woman-with-blank-poster-in-empty-flat-3758104" target="_blank">pexels</a> <a href="
https://pixabay.com/illustrations/banner-header-attention-caution-1165979/" target="_blank">pixabay</a> <a href="
https://pixabay.com/photos/server-space-the-server-room-dark-2160321/" target="_blank">pixabay</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Failing to do simple domain adaptation for Neural Machine Translation]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>So... The ever increasing number of papers accepted in the field of NLP and Machine Translation makes it virtually impossible for people to keep track of current ongoing research. Therefore, I have taken to writing blog posts about my failings as a researcher...</p>
<p>And, of course the best place to</p>]]></description><link>https://nbogoychev.com/failing-to-do-simple-domain-adaptation-for-neural-machine-translation/</link><guid isPermaLink="false">6149f9a93a50c0063d334ecf</guid><category><![CDATA[research]]></category><dc:creator><![CDATA[Nikolay Bogoychev]]></dc:creator><pubDate>Mon, 27 Sep 2021 15:00:27 GMT</pubDate><media:content url="https://nbogoychev.com/content/images/2021/09/translation_logo.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://nbogoychev.com/content/images/2021/09/translation_logo.jpg" alt="Failing to do simple domain adaptation for Neural Machine Translation"><p>So... The ever increasing number of papers accepted in the field of NLP and Machine Translation makes it virtually impossible for people to keep track of current ongoing research. Therefore, I have taken to writing blog posts about my failings as a researcher...</p>
<p>And, of course the best place to start with is my and <a href="https://pinzhenchen.github.io/" target="_blank">Pinzhen Chen's</a> <a href="https://arxiv.org/abs/2101.00421" target="_blank">paper</a> accepted at the 2021 edition of the <a href="https://insights-workshop.github.io" target="_blank">Workshop on Insights from Negative Results in NLP</a>.</p>
<h2 id="thepremise">The premise</h2>
<p>Machine translation suffers badly from domain mismatch issues. That is to say, when we train our model on text from news articles, we can't reasonably expect that same model to translate well a medicine textbook.</p>
<p>The textbook would have a completely different distribution of the most commonly used words. Some terminology that is not at all common in news articles would be very common in those textbooks.</p>
<p>Due to exposure bias during training, it would be difficult for the <em>news trained</em> model to produce those rare words during decoding. Instead, the model often prefers to hallucinate some common phrase seen in training such as <em>yesterday's football game...</em>.</p>
<h2 id="thebestsolution">The best solution</h2>
<p>Typically, the way to solve this problem reliably is to fine tune your translation model on some high quality parallel in-domain data before putting it out in the wild.</p>
<p>Unfortunately high quality parallel in-domain data is seldom available for the rare domain that you might need. Also, fine tuning has high computational cost.</p>
<h2 id="thefancysolution">The fancy solution</h2>
<p>The fancy solutions to this problem involve training in such a way that you diminish the model's bias towards its training data, so that it performs better on out-of-domain datasets.</p>
<p>Unfortunately such methods (minimum risk training, for example) are quite expensive to use in terms of computational cost and significantly complicate the training pipeline. Also, the results are not as good as <strong>The best solution</strong>.</p>
<h2 id="thestupidshortlistoursolution">The stupid shortlist (our) solution</h2>
<p>We decided that the way to solve this problem is to limit the output vocabulary of our translation model to a pre-selected vocabulary that better matches the domain in question. We do that by using an IBM word alignment model that would hopefully act as a regulariser.</p>
<p>Once we compute the word alignments we can extract a translation <em>shortlist</em> of words. This works sort of like a dictionary translation: When we read the source sentence, we limit the neural network to only produce words for the target sentence that are direct translations, according to the IBM models. In this way we hoped to prevent the neural network from exhibiting strong exposure bias when confused by out of domain text.</p>
<p>The advantage of our method is that it is much cheaper computationally than existing work, and it doesn't require in-domain parallel data.</p>
<h2 id="thestupidnbestreorderingoursolution">The stupid n-best reordering (our) solution</h2>
<p>We also decided to approach the problem from a different perspective. Even if the model hallucinates high scoring translations, maybe if we increase the beam size, some of the translations down the line will be more adequate. But how do we define the notion of adequacy?</p>
<p>Well, we made the assumption that all translations in a big <em>n-best</em> list would share some similarity with each other, and that the most adequate one would be the one that is most similar to every other translation. Therefore, we produced a big <em>n-best</em> list and scored every candidate translation against every other translation.</p>
<p>The advantage of this method is that it requires no in-domain data, but is somewhat slow during translation.</p>
<h2 id="results">Results</h2>
<p>The <strong>shortlist</strong> solution showed some promise, delivering an increase of several BLEU points in a constrained low-resource setting; however, it turned out that when the domain mismatch was too great, or the setting wasn't low-resource, our method showed no improvement.</p>
<p>The <strong>reordering</strong> method showed consistent improvement in BLEU score, but upon closer examination it turned out that it was preying on BLEU's length penalty: our method accidentally acted as a regulariser that favoured shorter translations, not more adequate ones.</p>
<img align="middle" width="70%" src="https://nbogoychev.com/content/images/2021/09/results.jpg" alt="Failing to do simple domain adaptation for Neural Machine Translation">
<center><font size="-2">Yeah, we don't have any results here, sorry just excuses.</font></center>
<h2 id="whydidwefail">Why did we fail?</h2>
<p>The main reason why our methods fail is, in short, vocabulary mismatch.</p>
<p>In machine translation we normally split rare words into subwords (commonly known as byte pair encoding). In a domain-mismatched scenario, we split words very often, to the point that individual tokens become nonsensical, as shown in this table.</p>
<table>
<thead>
<tr>
<th style="text-align:center">German</th>
<th style="text-align:center">English</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">sein Pilot hat nicht die volle Kontrolle .</td>
<td style="text-align:center">its p@@ il@@ ot is@@ n’t in control .</td>
</tr>
<tr>
<td style="text-align:center">und Z@@ eth@@ rid ? nur einen Strei@@ f@@ sch@@ uss .</td>
<td style="text-align:center">and , Z@@ eth@@ rid , just gr@@ aze it .</td>
</tr>
</tbody>
</table>
<p>When dealing with out-of-domain data, we are virtually guaranteed to encounter words that the model has either seen very few times or not at all, so a lot of subword splitting is warranted. In our experiments with a large domain mismatch, the average sentence length after applying subword splitting nearly <strong>doubled</strong>, meaning that our lexical shortlist obtained using IBM models could not offer much meaningful information.</p>
<img align="middle" width="60%" src="https://nbogoychev.com/content/images/2021/09/pexels-cottonbro-7703661.jpg" alt="Failing to do simple domain adaptation for Neural Machine Translation">
<center><font size="-2">The idea was worth it, but it really doesn't work.</font></center>
<p>If you want to learn more about it, go read our Insights from Negative Results in NLP 2021 <a href="https://arxiv.org/abs/2101.00421" target="_blank">paper</a>.</p>
<p>See ya at the workshop!</p>
<p>Nick</p>
<p><font size="-4"><center>Image sources: <a href="https://pixabay.com/photos/key-old-flower-nostalgic-vintage-5105878/2" target="_blank">pixabay</a> <a href="
https://pixabay.com/illustrations/result-excuse-me-fail-inability-to-3236280/" target="_blank">pixabay</a> <a href="
https://www.pexels.com/photo/dawn-fashion-people-woman-7703661/" target="_blank">pexels</a> </center></font></p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>