Linguistic Purism and Software Localization

People who translate MediaWiki and other pieces of software with which I am involved to languages of India, Philippines, African countries, and other places, often ask me: Should we translate to pure native neologisms, or reuse English words?

Linguistic purism is quite common in translation in general, and not only in software. I heard of a study that compared a corpus of texts written originally in Hebrew with a corpus of translated texts. The translated texts had considerably fewer loanwords. This may be surprising at first: how can translated texts have fewer loanwords?

But actually this makes sense: A translator is not creating a story, but re-telling a story that already exists. A good translator is supposed to understand the original story well first, and after understanding it, the translator is supposed to retell it in the target language. Many translators use the time that they don’t invest in creating the story itself to make the translation “purer” than the target language actually is.

A text that is originally written in Hebrew expresses how Hebrew-speaking people actually talk. Despite a history of creating many neologisms, some of which were successful, Hebrew speakers also use a lot of loanwords from English, Arabic, German, Russian, French, and other languages.

And that’s totally OK. Loanwords don’t “degrade” or “kill” a language, as some people say. Certainly not by themselves. English has many, many words from French, Latin, Norwegian, Greek, and other languages, and they didn’t kill it. Quite the contrary: English is one of the most successful languages in the world.

A good original writer creates verisimilitude, naturally or intentionally, by using actual living language. And actual living language has loan words. More in some languages, fewer in others, but it happens everywhere.

Software localization is a bit different from books, however. Books, both fiction and non-fiction, are art. At least to some degree, the language itself is an expressive tool there.

Software user interfaces are less about art and more about function. A piece of software is supposed to be usable and functional, and as easy and obvious to learn and use as possible. The less people need to learn it, the closer it is to perfection. And localized software is no different: it must, above all, be functional.

Everything else is secondary to usability. If the translation is beautiful, but the software can’t be used, the job is not done.

And this is the thing that is supposed to guide you when choosing whether to translate a term as a native word, possibly a neologism, or to transliterate a foreign term and make it a loanword: Will the user understand it and be able to use the software?

The choice is not as obvious as some people may think, however. Some people may think that loaning a word makes more sense because it’s already familiar, and this will make the software familiar.

But always ask yourself: Familiar to whom?

The translator, pretty much by definition, is supposed to know the source language, and to be able to use the software in the source language. Most often the source language is English. So most likely the translator is familiar with the source terminology.

But will the user be familiar with it?

The translated piece of software is supposed to be usable by people who aren’t familiar with that software in the source language, and, very importantly, by people who don’t know the source language at all.

So if you translate words like “log in”, “account”, “file”, “proxy”, “feed”, and so on, by simply transliterating them into the target language because they are “familiar” in this form, ask yourself: are they also familiar to people who don’t know English and who aren’t experienced with using computers?

Some Hebrew software localizers translate “proxy” as something like “intermediary server” (שרת מתווך), and some just write “proxy” in transliteration (פרוקסי). The rationale for “proxy” is usually this: “everyone knows what ‘proxy’ is, and no one knows what an intermediary server is”.

But is it really everyone? Or maybe it’s just you and your software developer friends?

To people who aren’t software developers, the function of “proxy” is pretty much as obscure as the function of “intermediary server”… or is it? Because the fully translated native term actually says something about what this thing does in familiar words.

Of course, if you are really sure that a foreign term is widely familiar to all people, then it’s OK to use, and often it’s better than using a “pure” neologism.

And that’s why I put “pure” in double quotes: The “purity” itself is not important. Functionality and usability are above all. Sometimes “purity” makes usability better. Sometimes it doesn’t. It’s not an exact science.

I’ll go even further: More often than many people would think, pondering the meaning and choosing a correct translation for a software user interface term may start fruitful conversations about the design of the original software and uncover usability flaws that affect everyone, including the people who use the software in English.

There are thousands of useful bug reports in MediaWiki and in other software projects in which I am involved that were filed by localizers who had translation difficulties. Many of these bugs were fixed, improving the experience for users in English and in all languages.

To sum up:

  • Purism shouldn’t be the most important goal, but it should be one of the optional tools that the translator uses.
  • Purism is neither absolutely bad nor absolutely good. It can be useful when used wisely in context, case by case.
  • Usability should be the most important goal of software localization.
  • Usability means usability for all, not just for your colleagues.
  • Localizers can improve the whole software in more ways than just translating strings.

Amir Aharoni’s Quasi-Pro Tips for Translating the Software That Powers Wikipedia, 2020 Edition

This is a new version of a post that was originally published in 2015. Much of it is the same, but there are several updates that justified publishing a new version.

Introduction

As you probably already know, Wikipedia is a website. A website has two components: the content and the user interface. The content of Wikipedia is the articles, as well as various discussion and help pages. The user interface is the menus around the articles and the various screens that let editors edit the articles and communicate to each other.

Another thing that you probably already know is that Wikipedia is massively multilingual, so both the content and the user interface must be translated.

Translation of articles is a topic for another post. This post is about getting all the user interface translated to your language, and doing it as quickly, easily, and efficiently as possible.

The most important piece of software that powers Wikipedia and its sister projects is called MediaWiki. As of today, there are more than 3,800 messages to translate in MediaWiki, and the number grows frequently. “Messages” in the MediaWiki jargon are strings that are shown in the user interface. Every message can and should be translated.

In addition to core MediaWiki, Wikipedia also uses many MediaWiki extensions. Some of them are very important because they are frequently seen by a lot of readers and editors. For example, these are extensions for displaying citations and mathematical formulas, uploading files, receiving notifications, mobile browsing, different editing environments, etc. There are more than 5,000 messages to translate in the main extensions, and over 18,000 messages to translate if you want to have all the extensions translated, including the most technical ones. There are also the Wikipedia mobile apps and additional tools for making automated edits (bots) and monitoring vandalism, with several hundreds of messages each.

Translating all of it probably sounds like an impossibly enormous job. It indeed takes time and effort, but the good news are that there are languages into which all of this was translated completely, and it can also be completely translated into yours. You can do it. In this post I’ll show you how.

A personal story

In early 2011 I completed the translation of all the messages that are needed for Wikipedia and projects related to it into Hebrew. All. The total, complete, no-excuses, premium Wikipedia experience, in Hebrew. Every single part of the MediaWiki software, extensions and additional tools was translated to Hebrew. Since then, if you can read Hebrew, you don’t need to know a single English word to use it.

I didn’t do it alone, of course. There were plenty of other people who did this before I joined the effort, and plenty of others who helped along the way: Rotem Dan, Ofra Hod, Yaron Shahrabani, Rotem Liss, Or Shapiro, Shani Evenshtein, Dagesh Hazak, Guycn2 and Inkbug (I don’t know the real names of the last three), and many others. But back then in 2011 it was I who made a conscious effort to get to 100%. It took me quite a few weeks, but I made it.

However, the software that powers Wikipedia changes every single day. So the day after the translations statistics got to 100%, they went down to 99%, because new messages to translate were added. But there were just a few of them, and it took me only a few minutes to translate them and get back to 100%.

I’ve been doing this almost every day since then, keeping Hebrew at 100%. Sometimes it slips because I am traveling or because I am ill. It slipped for quite a few months in 2014 because my first child was born and a lot of new messages happened to be added at about the same time, but Hebrew got back to 100%. It happened again in 2018 for the same happy reason, and went back to 100% after a few months. And I keep doing this.

With the sincere hope that this will be useful for helping you translate the software that powers Wikipedia completely to your language, let me tell you how.

Preparation

First, let’s do some work to set you up.

If you haven’t already, create a translatewiki.net account at the translatewiki.net main page. First, select the languages you know by clicking the “Choose another language” button (if the language into which you want to translate doesn’t appear in the list, choose some other language you know, or contact me). After selecting your language, enter your account details. This account is separate from your Wikipedia account, so if you already have a Wikipedia account, you need to create a new one. It may be a good idea to give it the same username.

After creating the account you have to make several test translations to get full translator permissions. This may take a few hours. Everybody except vandals and spammers gets full translator permissions, so if for some reason you aren’t getting them or if it appears to take too much time, please contact me.

Make sure you know your ISO 639 language code. You can easily find it on Wikipedia.

Go to your preferences, to the Editing tab, and add languages that you know to Assistant languages. For example, if you speak one of the native languages of South America like Aymara (ay) or Quechua (qu), then you probably also know Spanish (es) or Portuguese (pt), and if you speak one of the languages of Indonesia like Javanese (jv) or Balinese (ban), then you probably also know Indonesian (id). When available, translations to these languages will be shown in addition to English.

Familiarize yourself with the Support page and with the general localization guidelines for MediaWiki.

Add yourself to the portal for your language. The page name is Portal:Xyz, where Xyz is your language code.

Priorities, part 1

The translatewiki.net website hosts many projects to translate beyond stuff related to Wikipedia. It hosts such respectable Free Software projects as OpenStreetMap, Etherpad, MathJax, Blockly, and others. Also, not all the MediaWiki extensions are used on Wikimedia projects. There are plenty of extensions, with thousands of translatable messages, that are not used by Wikimedia, but only on other sites, but they use translatewiki.net as the platform for translation of their user interface.

It would be nice to translate all of it, but because I don’t have time for that, I have to prioritize.

On my translatewiki.net user page I have a list of direct links to the translation interface of the projects that are the most important:

  • Core MediaWiki: the heart of it all
  • Extensions used by Wikimedia: the extensions on Wikipedia and related sites. This group is huge, and I prioritize it further; see below.
  • MediaWiki Action Api: the documentation of the API functions, mostly interesting to developers who build tools around Wikimedia projects
  • Wikipedia Android app
  • Wikipedia iOS app
  • Installer: MediaWiki’s installer, not used on Wikipedia because MediaWiki is already installed there, but useful for people who install their own instances of MediaWiki, in particular new developers
  • Intuition: a set of tools, like edit counters, statistics collectors, etc.
  • Pywikibot: a library for writing bots—scripts that make useful automatic edits to MediaWiki sites.

I usually don’t work on translating other projects unless all the above projects are 100% translated to Hebrew. I occasionally make an exception for OpenStreetMap or Etherpad, but only if there’s little to translate there and the untranslated MediaWiki-related projects are not very important.

Priorities, part 2

So how can you know what is important among more than 18,000 messages from the Wikimedia universe?

Start from MediaWiki most important messages. If your language is not at 100% in this list, it absolutely must be. This list is automatically created periodically by counting which 600 or so messages are actually shown most frequently to Wikipedia users. This list includes messages from MediaWiki core and a bunch of extensions, so when you’re done with it, you’ll see that the statistics for several groups improved by themselves.

Now, if the translation of MediaWiki core to your language is not yet at 18%, get it there. Why 18%? Because that’s the threshold for exporting your language to the source code. This is essential for making it possible to use your language in your Wikipedia (or Incubator). It will be quite easy to find short and simple messages to translate (of course, you still have to do it carefully and correctly).

Some technical notes

Have you read the general localization guide for Mediawiki? Read it again, and make sure you understand it. If you don’t, ask for help! The most important section, especially for new translators, is “Translation notes”.

A super-brief list of things that you should know:

  • Many messages use symbols such as ==, ===, [[]], {{}}, *, #, and so on. This is wiki syntax, also known as “wikitext” or “wiki markup”. It is recommended to become familiar with some wiki syntax by editing a few pages on another wiki site, such as Wikipedia, before translating MediaWiki messages at translatewiki.
  • “[[Special:Homepage]]” adds a link to the page “Special:Homepage”. “[[Special:Homepage|Homepage]]” adds a link to the page “Special:Homepage”, but it will be displayed as “Homepage”. In such cases, you are usually not supposed to translate the text before the | (pipe), but you should translate the text after it. For example, in Russian: “[[Special:Homepage|Домашняя страница]]”. When in doubt, check the documentation in the sidebar.
  • $1, $2, $3: These are known as parameters, placeholders, or variables. They are replaced in run time, usually by numbers of names. Copy them as they are, and put them in the right place in the sentence, where it is right for your language. Always check the documentation in the sidebar to understand with what will they be replaced.
  • If you see something like “$1 {{PLURAL:$1|page|pages}}” in a translatable message, this means that the word will be shown according to the value of the variable $1. Note that you must not change the “PLURAL:$1” part, but you must translate the “page|pages” part.
  • If you see something else in curly brackets, it’s probably a “magic word”. Check the documentation to understand it. You usually don’t translate the thing in the beginning, such as {{SITENAME, {{GENDER, etc., but you sometimes need to translate things towards the end. See the localization guide for full documentation!

Learn to use the project selector at the top of the translation interface. Projects are also known as “Message groups”. For example, each extension is a message group, and some larger extension, such as Visual Editor, are further divided into several smaller message groups. Using the selector is very simple: Just click “All” next to “Message group”, and use the search box to find the component that you want to translate, such as “Visual Editor” or “Wikibase”. Clicking on a message group will load the untranslated messages for that group.

The “Extensions used by Wikimedia” group is divided into several more subgroups. The important one is “Extensions used by Wikimedia – Main”, which includes the most commonly used extensions. Other subgroups are:

  • “Advanced”: extensions that are used only on some wikis, or are useful only to administrators and other advanced users. This should be the first subgroup you translate after you complete the “Main” subgroup.
  • “Fundraising”: extensions used for collecting donations for the Wikimedia Foundation.
  • “Legacy”: extensions that are still installed on Wikimedia sites, but are going to be removed. You can most likely skip this subgroup completely.
  • “Media” includes advanced tools for media files curating and uploading, especially on Wikimedia Commons.
  • “Technical”: this is mostly API documentation for various extensions, which is shown on the ApiHelp and ApiSandbox special pages. It is very useful for developers of gadgets, bots, and other software, but not necessary for other users. This group also includes several other very advanced extensions that are used only by a few people. You should translate these messages some day, but it’s OK to do it later.
  • “Upcoming”: these are extensions that are not yet widely installed on Wikimedia sites, but are going to be installed soon. Translating them is a pretty good idea, because they are usually very new, and may include some confusing messages. The earlier you report these confusing messages to the developers, the better!
  • “Wikivoyage”: extensions used only on Wikivoyage sites. Translate them if there is a Wikivoyage site in your language, or if you want to start one.

There is also a group called “EXIF Tags”. It’s an advanced part of core MediaWiki. It mostly includes advanced photography terminology, and it shows information about photographs on Wikimedia Commons. If you are not sure how to translate these messages, ask a professional photographer. In any case, it’s OK to do it later, after you completed more important components.

Getting things done, one by one

Once you have the most important MediaWiki messages 100% and at least 18% of MediaWiki core is translated to your language, where do you go next?

I have surprising advice.

You need to get everything to 100% eventually. There are several ways to get there. Your mileage may vary, but I’m going to suggest the way that worked for me: Complete the easiest piece that will get your language closer to 100%! For me this is an easy way to remove an item off my list and feel that I accomplished something.

But still, there are so many items at which you could start looking! So here’s my selection of components that are more user-visible and less technical. The list is ordered not by importance, but by the number of messages to translate (as of October 2020):

  • Vector: the default skin for desktop and laptop computers
  • Minerva Neue: the skin for mobile phones and tablets
  • Babel: for displaying boxes on user pages with information about the languages that the user knows
  • Discussion Tools: for making the use of talk pages easier
  • Thanks: the extension for sending “thank you” messages to other editors
  • Universal Language Selector: the extension that lets people easily select the language they need from a long list of languages (disclaimer: I am one of its developers)
  • jquery.uls: an internal component of Universal Language Selector that has to be translated separately (for technical reasons)
  • Cite: the extension that displays footnotes on Wikipedia
  • Math: the extension that displays math formulas in articles
  • Wikibase Client: the part of Wikidata that appears on Wikipedia, mostly for handling interlanguage links
  • ProofreadPage: the extension that makes it easy to digitize PDF and DjVu files on Wikisource (this is relevant only if there is a Wikisource site in your language, or if you plan to start one)
  • Wikibase Lib: additional messages for Wikidata
  • WikiEditor: the toolbar for the wiki syntax editor
  • Echo: the extension that shows notifications about messages and events (the red numbers at the top of Wikipedia)
  • MobileFrontend: the extension that adapts MediaWiki to mobile phones
  • ContentTranslation: the extension that helps to translate Wikipedia articles between languages (disclaimer: I am one of its developers)
  • UploadWizard: the extension that helps people upload files to Wikimedia Commons comfortably
  • Translate: the extension that powers translatewiki.net itself (disclaimer: I am one of its developers)
  • Page Translation: the component of the Translate extension that helps to translate wiki pages (other than Wikipedia articles)
  • Wikibase Repo: the extension that powers the Wikidata website
  • VisualEditor: the extension that allows Wikipedia articles to be edited in a WYSIWYG style
  • Wikipedia Android mobile app
  • Wikipedia iOS mobile app
  • Wikipedia KaiOS mobile app
  • MediaWiki core: the base MediaWiki software itself!

I put MediaWiki core last intentionally. It’s a very large message group, with over 3000 messages. It’s hard to get it completed quickly, and actually, some of its features are not seen very frequently by users who aren’t site administrators or very advanced editors. By all means, do complete it, try to do it as early as possible, and get your friends to help you, but it’s also OK if it takes some time.

Getting all the things done

OK, so if you translate all the items above, you’ll make Wikipedia in your language mostly usable for most readers and editors. But let’s go further.

Let’s go further not just for the sake of seeing pure 100% in the statistics everywhere. There’s more.

As I wrote above, the software changes every single day. So do the translatable messages. You need to get your language to 100% not just once; you need to keep doing it continuously.

Once you make the effort of getting to 100%, it will be much easier to keep it there. This means translating some things that are used rarely (but used nevertheless; otherwise they’d be removed). This means investing a few more days or weeks into translating-translating-translating.

You’ll be able to congratulate yourself not only upon the big accomplishment of getting everything to 100%, but also upon the accomplishments along the way.

One strategy to accomplish this is translating extension by extension. This means, going to your translatewiki.net language statistics: here’s an example with Albanian, but choose your own language. Click “expand” on MediaWiki, then again “expand” on “MediaWiki Extensions” (this may take a few seconds—there are lots of them!), then on “Extensions used by Wikimedia” and finally, on “Extensions used by Wikimedia – Main”. Similarly to what I described above, find the smaller extensions first and translate them. Once you’re done with all the Main extensions, do all the extensions used by Wikimedia. This strategy can work well if you have several people translating to your language, because it’s easy to divide work by topic. (Going to all extensions, beyond Extensions used by Wikimedia, helps users of these extensions, but doesn’t help Wikipedia very much.)

Another fun strategy is quiet and friendly competition with other languages. Open the statistics for Extensions Used by Wikimedia – Main and sort the table by the “Completion” column. Find your language. Now translate as many messages as needed to pass the language above you in the list. Then translate as many messages as needed to pass the next language above you in the list. Repeat until you get to 100%.

For example, here’s an excerpt from the statistics for today:

Let’s say that you are translating to Georgian. You only need to translate 37 messages to pass Marathi and go up a notch (2555 – 2519 + 1 = 37). Then 56 messages more to pass Hindi and go up one more notch (2518 – 2463 + 1 = 56). And so on.

Once you’re done, you will have translated over 5600 messages, but it’s much easier to do it in small steps.

Once you get to 100% in the main extensions, do the same with all the Extensions Used by Wikimedia. It’s way over 10,000 messages, but the same strategies work.

Good stuff to do along the way

Invite your friends! You don’t have to do it alone. Friends will help you work more quickly and find translations to difficult words.

Never assume that the English message is perfect. Never. Do what you can to improve the English messages. Developers are people just like you are. There are developers who know their code very well, but who are not the best writers. And though some messages are written by professional user experience designers, many are written by the developers themselves. Developers are developers; they are not necessarily very good writers or designers, and the messages that they write in English may not be perfect. Also, keep in mind that many, many MediaWiki developers are not native English speakers; a lot of them are from Russia, Netherlands, India, Spain, Germany, Norway, China, France and many other countries. English is foreign to them, and they may make mistakes.

So if anything is hard to translate, of if there are any other problems with the English messages to the translatewiki Support page. While you are there, use the opportunity to help other translators who are asking questions there, if you can.

Another good thing is to do your best to try using the software that you are translating. If there are thousands of messages that are not translated to your language, then chances are that it’s already deployed in Wikipedia and you can try it. Actually trying to use it will help you translate it better.

Whenever relevant, fix the documentation displayed near the translation area. Strange as it may sound, it is possible that you understand the message better than the developer who wrote it!

Before translating a component, review the messages that were already translated. To do this, click the “All” tab at the top of the translation area. It’s useful for learning the current terminology, and you can also improve them and make them more consistent.

After you gain some experience, create or improve a localization guide in your language. There are very few of them at the moment, and there should be more. Here’s the localization guide for French, for example. Create your own with the title “Localisation guidelines/xyz” where “xyz” is your language code.

As in Wikipedia itself, Be Bold.

OK, so I got to 100%, what now?

Well done and congratulations.

Now check the statistics for your language every day. I can’t emphasize enough how important it is to do this every day. If not every day, then as frequently as you can.

The way I do this is having a list of links on my translatewiki.net user page. I click them every day, and if there’s anything new to translate, I immediately translate it. Usually there are just a few new messages to translate; I didn’t measure precisely, but usually it’s fewer than 20. Quite often you won’t have to translate from scratch, but to update the translation of a message that changed in English, which is usually even faster.

But what if you suddenly see 200 new messages to translate or more? It happens occasionally. Maybe several times a year, when a major new feature is added or an existing feature is changed. Basically, handle it the same way you got to 100% before: step by step, part by part, day by day, week by week, notch by notch, and get back to 100%.

But you can also try to anticipate it. Follow the discussions about new features, check out new extensions that appear before they are added to the Extensions Used by Wikimedia group, consider translating them when you have a few spare minutes. At the worst case, they will never be used by Wikimedia, but they may be used by somebody else who speaks your language, and your translations will definitely feed the translation memory database that helps you and other people translate more efficiently and easily.

Consider also translating other useful projects: OpenStreetMap, Etherpad, Blockly, Encyclopedia of Life, etc. Up to you. The same techniques apply everywhere.

What do I get for doing all this work?

The knowledge that thanks to you, people who read in your language can use Wikipedia without having to learn English. Awesome, isn’t it? Some people call it “Good karma”. Also, the knowledge that you are responsible for creating and spreading the terminology in your language for one of the most important and popular websites in the world.

Oh, and you also get enormous experience with software localization, which is a rather useful and demanded job skill these days.

Is there any other way in which I can help?

Yes!

If you find this post useful, please translate it to other languages and publish it in your blog. No copyright restrictions, public domain (but it would be nice if you credit me and send me a link to your translation). Make any adaptations you need for your language. It took me years of experience to learn all of this, and it took me about four hours to write it. Translating it will take you much less than four hours, and it will help people be more efficient translators.

Thanks!

Disease of Familiarity, the Flaw of Wikipedia

Originally written as an answer to the question What are some major flaws in Wikipedia? on Quora. Republished here with some changes.

Wikipedia has a whole lot of flaws, and its basic meta-flaw is the disease of familiarity.

It does not mean what you think it means. The disease of familiarity is knowing so much about something that you don’t understand what it is like to not understand it.

I recognized this phenomenon in 2011 or so, and called it The Software Localization Paradox. I later realized that it has a lot of other aspects beyond software localization, so I thought a lot about it and struggled for years with giving it a name. I learned about the term “disease of familiarity” from Richard Saul Wurman, best known as the creator of the TED conference (see a note about it at the end of this post). Some other names for this phenomenon are “curse of knowledge” and “mind blindness”. See also Is there a name for “knowing so much about something that you don’t understand what is it like not to know it”?

Unfortunately, none of these terms is very famous, and their meaning is not obvious without some explanation. What’s even worse, the phenomenon is in general hard to explain because of its very nature. But I’ll try to give a few examples.


Wikipedia doesn’t make it easy for people to understand its jargon.

Wikipedia calls itself “The Free encyclopedia”; what does it mean that it’s “free”? I wrote Wikipedia:The Free Encyclopedia, one of the essays on this topic (there are others), but it’s not official or authoritative, and more importantly, the fact that this essay exists doesn’t mean that everybody who starts writing for Wikipedia reads it and understands the ideology behind it, and its implications. An important implication of this ideology is that according to the ideology of the Free Culture movement, of which Wikipedia is a part, is that some images and pieces of text can be copied from other sites into Wikipedia, and some cannot. The main reason for this is copyright law. People often copy text or images that are not compatible with the policies, and since this is heavily enforced by experienced Wikipedia editors, this causes misunderstandings. Wikipedia’s interface could communicate these policies better, but experienced Wikipedians, who already know them, rarely think about this problem. Disease of familiarity.

Wikipedia calls itself “a wiki”. A lot of people think that it’s just a meaningless catchy brand name, like “Kodak”. Some others think that it refers to the markup language in which the site is written. Yet others think that it’s an acronym that means “what I know is”. None of these interpretations is correct. The actual meaning of “wiki” is “a website that anyone can edit”. The people who are experienced with editing Wikipedia know this, and assume that everybody else does, but the truth is that a lot of new people don’t understand it and are afraid of editing pages that others had written, or freak out when somebody edits what they had written. Disease of familiarity.

The most common, built-in way for communication between the different Wikipedians is the talk page. Only Wikipedia and other sites that use the MediaWiki software use the term “talk page”. Other sites call such a thing “forum”, “comments”, or “discussion”. (To make things more confusing, Wikipedia itself occasionally calls it “discussion”.) Furthermore, talk pages, which started on Wikipedia in 2001, before commenting systems like Disqus, phpBB, Facebook, or Reddit were common, work in a very weird way: you need to manually indent each of your posts, you need to manually sign your name, and you need to use a lot of obscure markup and templates (“what are templates?!”, every new user must wonder). Experienced editors are so accustomed to doing this that they assume that everybody knows this. Disease of familiarity.

A lot of pages in Wikipedia in English and in many other languages have infoboxes. For example, in articles about cities and towns there’s an infobox that shows a photo, the name of the mayor, the population, etc. When you’re writing an article about your town, you’ll want to insert an infobox. Which button do you use to do this? There’s no “Infobox” button, and even if there were, you wouldn’t know that you need to look for it because “Infobox” is a word in Wikipedia’s internal jargon. What you actually have to do is Insert → Template → type “Infobox settlement”, and fill a form. Every step here is non-intuitive, especially the part where you have to type the template’s name. Where are you supposed to know it from? Also, these steps are how it works on the English Wikipedia, and in other languages it works differently. Disease of familiarity.

And this brings us to the next big topic: Language.

You see, when I talk about Wikipedia, I talk about Wikipedia in all languages at once. Otherwise, I talk about the English Wikipedia, the Japanese Wikipedia, the Arabic Wikipedia, and so on. Most people are not like me: when they talk about Wikipedia, they talk about the one in the language in which they read most often. Quite often it’s not their first language; for example, a whole lot of people read the Wikipedia in English even though English is their second language and they don’t even know that there is a Wikipedia in their own language. When these people say “Wikipedia” they actually mean “the English Wikipedia”.

There’s nothing bad in it by itself. It’s usually natural to read in a language that you know best and not to care very much about other languages.

But here’s where it gets complicated: Technically, there are editions of Wikipedia in about 300 languages. This number is pretty meaningless, however: There are about 7,000 languages in the world, so not the whole world is covered, and only in 100 languages or so there is a Wikipedia in which there is actually some continuous writing activity. In the other 200 the activity is only sporadic, or there is no activity at all—somebody just started writing something in that language, and a domain was created, but then the first people who started it lost interest and nobody else came to continue their work.

This is pretty sad because it’s frequently forgotten that a whole lot of people cannot read what they want in Wikipedia because they don’t know a language in which there is an article about what they want to learn. If you are reading this post, you have the privilege of knowing English, and it’s hard for you to imagine how does a person who doesn’t know English feel. Disease of familiarity: You think you can tell everybody “if you want to know something, read about it in Wikipedia”, but you cannot actually tell this to most people because most people don’t know English.

The missed opportunity becomes even more horrific when you realize that the people who would have the most appropriate skills for breaking out of this paradox are the people who are least likely to notice it, and the people who are hurt by it the most are the least capable of fixing it themselves. Think about it:

  • If you know, for example, Russian and English, and you need to read about a topic on which there is an article in the English Wikipedia, but not in Russian, you can read the English Wikipedia, and it’s possible that you won’t even notice that an article in Russian doesn’t exist. Unless you exercise mindfulness about the issue, you won’t empathize with people who don’t know English. To break out of this cycle, one can practice the following:
    • Always look for articles in Russian first.
    • Dedicate some time every week to translating articles. (See How does Wikipedia handle page translation?)
    • When you talk to people in your language, don’t assume that they know English.
  • A person who doesn’t know English is just stuck without an article, and there’s not much to do. It’s possible that you don’t even know that the article you need exists in another language. And maybe you cannot even read the user manual that teaches you how to edit. What can you do?
    • Try to be bold and ask your friends who do know English to translate it for you and publish the translation for the benefit of all the people who speak your language.
    • (Of course, there’s the solution of learning English, but we can’t assume that it works. Evidently, there are billions of people who don’t know English, and they won’t all learn English any time soon.)

(In case it isn’t clear, you can replace “English” and “Russian” in the example above with any other pair of languages.)

It’s particularly painful in countries where English, French, or Portuguese is the dominant language of government and education, even though a lot of the people, often the majority, don’t actually know it. This is true for many countries in Africa, as well as for Philippines, and to a certain extent also in India and Pakistan.

People who know English have a very useful aid for their school studies in the form of Wikipedia. People who don’t know English are left behind: the teachers don’t have Wikipedia to get help with planning the lessons and the students don’t have Wikipedia to get help with homework. The people who know English and study in English-medium schools have these things and don’t even notice how the other people—often their friends!—are left behind. Disease of familiarity.

Finally, most of the people who write in the 70 or so most successful Wikipedias don’t quite realize that the reason the Wikipedia in their language is successful is that before they had a Wikipedia, they had had another printed or digital encyclopedia, possibly more than one; and they had public libraries, and schools, and universities, and all those other things, which allowed them to imagine quite easily how would a free encyclopedia look like. A lot of languages have never had these things, and a Wikipedia would be the first major collection of educational materials in them. This would be pretty awesome, but this develops very slowly. People who write in the successful Wikipedia projects don’t realize that they just had to take the same concepts they already knew well and rebuild them in cyberspace, without having to jump through any conceptual epistemological hoops.

Disease of familiarity.


It’s hard to explain this.

I unfortunately suspect that very few, if any, people will understand this boring, long, and conceptually difficult post. If you disagree, please comment. If you think that you understand what I’m trying to say, but you have a simpler or shorter way to say it, please comment or suggest an edit (and tell your friends). If you have more examples of the disease of familiarity in Wikipedia and elsewhere, please speak up.

Thank you.


(As promised above, a note about Richard Saul Wurman. I heard him introduce the “disease of familiarity” concept in an interview with Debbie Millman on her podcast Design Matters, at about 23 minutes in. That interview was one of this podcast’s weirdest episodes: you can clearly hear that he’s making Millman uncomfortable, and she also mentioned it on Twitter. This, in turn, makes me uncomfortable to discuss something I learned from that interview, but I am just unable to find any better terminology for the phenomenon in question. If you have suggestions, please send them my way.)


Disclaimer: I’m a contractor working with the Wikimedia Foundation, but this post, as well as all my other posts on the topic of Wikimedia, Wikipedia, and related projects, are my own opinions and do not represent the Wikimedia Foundation.

The Curious Problem of Belarusian and Igbo in Twitter and Bing Translation

Twitter sometimes offers machine translation for tweets that are not written in the language that I chose in my preferences. Usually I have Hebrew chosen, but for writing this post I temporarily switched to English.

Here’s an example where it works pretty well. I see a tweet written in French, and a little “Translate from French” link:

Emmanuel Macron on Twitter.png

The translation is not perfect English, but it’s good enough; I never expect machine translation to have perfect grammar, vocabulary, and word order.

Now, out of curiosity I happen to follow a lot of people and organizations who tweet in the Belarusian language. It’s the official language of the country of Belarus, and it’s very closely related to Russian and Ukrainian. All three languages have similar grammar and share a lot of basic vocabulary, and all are written in the Cyrillic alphabet. However, the actual spelling rules are very different in each of them, and they use slightly different variants of Cyrillic: only Russian uses the letter ⟨ъ⟩; only Belarusian uses ⟨ў⟩; only Ukrainian uses ⟨є⟩.

Despite this, Bing gets totally confused when it sees tweets in the Belarusian language. Here’s an example form the Euroradio account:

Еўрарадыё   euroradio    Twitter double.pngBoth tweets are written in Belarusian. Both of them have the letter ⟨ў⟩, which is used only in Belarusian, and never in Ukrainian and Russian. The letter ⟨ў⟩ is also used in Uzbek, but Uzbek never uses the letter ⟨і⟩. If a text uses both ⟨ў⟩ and ⟨і⟩, you can be certain that it’s written in Belarusian.

And yet, Twitter’s machine translation suggests to translate the top tweet from Ukrainian, and the bottom one from Russian!

An even stranger thing happens when you actually try to translate it:

Еўрарадыё   euroradio    Twitter single Russian.pngNotice two weird things here:

  1. After clicking, “Ukrainian” turned into “Russian”!
  2. Since the text is actually written in Belarusian, trying to translate it as if it was Russian is futile. The actual output is mostly a transliteration of the Belarusian text, and it’s completely useless. You can notice how the letter ⟨ў⟩ cannot be transliterated.

Something similar happens with the Igbo language, spoken by more than 20 million people in Nigeria and other places in Western Africa:

 4  Tweets with replies by Ntụ Agbasa   blossomozurumba    Twitter.png

This is written in Igbo by Blossom Ozurumba, a Nigerian Wikipedia editor, whom I have the pleasure of knowing in real life. Twitter identifies this as Vietnamese—a language of South-East Asia.

The reason for this might be that both Vietnamese and Igbo happen to be written in the Latin alphabet with addition of diacritical marks, one of the most common of which is the dot below, such as in the words ibụọla in this Igbo tweet, and the word chọn lọc in Vietnamese. However, other than this incidental and superficial similarity, the languages are completely unrelated. Identifying that a text is written in a certain language only by this feature is really not great.

If I paste the text of the tweet, “Nwoke ọma, ibụọla chi?”, into translate.bing.com, it is auto-identified as Italian, probably because it includes the word chi, and a word that is written identically happens to be very common in Italian. Of course, Bing fails to translate everything else in the Tweet, but this does show a curious thing: Even though the same translation engine is used on both sites, the language of the same text is identified differently.

How could this be resolved?

Neither Belarusian nor Igbo languages are supported by Bing. If Bing is the only machine translation engine that Twitter can use, it would be better to just skip it completely and not to offer any translation, than to offer this strange and meaningless thing. Of course, Bing could start supporting Belarusian; it has a smaller online presence than Russian and Ukrainian, but their grammar is so similar, that it shouldn’t be that hard. But what to do until that happens?

In Wikipedia’s Content Translation, we don’t give exclusivity to any machine translation backend, and we provide whatever we can, legally and technically. At the moment we have Apertium, Yandex, and YouDao, in languages that support them, and we may connect to more machine translation services in the future. In theory, Twitter could do the same and use another machine translation service that does support the Belarusian language, such as Yandex, Google, or Apertium, which started supporting Belarusian recently. This may be more a matter of legal and business decisions than a matter of engineering.

Another thing for Twitter to try is to let users specify in which languages do they write. Currently, Twitter’s preferences only allow selecting one language, and that is the language in which Twitter’s own user interface will appear. It could also let the user say explicitly in which languages do they write. This would make language identification easier for machine translation engines. It would also make some business sense, because it would be useful for researchers and marketers. Of course, it must not be mandatory, because people may want to avoid providing too much identifying information.

If Twitter or Bing Translation were free software projects with a public bug tracking system, I’d post this as a bug report. Given that they aren’t, I can only hope that somebody from Twitter or Microsoft will read it and fix these issues some day. Machine translation can be useful, and in fact Bing often surprises me with the quality of its translation, but it has silly bugs, too.

Five More Privileges of English Speakers, part 2: Language and Software

For the previous part in the series, see Five Privileges of English Speakers, part 1.

I’m continuing the series of posts in each of which I write about five privileges that English speakers have without giving it a lot of thought. The examples I give mostly come from my experience translating software, Wikipedia articles, blog posts, and some other texts between English, Hebrew, and Russian. Hebrew and Russian are the languages I know best. If you have interesting examples from other languages, I am very interested in hearing them and writing about them.

I’m writing them mostly as they come into my mind, without a particular order, but the five items in this part of the series will focus on usage of the English language in software, and try to show that the dominance of English is not only a consequence of economics and history, but that it’s further reinforced by features of the language itself.

1. Software usually begins its life in English

English is the main language of software development worldwide.

The world’s best-known place for software development is Silicon Valley, an English-speaking place. That’s the place of Facebook, Google, Apple, Oracle and many others. California is also the home of Adobe.

There are several other hubs of software development in United States: Seattle (Microsoft, Amazon), North Carolina (Red Hat), New York (IBM, CA), Massachusets (TripAdvisor, Lotus, RSA), and more. The U.S. is also the source for much of computer science research and education, coming from Berkeley, MIT, and plenty of other schools. The U.S. is also the birthplace of the Internet, originally supported by the U.S. Department of Defense and several American universities. The world wide web, which brought the Internet to the masses, was created in Switzerland by an English speaker.

Software is developed in other countries—India, Russia, Israel, France, Germany, Estonia, and many other countries. But the dominance of the U.S. and of the English language is clear. The reason for this is not only that the U.S. is the source for much of computer technologies, but also—and probably more importantly—that the U.S. is the biggest consumer market for software. So developers in all countries tend to optimize the product for the highest-paying consumers, and these only need English.

When engineers write the user interface of their software in English, they often do not give any thought to other languages at all, or make translation possible, but complicated by English-centric assumptions about number, gender, text direction, text size, personal names, and plenty of other things, which will be explored in further points.

2. Terminology

English is also the source for much of the computer world’s terminology. Other languages have to adapt terms like smartphone, network, token, download, authentication, and thousands of others.

Some language communities work hard to translate them all meticulously into native words; Icelandic, Lithuanian, French, Chinese, and Croatian are famous examples. This is nice, but requires effort on behalf of terminology committees, who need to keep up with the fast pace of technological development, and on behalf of the software translators, who have to keep with the committees.

Some just transliterate most of them: keep the term essentially in English, but rewritten in the native alphabet. Hindi and Japanese are examples of that. This seems easy, but it is based on a problematic assumption: that the target language speakers who will use the software know at least some English! This assumption is correct for the translators, who don’t just know the English terms, but are probably also quite accustomed to it, but it’s not necessarily correct for the end users. Thus, the privilege is perpetuated.

Some languages, such as Hebrew, German, and Russian, are mid-way, with language academics and purists pulling to purer native language, engineers pulling to more English-based words, and the general public settling somewhere in between—accepting the neologisms for some terms, and going for English-based words for others.

For the non-English languages it provides fertile ground for arguments between purists and realists, in which the needs of the actual users are frequently forgotten. All the while, English speakers are not even aware of all this.

3. Easy binary logic word formation

One particular area of computer terminology is binary logic. This sounds complicated, but it’s actually simple: in electronics and software opposite notions such as true / false, success / failure, OK / Cancel, and so forth, are very common.

This translates to a great need for words that express opposites: enable / disable, do / undo, log in / log out, delete / undelete, block / unblock, select / deselect, online / offline, connect / disconnect, read / unread, configured / misconfigured.

Notice something? All of the above words are formed with the same root, with the addition of a prefix (un-, dis-, de-, mis-, a-), or with the words “on” and “off”.

A distinct, but closely related need, is words for repetition. Computers are famously good at doing things again and again, and that’s where the prefix re- is handy: reconnect, retry, redo, retransmit.

These features happen to be conveniently built into the English language. While English has extremely simple morphology for declension and conjugation (see the section “Spell-checking” in part 1 of the series), it has a slightly more complex morphology for word formation, but it’s still fairly easy.

It is also productive. That is, a software developer can create new words using it. For example, the MediaWiki software has the concept of “oversight”—hiding a problematic page in such a way that only users with a particular permission can read it. What happens if a page was hidden by mistake? Correct: “unoversight”. This word doesn’t quite exist elsewhere, but it doesn’t sound incorrect, because familiar English word formation rules were used to coin it.

As it always happens, English-speaking software engineers either don’t think about it at all, or think that other languages also have similar word formation rules. If you haven’t guessed it already, it is not true. Sime other European languages have similar constructs, but not necessarily as consistent as in English. And for Semitic languages like Hebrew it’s a disaster, because in Semitic languages prefixes are used for entirely different things, and the grammar doesn’t have constructs for repetition and negation. So when translating software user interface strings into Hebrew, we have to use different words as opposites. For example the English pair connect / disconnect is translated as lehitḥabér / lehitnaték—completely different roots, which Hebrew is just lucky to have. Another option is to use negative words like lo and bilti, or bitul, but they are often unnatural or outright wrong. Having to deal with something like “Mark as unread” is every Hebrew software translator’s nightmare, even though it sounds pretty straightforward in English.

English itself also has pairs of negative words that are not formed using the above prefixes, for example next / previous and open / close, but in many other languages they are much more common.

4. Verbing

“Verbing weirds language”, as one of the famous Calvin and Hobbes panels says.

Despite being a funny joke in the comic, it’s a real feature of the English language: because of how English morphology and syntax work, nouns can easily jump into the roles of adjectives and verbs without changing the way they are written.

For English, this is a useful simplification, and it works in labeling, as well as in advertising. “Enjoy Coca-Cola” is something more than an imperative. The fact that it’s a short single word and that it’s the same in all genders and numbers, makes it more usable as a call to action than it would be in other languages. And, other than advertising, where are calls to action very common? Software, of course. When you’re trying to tell a user to do something, a word that happens to be both the abstract concept and the imperative is quite useful.

Perhaps the most famous example of this these days is Facebook’s “Like”. Grammatically, what is it in English? Imperative? A noun describing an abstract action? Maybe a plain old noun, as in “chasing likes” (this is a plural noun—English verb don’t have a plural form!)? Answer: it’s all of them and more.

When translated to Hebrew in Facebook’s interface, it’s Ahávti, which literally means “I loved it”. Actually, this translation is mostly good, because it’s understandable, idiomatic, and colloquial enough without compromising correctness. Still, it’s a verb, which is not imperative, and it’s definitely not a noun, so you cannot use it in a sentence as if it was a noun. Indeed, Hebrew speakers are comfortable using this button, but when they speak and write about this feature, they just use its English name: “like” (in plural láykim). It even became a slightly awkward, but commonly used verb: lelaykék. Something similar happens in Russian.

It would be impossible in Hebrew and Russian to use the exact same word for the noun and the verb, especially in different persons and genders. Sometimes the languages are lucky enough to be able to adapt an English verb in a way that is more or less natural, but sometimes it’s weird, and hurts the user experience.

5. Word length

This one is relatively simple and not unique to English, but should be mentioned anyway: English words are neither very long, nor very short. Examples of languages where words are, on average, longer than in English, are Finnish, Tamil, German, and occasionally Russian. Hebrew tends to be shorter, although sometimes a single English word has to be translated with several Hebrew words, so it can get also get longer. This is true for a pretty much any language, really.

In designing interfaces, especially for smaller screens, the length of the text is often important. If a button label is too long, it may overflow from the button, or be truncated, making the display ugly, or unusable, or both.

If you’re an English speaker, it probably won’t happen with you, because almost all software is usually designed with the word length of your language in mind. Other languages are almost always an afterthought.

The good practice for software engineers and designers is to make sure that translated strings can be longer. Their being shorter is rarely a problem, although sometimes a string is so short that the button may become to small to click or tap conveniently.


Generally, what can you do about these privileges?

Whoever you are, remember it. If you know English, you are privileged: Software is designed more for you than for people who speak other languages.

If you are a software engineer or a designer, at the very least, make your software translatable. Try to stick to good internationalization practices and to standards like Unicode and CLDR. Write explanations for every translatable string in as much detail as possible. Listen to users’ and translators’ complaints patiently—they are not whining, they are trying to improve your software! The more internationalizable it is, the more robust it is for you as a developer, and for your English-speaking users, too, because better design thinking will be going into each of its components, and less problematic assumptions will be made.

Five Privileges of English Speakers, part 1

It’s very common today on progressive blogs to urge people to check their privilege.

Being an English speaker, native or non-native, is a privilege.

It’s not as often as discussed as other forms of privilege, such as white, male, cis, hetero, or rich privilege. The reason for this is simple: The world’s media is dominated by the English language. English-language movies are more popular in many countries than movies in these countries’ own languages, English-language news networks are quoted by the rest of the world, the world’s most popular social networks are based in the U.S. and are optimized for U.S. audiences, etc.

So, when English speakers discuss privilege among each other, English is not much of an issue, and they dedicate more time to race, gender, wealth, religion, and other factors that differentiate between people in English-speaking countries.

Despite this, I am not the first one to describe English as a privilege. A simple Google search for english language privilege will yield many interesting results.

What I do want to try to do in this series of posts is to list the particular nuances that make English such a privilege in as much detail as possible. I wanted to write this for a long time, but there are many such nuances, so I’ll just do it in batches of five, in no particular order:

1. Keyboard

If you speak English, congratulations: A keyboard on which your language can be written is available on all electronic devices.

All of them.

All desktops, laptops, phones, tablets, watches. The only notable exception I can think of is typewriters, which only makes the point more tragic: technology moved forward and made writing easier in English, but harder in many other languages, where local-language typewriters were replaced with computers with English-only keyboard.

At the very worst case, writing English on a computer will be slightly inconvenient in countries like Germany, France, or Turkey, where the placement of the Latin letters on the keys is slightly different from the U.S. and U.K. QWERTY standard. Oh, poor American tourists.

On a more serious note, though, even though a lot of languages use the Latin alphabet, a lot of them also use a lot of extra diacritics and special characters, and English is one of the very few that doesn’t. Of the top 100 world’s languages by native speakers, only Malay, Kinyarwanda, Somali, and Uzbek have standardized orthographies that can be written in the basic 26-letter Latin alphabet without any extra characters. We can also add Swahili, which has a large number of non-native speakers, but that’s it. With other languages you can get stuck and not be able to write your language at all (Hindi, Chinese, Russian, etc.), or you may have to write in a substandard orthography because you can’t type letters like é or ł (French, Vietnamese, Polish, etc.).

The above is just the teeny-tiny tip of the iceberg; the keyboard problem will be explored in more points later.

2. Spell-checking

English word morphology is laughably simple.

There’s -s for plurals and for third person present tense verbs, there’s -‘s for possession, and there are -ed and -ing verb forms. There are also some contractions (‘d, ‘s, ‘ll, ‘ve), and a long, but finite list of irregular verb forms, and an even shorter list of irregular plural noun forms. And that’s it.

Most languages aren’t like that. In most languages words change with prefixes, suffixes, infixes, clitics, and so on, according to their role in the sentence.

Beyond the fact that English writing is (arguably) easier for children and foreigners to learn, this means that software tools for processing a language are easy to develop for English and hard to develop for other languages.

The first simple example is spell-checking.

English has had not just spelling, but also grammar and style checkers built into common word processors for decades, and many languages of today don’t even have spelling checkers, not to mention grammar, or style, or convenient searching. (See below.)

So in English, when you type “kinh”, most word processors will suggest correcting it to “king”, but then, some of them may also suggest replacing this word with “monarch” to be more inclusive for women, and this is just one of the hundreds of style improvement suggestions that these tools can make. For a lot of other languages, even simple spell-checking of single words hasn’t been developed yet, and grammar checking is a barely-imaginable dream.

3. Autocompletion

Simpler morphology has many other effects.

Even though Russian is my first native language and I speak it more fluently than I speak English, I am much slower when I’m typing in Russian on my phone. In English, the autocompleting keyboard makes it possible to write just two or three letters of a word and let the software complete the rest. In Russian, the ending of the word must be typed, and autocompletion rarely guesses it correctly. Typing an incorrect ending will make a sentence convey incorrect information, or just make it completely ungrammatical.

4. Searching

A yet-another issue of the previous point, English’s very simple morphology makes searching easier.

For example word processors have a search and replace function. For English, it will likely find all forms of the word, because there are so few of them anyway. But in Hebrew and Arabic, letters are often inserted or changed in the middle of the word according to its grammatical state, and you need to search for each form, which is quite agonizing. It’s comparable to “man” vs. “men” in English, except that in English such changes are very rare, while in many other languages it happens in almost every word.

With search engines that must find words across thousands of documents it gets even harder. Google can easily figure out that if you’re searching for “drive”, you may also be interested in “driving”, “drove”, and “driven”, but Russian has dozens of other forms for this word. A few languages are lucky: special support was developed for them in search engines, and tasks of this kind are automated, but most languages our just out in the cold. But English barely needs extra support like this in the first place.

5. Very little gender

A lot can be said about gendered language, but as far as basic grammar goes, English has very little in the area of gender. “He” and “She”, and that’s about it. There are also man/woman, actor/actress, boy/girl, etc., but these distinctions are rarely relevant in technology.

In many other languages gender is far more pervasive. In Semitic and Slavic languages, a lot of verb forms have gender. In English, the verb “retweeted” is the same in “Helen retweeted you” and “Michael retweeted you”, but in Hebrew the verb is different. Because Twitter doesn’t know that Helen needs a different verb, it uses the masculine verb there, which sounds silly to Hebrew speakers.

I asked Twitter developers about this many times, and they always replied that there’s no field for gender in the user profile. It becomes more and more amusing lately, now that it has become so common —and for good reasons!— to mention what one’s preferred pronouns are in the Twitter profile bio. So people see it, but computers don’t.

On a more practical note, in the relatively rare cases when third person pronouns must be used in software strings, English will often use the singular “they” instead of “he” or “she”. So English-speaking developers do notice it, but not as often as they should, and when they do, they just use the lazy singular-they solution, which is socially acceptable and doesn’t require any extra coding. If only they’d notice it more often, using their software in other languages would be much more convenient for people of all genders.

The only software packages that I know that have reasonably good support for grammatical gender are MediaWiki and Facebook’s software. I once read that Diaspora had a very progressive solution for that, but I don’t know anybody who actually uses it. There may be other software packages that do, but probably very few.


These are just the first five examples of English-language privilege I can think of. There will be many, many more. Stay tuned, and send me your ideas!

Doing Mobile App Localization Right: Baby Daybook

When we became parents in 2014, my wife searched for a mobile app to help manage baby care: feeding, sleep, diaper changing, and other activities.

After trying a few apps, she settled on Baby Daybook by Drilly Apps. It indeed turned out to be very convenient for parenting in the first few months, but here I wanted to praise its developers for doing localization right:

  • The app is translatable at OneSky. It’s one of many other localization sites existing today. It’s probably less famous than Crowdin and Transifex, but pretty fine functionally, and I have not particular complaints about it. (And of course, it’s a bit of a competitor of translatewiki.net, where I am one of the maintainers, but that’s OK—competition is healthy.)
  • Any volunteer can log in with a GitHub account and start translating to any language.
  • Translation review is available, but not required. Submitted translations go into the next released version whether reviewed or not. This is good, because bad translations are actually very rare, and many languages have very few dedicated translation maintainers, often just one, so demanding translation review is usually just harmful and unnecessary overhead.
  • The developers quickly answer my emails when I ask them for clarification about string meanings, and when I suggest changes in the English string. Recently I suggested changing “Thousands of happy moms use this app to track breastfeedings and sync data” to “Thousands of happy parents use this app to track breastfeedings and sync data”, and they immediately changed it.
  • App descriptions for the Google Play store are fully translatable. This may seem like an obvious thing to do, but almost all apps use a machine translation, which is almost always wrong, often embarrassingly so. I saw some very popular apps, used by millions of people, having Hebrew and Russian descriptions that are worse than useless. For a lot of apps it would be better to just leave the English description. Luckily, Baby Daybook does the right thing and allows translators write a complete description for every language.
  • Translators get a free pro account in the app, with extra features. (This worked for me the last time I checked, in 2014; I haven’t checked recently, but I have no reason to think that it changed.)

All of the above things are really easy and sensible, but for some reason most app developers don’t do this.

App developers, please learn from Drilly Apps how to do it right.

The Case for Localizing Names

I often help my friends and family members open email accounts. Sometimes they are starting to use the Internet and sometimes they move from old email services (Yahoo, Walla!, ISP) to something modern (like it or not, Gmail).

At some point they have to fill their name, which will appear in the “from” field. And then I have to suggest them to write it in Latin characters, even though most of them speak languages that aren’t written in Latin characters – mostly Hebrew and Russian. Chances are that some day they will send an email to somebody who cannot read Russian or Hebrew, and Latin is relatively better known.

Only relatively, though. It may seem obvious to you that everybody knows the Latin script, but in fact, a lot of people are not comfortable with it at all. There are also other complications: lossy and inconsistent transliteration rules (is Amir אמיר or עמיר?), potential right-to-left rendering problems, and more. And of course, all people are happy to see their name in their language.

And people are also happy to see their friends’ names in their own language and not in a foreign or a neutral language. I have, for example, a lot of friends in India. Most of them write their names in English, but some write it in Marathi or in Malayalam. It’s certainly good for them, but in practice it’s much harder for me to find them this way, so English would be better – but Hebrew or Russian would be better yet.

Finally, there are a lot of people in the world who have more than one linguistic background. Mine are Russian, Hebrew and English, and I am really not such a special case. There are many millions of immigrants who have mixed backgrounds: Punjabi-Hindi-Urdu-English, Kurdish-Turkish-German, Kazakh-Russian-Norwegian, and others, and others and others. From each of these backgrounds they have friends, co-workers and family members, with whom they would love to communicate in the respective language. In each of these backgrounds they have friends who would want to find them using the name under which they know them there and using the appropriate language and writing system.

And sometimes people change their names, too. I did it (twice!), and so have many other people.

All this means that people’s names should be translatable, just like books, articles and software interfaces. Facebook and Google+ allow me to add a very limited number of names in foreign languages. Why wouldn’t they let me write my name in four, five, ten languages? This would make it easier for people who speak these languages to find me and to communicate with me. I would go even further and allow people who speak languages that I don’t know well to write my name as their hear it in their language and to add it to my details. Yet again, this would make me easier to find to even more people.

Some degree of automation can be possible. A lot of names are, after all, repetitive, so social networks would be able to suggest people with common names how their name would be written in other languages.

Wikipedia is actually quite good in this regard: Usually people have the same username across projects, and this username is not necessarily written in Latin letters, but people can customize the appearance of their signature in each project. I did it in a few languages, and people who speak those languages appreciate it.

I can only hope that social networks and email systems will allow as much flexibility as possible with this.

English typing computer

I’m in an Internet cafe in Mumbai. I tried to install Firefox with the Marathi interface, but on the computers here fonts for languages of India are not installed. That’s right – on computers in India fonts for languages of India are not installed. Hence, installing Firefox in Marathi failed at the very first stage, because the fonts are needed for the installation wizard.

Actually, I’m not surprised that these fonts are not installed, because it’s not my first time in India. I know that it happens a lot in this country. I would install them, but I don’t have a permission.

I find it incredibly weird – and tragic – that so many people in India don’t even try to use computers in any language except English. The one curious thing that I did find was an “English typing computer” shop. It’s just a place where you can use a computer to write Word documents in Hindi or Marathi, but using an English-based transliteration keyboard rather than the standard Indian Devanagari InScript keyboard, because they find transliteration keyboards easier. Of course, they could just install such a keyboard layout on their computers… but they prefer to go to an “English typing computer” shop.

We, software internationalization people, have so much more work to do.

Always define the language and the direction of your HTML documents, part 01

I received this email from Safari Books Online:

Email in English from Safari Books, oriented like Hebrew
Email in English from Safari Books, oriented like Hebrew. Click to enlarge.

The email is written in English, but notice how the text is aligned unusually to the right. Notice also that the punctuation marks appear at the wrong end of the sentence. I used Firefox developer tools to apply the correct direction, and saw it correctly:

The same email, with corrected left-to-right formatting using Firefox developer tools
The same email, with corrected left-to-right formatting using Firefox developer tools

This happens because I use GMail with the Hebrew interface. GMail has to guess the direction of the emails that I receive, because in plain text there’s no easy way to specify the direction (I hope to discuss it in a separate post soon). Usually GMail guesses correctly. Ironically, for HTML-formatted emails like this one, GMail often guesses incorrectly, even though in HTML, unlike in plain text, it’s quite easy to specify the direction by simply adding dir=”ltr” to the root element of the email.

Unfortunately a lot of HTML authors don’t bother to specify explicit direction. Many are not even aware of this exotic dir attribute. Others think that because “ltr” is the default, they don’t have to specify it. They are wrong: As this email shows, the left-to-right HTML content is embedded in a right-to-left environment, and the “rtl” definition propagates to the embedded content.

You could blame GMail, of course, but it’s much more practical to always define the direction of your HTML content, even if it’s the default. You can never know where will your content end up.

P.S.: I read this post before publishing and suddenly realized that its style is quite similar to “Best Practices” books, such as Damian Conway’s classic “Perl Best Practices” – it tells you to do something that is not obviously needed, and explains why it is needed nevertheless. I like to acknowledge sources of inspiration. Thank you, Damian.