
Machine translation in a nutshell

André Guyon
(Language Update, Volume 8, Number 1, 2011, page 30)

In my first article of 2011, I’ve decided to provide a layman’s overview of machine translation. I won’t waste time debunking the most common myths. Rather, I plan to deal a death blow to one myth in particular: that translators will be replaced by machines.

To be sure, machines are getting better and better at recycling translations thanks to translation memories and statistical machine translation. Granted, textual predators of all kinds have been using them to fatten their already obscene profit margins. And yes, the confusion between translators and translation software is spreading like an aggressive cancer.

Nevertheless, when used as a substitute for language professionals, translation tools invariably do more harm than good. More on this in my conclusion, but first, to provide you with a better grasp of the subject, let’s take a look at how the different machine translation systems work.

Statistical machine translation systems

Statistical machine translation (SMT) systems have three main components: the language model, the translation model and the decoder. Let’s take a closer look at these components.

Language model

A language model is a list of word sequences, together with their frequencies, extracted from as many texts as possible. Today’s computers are sufficiently powerful to produce lists of word sequences of between one and eight words. Specialists call these sequences n-grams. An n-gram of size 1 is called a unigram, an n-gram of size 2 is called a bigram, and so on, right up to the octagram.

How is it useful?

The language model resolves ambiguities, that is, it chooses the right word or group of words. For example, speech recognition software will encounter two homophonic ambiguities in the following sentence: "His boy fought in the Boer War." Even if the software recognizes my voice perfectly, it must decide whether the second word is "boy" or "buoy" and whether the sixth word is "bore," "boar," "boor" or "Boer."

In the language model lists, it will probably find that "the Boer War" is much more common than all the other possibilities. The model will likely contain "boy fought" but not "buoy fought." At the very least, it will find that "boy" is more common than its homophone.
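The choice between homophones can be sketched in a few lines of Python. This is only an illustration: the counts in NGRAM_COUNTS are invented, and adding up raw counts is a deliberately crude stand-in for the probabilities a real language model would use. The names score and disambiguate are hypothetical.

```python
from itertools import product

# Toy counts, invented for illustration; a real language model would hold
# millions of n-grams counted from a large corpus.
NGRAM_COUNTS = {
    ("boy",): 900,
    ("buoy",): 40,
    ("boy", "fought"): 15,
    ("the", "boer", "war"): 300,
}

def ngrams(words, n):
    """Return every sequence of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def score(words):
    """Add up the counts of all unigrams, bigrams and trigrams in the sentence."""
    total = 0
    for n in (1, 2, 3):
        for gram in ngrams(words, n):
            total += NGRAM_COUNTS.get(gram, 0)
    return total

def disambiguate(slots):
    """slots holds the candidate words for each position in the sentence."""
    candidates = [list(combo) for combo in product(*slots)]
    return max(candidates, key=score)

# What the speech recognizer "heard": one list of homophones per word.
SLOTS = [["his"], ["boy", "buoy"], ["fought"], ["in"], ["the"],
         ["bore", "boar", "boor", "boer"], ["war"]]
best = disambiguate(SLOTS)
print(" ".join(best))  # his boy fought in the boer war
```

With these toy counts, the candidate containing "boy fought" and "the Boer War" outscores every other combination.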

How does the software go about creating these lists?

The software simply identifies individual words using spaces, adds them to a list and counts their occurrences. The result: a table of single words. It then repeats the process for groups of two words, yielding a table of groups of two words, and so on up to eight words.
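The counting procedure just described might look like this in Python. It is a minimal sketch: the function name build_ngram_tables is made up, and splitting on spaces ignores punctuation and other details a real tool would have to handle.

```python
from collections import Counter

def build_ngram_tables(text, max_n=8):
    """Return one frequency table per n-gram size, from 1 up to max_n."""
    words = text.lower().split()  # identify words using spaces
    tables = {}
    for n in range(1, max_n + 1):
        tables[n] = Counter(
            tuple(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    return tables

tables = build_ngram_tables("the cat sat on the mat the cat slept", max_n=3)
print(tables[1][("the",)])        # 3
print(tables[2][("the", "cat")])  # 2
```

Each table maps a word sequence to the number of times it occurs, which is all the language model needs.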

Translation model

Also known as "translation tables," the translation model is a series of tables of groups of words and their equivalents. For example, the equivalent of "Mais qu’est-ce qu’un Schtroumpf?" is "What the Smurf is a Smurf?" The tables are created from aligned bilingual corpora.

How is it useful?

The translation model allows the machine translation engine to find and substitute the target-language wording that most frequently corresponds to the source-language wording in a series of texts. Substitutions are also conditioned by other parameters.
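In its simplest form, "most frequently corresponds" amounts to a lookup in a table of counts. The sketch below uses invented figures and a hypothetical TRANSLATION_TABLE; it leaves out the other parameters mentioned above.

```python
from collections import Counter

# Invented counts: how often each target wording was matched to the source
# wording in an imaginary aligned corpus.
TRANSLATION_TABLE = {
    "time": Counter({"temps": 80, "fois": 15, "heure": 5}),
    "cat": Counter({"chat": 97, "chatte": 3}),
}

def most_frequent_equivalent(source):
    """Return the most common target wording, or None if the source is unknown."""
    options = TRANSLATION_TABLE.get(source)
    if not options:
        return None
    return options.most_common(1)[0][0]

print(most_frequent_equivalent("time"))  # temps
```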

How does the software go about creating these tables?

First, it aligns the sentences or starts with sentences that are already aligned (i.e. matching segments, as in a translation memory). It then aligns the words that make up these aligned sentences. Some software applications include an alignment agent, others do not.

The statistical procedure is quite simple. It consists in identifying which words are almost always present at the same time in matched source- and target-language sentences. By process of elimination, the right word or group of words is identified in almost every case. So we find that "cat = chat," "Canada = Canada," etc. We also find that some words are translated in more ways than one.
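To give a feel for the idea, here is a small Python sketch that pairs each source word with the target word it most consistently appears alongside. The article does not say which statistic is used; this sketch assumes the Dice coefficient, chosen for simplicity, and production systems typically rely on more elaborate iterative models. The function name align_words and the three sentence pairs are invented.

```python
from collections import Counter

def align_words(sentence_pairs):
    """Map each source word to the target word with the highest Dice score."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)   # sentences containing each source word
        tgt_count.update(tgt_words)   # sentences containing each target word
        pair_count.update((s, t) for s in src_words for t in tgt_words)

    best = {}
    for (s, t), both in pair_count.items():
        # Dice coefficient: 1.0 when the two words always appear together
        dice = 2 * both / (src_count[s] + tgt_count[t])
        if s not in best or dice > best[s][1]:
            best[s] = (t, dice)
    return {s: t for s, (t, _) in best.items()}

PAIRS = [
    ("the cat sleeps", "le chat dort"),
    ("the cat eats", "le chat mange"),
    ("the dog sleeps", "le chien dort"),
]
alignment = align_words(PAIRS)
print(alignment["cat"])  # chat
```

Even though "le" appears in every sentence containing "cat," it loses out because it also appears in sentences without "cat." This is the process of elimination at work.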

Then, using a slightly more complex algorithm, the software does the same thing for groups of two words, three words and so on, up to eight words. For example, although "time = temps" and "flies = mouches," Google Translate, an SMT system, can translate "Time flies" as "Le temps passe vite." However, when it encounters a less common expression, the result can be a nonsensical translation.

The decoder

The decoder is the part of the software that takes the source text, searches for segments from longest to shortest, and applies the language model for the target language. It’s called a "decoder" because in machine translation, languages are viewed as a series of codes that must be decoded.

How is it useful?

The decoder produces a relatively crude statistical machine translation.

How does it work?

Generally speaking, the decoder conducts a lengthy series of search and replace operations, starting with the longest sequences of words and ending with the shortest. This avoids wacky translations such as "mouches du temps" for "time flies," a well-known example among machine translation researchers.
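The longest-first search and replace can be sketched as follows. This is a greedy, left-to-right simplification with an invented PHRASE_TABLE; real decoders weigh many competing hypotheses rather than taking the first match.

```python
# Invented phrase table: source word sequences and their equivalents.
PHRASE_TABLE = {
    ("time", "flies"): "le temps passe vite",
    ("time",): "temps",
    ("flies",): "mouches",
    ("like",): "comme",
    ("an", "arrow"): "une flèche",
}

def decode(sentence, table=PHRASE_TABLE, max_len=8):
    """Replace the longest known word sequence at each position."""
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        # Try the longest sequence first, then shorter and shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + n])
            if chunk in table:
                output.append(table[chunk])
                i += n
                break
        else:
            output.append(words[i])  # unknown word: pass it through unchanged
            i += 1
    return " ".join(output)

print(decode("Time flies"))  # le temps passe vite
```

Because the two-word entry is tried before the single words, "time flies" never falls back on the word-for-word "temps mouches."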

However, sometimes a valid translation contained in the table is not used because its frequency is too low.

When combined with the global replacements, the language model helps restore a more normal form to the output. To a great extent, the subtlety of the various software applications lies in the relative priority given to the two components in question (language model and translation model).
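One common way to express this relative priority is a weighted sum of log-probabilities, one term per component. The sketch below assumes that approach; the probabilities, the weights and the names combined_score and pick are all invented for illustration.

```python
import math

def combined_score(tm_prob, lm_prob, tm_weight=1.0, lm_weight=1.0):
    """Weighted sum of the translation model and language model log-probabilities."""
    return tm_weight * math.log(tm_prob) + lm_weight * math.log(lm_prob)

# Invented probabilities for two candidate translations of "time flies".
CANDIDATES = {
    "mouches du temps": {"tm": 0.6, "lm": 0.001},
    "le temps passe vite": {"tm": 0.3, "lm": 0.2},
}

def pick(candidates, tm_weight, lm_weight):
    """Return the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda c: combined_score(
            candidates[c]["tm"], candidates[c]["lm"], tm_weight, lm_weight
        ),
    )

print(pick(CANDIDATES, tm_weight=1.0, lm_weight=1.0))  # le temps passe vite
print(pick(CANDIDATES, tm_weight=1.0, lm_weight=0.0))  # mouches du temps
```

With the language model switched off (a weight of zero), the wacky word-for-word candidate wins; with both components weighted equally, the idiomatic one does.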

Rule-based machine translation systems

Systems based on linguistic rules (those seen on the Web in the past) translate using a dictionary in conjunction with a set of linguistic rules. These systems analyze texts in the same way as text correction software applications like Antidote or Grammatik, which are based on the same linguistic theories.

Imagine two tables, one for the dictionary, the other for the linguistic rules. The software identifies words or phrases belonging to various grammatical categories, then searches the table of linguistic rules for the grammatical categories that appear in the sentence (verb, subject, complement, etc.). It then applies the equivalent target-language rules to the various equivalents in the target column of the dictionary. For example, "siéger à quelque chose = sit on something" when the something is not a physical object (a committee, for example).
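The two tables can be mocked up in a few lines of Python. This is a toy sketch, not how any particular product works: the four-word DICTIONARY and the single reordering rule in RULES are invented, and agreement, verb forms and everything else a real system handles are left out.

```python
# Table 1: the dictionary (French word -> grammatical category, English equivalent).
DICTIONARY = {
    "le": ("DET", "the"),
    "chat": ("NOUN", "cat"),
    "noir": ("ADJ", "black"),
    "dort": ("VERB", "sleeps"),
}

# Table 2: the rules (source category pattern -> order of the words in English).
RULES = {
    ("DET", "NOUN", "ADJ"): (0, 2, 1),  # French noun + adjective -> adjective + noun
}

def translate(sentence):
    words = sentence.lower().split()
    # Unknown words get the category UNK and are left untranslated.
    entries = [DICTIONARY.get(w, ("UNK", w)) for w in words]
    categories = [cat for cat, _ in entries]
    targets = [tgt for _, tgt in entries]

    output, i = [], 0
    while i < len(words):
        for pattern, order in RULES.items():
            n = len(pattern)
            if tuple(categories[i:i + n]) == pattern:
                output.extend(targets[i + k] for k in order)
                i += n
                break
        else:
            output.append(targets[i])  # no rule applies: translate word for word
            i += 1
    return " ".join(output)

print(translate("le chat noir dort"))  # the black cat sleeps
```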

These systems represent the world as a series of relationships, a bit like object-oriented programming languages.

In the late 1980s, the Translation Bureau conducted trials using a rule-based machine translation system called LOGOS (now available as open-source software). Users could add to the dictionary and rule set, but this proved to be a painstaking task.

In contrast, rival SMT systems have the advantage of requiring no human intervention, but have the drawback of requiring large amounts of data.

Compared with SMT systems, rule-based systems generally yield superior output in terms of syntax. For example, gender and number agreement is usually correct.

Hybrid machine translation systems

In principle, hybrid systems offer the best of both worlds. They automatically populate dictionaries using corpora much smaller than those required by SMT systems. However, researchers have found that for the moment, SMT systems with access to enough data yield better results than do hybrid systems. It is conceivable that in the medium term, this will no longer be the case.

Some hybrid systems use a statistical engine to identify errors that are routinely corrected by language professionals in rule-based machine translation output. Researchers refer to this as statistical post-editing (SPE).
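A very rough sketch of the idea: compare raw machine output with what a language professional turned it into, count the changes, and reapply the most frequent ones. The example sentences and the names learn_corrections and post_edit are invented. Replacing one word by another regardless of context is far cruder than what an actual SPE engine does.

```python
from collections import Counter, defaultdict

def learn_corrections(examples):
    """Learn word replacements from (machine output, human-corrected) pairs."""
    outcomes = defaultdict(Counter)
    for mt, edited in examples:
        mt_words, edited_words = mt.split(), edited.split()
        if len(mt_words) != len(edited_words):
            continue  # this sketch only handles one-for-one word replacements
        for m, e in zip(mt_words, edited_words):
            outcomes[m][e] += 1

    rules = {}
    for word, counter in outcomes.items():
        top, _ = counter.most_common(1)[0]
        if top != word:  # keep a rule only if the word is usually changed
            rules[word] = top
    return rules

def post_edit(mt, rules):
    """Apply the learned replacements to new machine output."""
    return " ".join(rules.get(w, w) for w in mt.split())

EXAMPLES = [
    ("she sits at the committee", "she sits on the committee"),
    ("he sits at the board", "he sits on the board"),
    ("they met at the office", "they met at the office"),
]
rules = learn_corrections(EXAMPLES)
print(post_edit("we sit at the council", rules))  # we sit on the council
```

The third example shows the weakness of ignoring context: "at" was correct there, yet the learned rule would change it anyway.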

Recently, suppliers of rule-based systems (notably Systran) have become suppliers of hybrid systems. Vendors of translation memory management software have, for their part, decided to incorporate statistical engines into their software to varying degrees.

Some final thoughts

We can safely say that in a few years’ time, hybrid systems will be the only game in town. The quality of hybrid systems will ultimately surpass that of purely statistical systems.

Since SMT systems primarily recycle expressions that already exist, we must continue to feed them with translations, or they will cease to be useful. New expressions are being created every day!

If language professionals are seen as mere competitors of machines, the most talented will turn to other pursuits, the quality of the raw material that allows machine translation systems to work their magic will decline, and hiring people who actually know how to write, translate, interpret and manage terminology will become an increasingly costly proposition.

Machine translation is here to stay. Why not make it another tool in the language professional’s kit?

A Brief Glossary of Machine Translation Terminology

Term / (French equivalent) / Definition

aligned bilingual corpus

(corpus bilingue aligné)

Bilingual corpus in which each segment (sentence or group of sentences) in the source language is matched to a segment in the target language.

bilingual corpus

(corpus bilingue)

Collection of texts that exist in both source language and target language.

bitext

(bitexte)

Text containing both source language and target language. Unlike in a translation memory, the order of the segments is preserved.

corpus

(corpus)

Collection of texts.

decoder

(décodeur)

The actual machine translation engine.

language model

(modèle de langue)

List of groups of words (n-grams) and their frequency in a corpus.

machine translation

(traduction automatique)

Translation produced by a machine. Machine translation can be rule-based, statistical or hybrid.

MT output

(sortie machine)

Raw machine translation output.

post-editing

(postédition)

Processing performed on machine output to render it acceptable.

rule-based machine translation (RBMT)

(traduction automatique à base de règles (TABR))

Classic machine translation that involves a host of rules defined by linguists in grammar books. RBMT systems include Systran, Reverso, PROMT and LOGOS.

segment

(segment)

The unit of text being translated. A segment is generally a sentence or sequence of words set off by a return or a punctuation mark.

statistical machine translation (SMT)

(traduction automatique statistique (TAS))

Machine translation that relies exclusively on statistics, in contrast to rule-based machine translation. SMT systems include Language Weaver, PORTAGE and Moses.

target language

(langue d’arrivée, langue cible)

Language in which the translation is being produced.

training

(entraînement)

Process whereby a statistical system is "created." The system’s output is compared with one or more human translations, and its parameters are adjusted accordingly.

translation

(traduction)

Transfer of a message from one culture to another.

translation memory

(mémoire de traduction)

Collection of segments (sentences) and their translations. Unlike in a bitext, the segments are not necessarily in order.

translation model

(modèle de traduction)

List of the frequencies with which groups of words (n-grams) in the source language correspond to equivalent groups of words in the target language.

word alignment

(alignement de mots)

Process whereby each word in a source sentence is matched to an equivalent word in the target sentence. For example, if, for each occurrence of "Smurf" in the English source text, "Schtroumpf" appears in the corresponding segment in the French target text, chances are that "Smurf = Schtroumpf."