Public Works and Government Services Canada

Understanding search engines

André Guyon
(Language Update, Volume 9, Number 2, 2012, page 22)

In a previous column,* I pointed out how much of a difference a keyword search can make. I would now like to draw your attention to everyday elements of language, such as “insignificant” particles, that can help you understand how search engines work, as well as the search functions built into tools for language professionals.

Noise words

Experienced TERMIUM Plus® users are well aware of this concept. They know there’s no point in using noise words—those small, omnipresent words like articles and prepositions or, if you’ve learned the new grammar, determiners—in their searches. These words are characterized as noise because they have little meaning and are mostly used to link words in a sentence.

Therefore, in TERMIUM Plus®, regardless of whether you enter “gouvernement au Canada” or “gouvernement du Canada” in the “French Terms” search field, you’ll be forwarded to the same record. If that surprises you, then it may also surprise you that the Government of Canada’s terminology and linguistic data bank is not the only one that works this way—in fact, far from it!
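To make the idea concrete, here is a minimal Python sketch of noise-word filtering. It is not how TERMIUM Plus® is actually implemented: the noise-word list and the function name normalize_query are assumptions made up for this illustration.

```python
# Hypothetical noise-word list; the real list used by TERMIUM Plus
# is not given in the article.
NOISE_WORDS = {"au", "aux", "de", "du", "des", "la", "le", "les",
               "a", "an", "of", "the"}


def normalize_query(query: str) -> tuple:
    """Lowercase the query and drop noise words, keeping word order."""
    words = query.lower().split()
    return tuple(w for w in words if w not in NOISE_WORDS)


# Both queries reduce to the same significant words,
# so both would lead to the same record.
print(normalize_query("gouvernement au Canada"))
print(normalize_query("gouvernement du Canada"))
```

Once the noise words are dropped, the two queries are indistinguishable, which is why either one forwards you to the same record.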

It’s done on purpose, but why? Simply because indexing extremely common words slows down most indexes significantly.

In extreme cases, searching for word combinations or expressions such as “one of the” or “oui mais” could easily take a hundred or even a thousand times longer than searching for an expression made up of two “significant” words.
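One way to picture the cost is to assume the engine keeps an inverted index, that is, a list of positions for every word. The article does not say which index structure any given tool uses, so the sketch below is only an illustration, and the four sample documents are invented.

```python
from collections import defaultdict

# Invented sample documents, for illustration only.
DOCUMENTS = [
    "one of the records in the data bank",
    "the government of canada",
    "one of the search engines",
    "terminology of the translation bureau",
]


def build_index(documents):
    """Map each word to the list of (document id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return index


index = build_index(DOCUMENTS)

# Compare how many entries the engine would have to scan for each word.
for word in ("the", "of", "terminology"):
    print(word, len(index[word]))
```

Even in four short documents, the lists for “the” and “of” are several times longer than the list for “terminology.” To find a phrase made up only of noise words, an engine built this way would have to walk through the longest lists in the whole index.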

Nowadays

Machines and software are more powerful than ever. As a result, designers of new products often choose to limit indexes initially and then expand them gradually to include numbers and noise words, depending on the means available.

The length of queried expressions and the fascinating effects of repetition

We know that Google indexes billions of documents in English. So, for fun, let’s conduct a search of more than just two or three words in the world’s most popular search engine.

Allow me to state the obvious: the longer the sentence, the more uncommon it is, even in a gigantic corpus. Is this true for 100-word sentences only? Is it also true for 20‑word sentences? And is it also true for 15‑word sentences? Let’s see just how true this is.

Let’s perform an exact search on part of a question often asked on Google: “Why doesn’t she love.”

We should get nearly 1.5 million hits. When we add the word “me,” the number falls to a little more than half of that (876,000). Now let’s add the word “anymore.” The number of hits drops to 125,000, even though our sentence is extremely common. Now let’s add “like.” We’re left with a measly 913 hits.

The most fascinating part is that most of the 913 hits are occurrences of the same sentence: “Why doesn’t she love me anymore like I love her?”

Just luck perhaps? Well, let’s try “The history of Canada” instead, then add “is,” then “not” and then “quite.” We get a few thousand hits, most of which seem to be occurrences of the sentence “The history of Canada is not quite as explosive.”

So what does that prove? Just that even in a huge corpus, a sentence beyond a certain length will generally be found in identical or very similar contexts.
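The same experiment can be reproduced on a small scale. The following Python sketch counts exact-phrase matches in a tiny invented corpus as the query grows by one word at a time. The corpus and the function name count_exact are assumptions for illustration, and the counts have nothing to do with Google’s figures.

```python
# Tiny invented corpus; each string stands for one document.
CORPUS = [
    "why doesn't she love me anymore like i love her",
    "why doesn't she love me anymore",
    "why doesn't she love me the way i love her",
    "why doesn't she love dogs",
    "why doesn't she love me anymore like i love her",
]


def count_exact(phrase, corpus):
    """Count the documents that contain the phrase as a contiguous word sequence."""
    target = phrase.lower().split()
    n = len(target)
    hits = 0
    for text in corpus:
        words = text.lower().split()
        # Slide a window of n words across the document.
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            hits += 1
    return hits


# Lengthen the query one word at a time and watch the hit count fall.
query = "why doesn't she love"
for extra in ("", "me", "anymore", "like"):
    query = (query + " " + extra).strip()
    print(f"{query!r}: {count_exact(query, CORPUS)}")
```

Each added word can only keep or reduce the number of matching documents, and the documents that survive the longest query are the ones that share the same sentence.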

In short, the length of sentences queried will probably replace many of the complex document classification mechanisms that preoccupy so many semantic Web researchers.

Advice for my language professional colleagues

In addition to using the right keywords, make your exact searches longer, and shorten them only if you find nothing.

Obviously, if a search engine allows for a cascading search, you’ll get even better results.

A cascading search starts in a target set of records (e.g. a particular corpus). If the initial search produces no results, the target set is broadened according to the user’s preferences.

I think that the Translation Bureau will probably want to apply this logic to the tools for its language professionals. For example, users of a shared terminology tool could search first in their own records, then in their own team’s records, then in those of other teams working in similar fields, and then as a last resort in the full database of records.
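The cascade described above could be sketched as follows in Python. This is only an illustration of the logic, not a description of any Translation Bureau tool: the scope names, the sample records and the function cascading_search are all hypothetical.

```python
def cascading_search(term, scopes):
    """Search each scope in order and stop at the first one that yields results.

    `scopes` is an ordered list of (name, records) pairs, narrowest first.
    """
    needle = term.lower()
    for name, records in scopes:
        matches = [r for r in records if needle in r.lower()]
        if matches:
            return name, matches
    # Nothing found anywhere, even in the broadest scope.
    return None, []


# Hypothetical scopes, ordered from the user's own records to the full database.
scopes = [
    ("my records", ["noise word", "keyword search"]),
    ("my team", ["cascading search", "exact search"]),
    ("similar fields", ["translation memory", "bitext"]),
    ("full database", ["bilingual concordancer", "translation memory",
                       "search engine"]),
]

print(cascading_search("translation memory", scopes))
print(cascading_search("concordancer", scopes))
print(cascading_search("ontology", scopes))
```

Because the search stops at the first scope that returns something, the user sees the records closest to his or her own work first, and the full database is consulted only as a last resort.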

Found vs. used: What are the real savings?

All too often, bilingual concordancers and bitext-based translation memories report savings based on the number of matches found (as opposed to no-hit searches). Their designers assume that whatever was found will be used, and count every hit as a saving. Sometimes they even go so far as to calculate the dollar value of the time saved, often without using any realistic measure.

Replacing an old way of searching with a faster one yields savings if, and only if, the results are usable. Anyone whose calculations ignore the fact that approximately 20% to 25% of all successful searches will not be used is wearing rose-coloured glasses. What’s more, the time lost on unsuccessful searches that would have been faster if done otherwise must be subtracted from these savings. Personally, I always prefer calculations that allow for a substantial margin of unused hits.
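A rough way to put numbers on this reasoning is sketched below in Python. The only figure taken from this column is the unused rate of about 25%. The function name net_savings_minutes, the search counts and the minutes per search are hypothetical values chosen for the example.

```python
def net_savings_minutes(hits, misses, minutes_saved_per_used_hit,
                        minutes_lost_per_miss, unused_rate=0.25):
    """Estimate net time saved, counting only the hits that are actually used."""
    used_hits = hits * (1 - unused_rate)
    gross = used_hits * minutes_saved_per_used_hit
    # Unsuccessful searches that would have been faster done otherwise cost time.
    penalty = misses * minutes_lost_per_miss
    return gross - penalty


# Rose-coloured view: every hit is used and failed searches cost nothing.
naive = net_savings_minutes(100, 0, 2.0, 0.0, unused_rate=0.0)

# More cautious view: 25% of hits go unused, and 40 failed searches
# each cost half a minute (hypothetical figures).
realistic = net_savings_minutes(100, 40, 2.0, 0.5, unused_rate=0.25)

print(naive, realistic)
```

With these made-up figures, the cautious estimate comes to 130 minutes against 200 for the naive one, a gap large enough to change the conclusion of a business case.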

* See “Favourite Articles My quest for information in 2010,” Language Update, vol. 7, No. 2 (June 2010).