How to Automate Your Glossary in Content Creation

Knowledge Management (KM) is becoming increasingly important. It boosts sales, efficiency, and employee satisfaction. KM is an interdisciplinary task that requires the cooperation of subject matter experts, IT, writers, and management. Together they create a shared knowledge base that is the foundation of cooperation, communication, and daily business.

Having a shared understanding of important terms is key when creating the content for a knowledge base. Otherwise, the people using the information get confused, and the information loses most of its value. Glossaries help to avoid such issues by providing shared definitions as well as allowed and forbidden synonyms and abbreviations. But do you really want all your authors to look up every single term? And how should they know whether a definition exists for a term?

Your glossary should tell them proactively. You need to automate your glossary. The following steps tell you how to do it at a basic, advanced, or complex technological level.

1. Declare the relevant scope

This is the most difficult step. Luckily, it is only necessary for large glossaries that provide different definitions for a term in different contexts. If your glossary differentiates between contexts, you need to tell it which ones apply to the text you want to check. The simple way is to let the author select the relevant scopes.

If you have the resources, you can create a small UI. An even simpler way is to automatically create a CSV file with a list of all available scopes and let the author put an X next to each scope that applies. The glossary can read this information and thus create a list of relevant terms.
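As a rough illustration of the CSV approach, the following Python sketch reads such a file and filters the glossary. The column names (scope, selected) and the shape of the glossary entries are assumptions made for this example; your own export will probably look different.

```python
import csv
import io


def read_selected_scopes(csv_file):
    """Return the scopes the author marked with an X (upper or lower case)."""
    reader = csv.DictReader(csv_file)
    return [
        row["scope"].strip()
        for row in reader
        if (row.get("selected") or "").strip().lower() == "x"
    ]


def relevant_terms(glossary, scopes):
    """Keep only the glossary entries that belong to a selected scope."""
    wanted = set(scopes)
    return [entry for entry in glossary if entry["scope"] in wanted]


# Hypothetical scope file as the author might send it back.
scope_csv = io.StringIO("scope,selected\nsales,X\nengineering,\nlegal,x\n")

# Hypothetical glossary entries; a real glossary would also carry definitions.
glossary = [
    {"term": "lead", "scope": "sales"},
    {"term": "lead", "scope": "engineering"},
    {"term": "party", "scope": "legal"},
]

scopes = read_selected_scopes(scope_csv)
terms = relevant_terms(glossary, scopes)
print(scopes)
print([(entry["term"], entry["scope"]) for entry in terms])
```

With a real file you would pass open("scopes.csv", newline="") instead of the in-memory string used here.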

A more elaborate method would be to use text analytics and AI to derive the topic and the relevant scopes from the text. Setting this up will take some time, so you will want to make sure it is worth the effort first.
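Before investing in real text analytics, you could test the idea with something much simpler. The sketch below is not AI; it only counts hand-picked keywords per scope and suggests every scope that reaches a threshold. The keyword lists, the scope names, and the min_hits threshold are all invented for this example.

```python
import re
from collections import Counter

# Hypothetical keyword lists; in practice you would maintain one per scope.
SCOPE_KEYWORDS = {
    "sales": {"offer", "discount", "customer", "contract"},
    "engineering": {"server", "deploy", "release", "bug"},
}


def suggest_scopes(text, scope_keywords, min_hits=2):
    """Suggest scopes whose keywords occur at least min_hits times in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        scope: sum(words[keyword] for keyword in keywords)
        for scope, keywords in scope_keywords.items()
    }
    # Highest score first, so the most likely scope leads the list.
    return sorted(
        (scope for scope, hits in scores.items() if hits >= min_hits),
        key=lambda scope: -scores[scope],
    )


sample = (
    "The customer asked for a discount before signing the contract. "
    "One bug remains."
)
suggested = suggest_scopes(sample, SCOPE_KEYWORDS)
print(suggested)
```

The author can then confirm or correct the suggestion, which keeps a human in the loop while the heuristic is still rough.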

2. Prepare the text

Glossaries store terms in their basic form. In texts, grammar changes the words. For languages where words do not change much (e.g. English), you can ignore this for a basic automation, especially if your glossary mainly contains nouns. For languages with more inflection, and for a more precise result, you need to bring the words back into their basic form. You can do this by simply removing the most common suffixes or by applying more sophisticated techniques like stemming or lemmatization.
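The simplest variant, removing common suffixes, might look like the sketch below. The suffix rules are a small assumed set for English plurals, not a real stemmer, and they will mangle some words (for example, "this" becomes "thi"). That is usually tolerable as long as you normalize the glossary terms with the same function, so both sides are mangled in the same way.

```python
import re

# Assumed rules for English plurals: (suffix to remove, text to put back).
# Order matters: longer and protective rules come before the plain "s" rule.
SUFFIX_RULES = (
    ("ies", "y"),    # glossaries -> glossary
    ("sses", "ss"),  # processes -> process
    ("ss", "ss"),    # class stays class
    ("s", ""),       # terms -> term
)


def strip_suffix(word, min_stem=3):
    """Lowercase a word and remove one common suffix, if enough stem remains."""
    lowered = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: len(lowered) - len(suffix)] + replacement
    return lowered


def normalize_text(text):
    """Split a text into words and bring each word into its rough basic form."""
    return [strip_suffix(word) for word in re.findall(r"[A-Za-z]+", text)]


normalized = normalize_text("Glossaries store terms and processes")
print(normalized)
```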

3. Find the used terms

With a simple search script, find the terms, abbreviations, and synonyms used in the text. In addition, search for any word that contains multiple capital letters, as it is probably an abbreviation. To make the results more useful, let the script sort the matches into at least three categories:

Category 1: matches that are almost certainly a misuse of a term. This includes:

forbidden synonyms that are not listed as an allowed term or allowed synonym of another term
abbreviations not listed in the glossary

Category 2: matches that might be a misuse. These are mainly phrases that are forbidden synonyms in one context but allowed in another one.

Category 3: matches that are probably used correctly. This includes:

phrases listed only as preferred term or allowed synonym
abbreviations listed in the glossary

This categorization helps you decide which matches must be checked and for which ones you are willing to take the risk. Matches of category 1 mean that either the author used a wrong term or the glossary is incomplete. They must be checked. Matches of category 2 should be checked, but if the text is of low relevance or needed urgently, you might skip them. Matches of category 3 only need to be verified for the most important texts, as it is possible that the author used them in a way not covered by the definition in the glossary. But assuming the glossary is mostly complete and the author knows the topic well, you are unlikely to find mistakes among these matches.
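A search script along these lines could be sketched as follows. The glossary layout (term, synonyms, forbidden, abbreviations) is an assumption for this example, and the sketch ignores scopes and word normalization to stay short. A word counts as an abbreviation here if it contains at least two capital letters.

```python
import re

# Hypothetical glossary format, invented for this sketch.
GLOSSARY = [
    {"term": "knowledge base", "synonyms": [], "forbidden": ["wiki"],
     "abbreviations": ["KB"]},
    {"term": "customer", "synonyms": ["client"], "forbidden": ["buyer"],
     "abbreviations": []},
    {"term": "operator", "synonyms": [], "forbidden": ["client"],
     "abbreviations": []},
]


def categorize(statuses):
    """Map the ways a phrase is listed in the glossary to category 1, 2, or 3."""
    if statuses == {"forbidden"}:
        return 1  # forbidden everywhere: almost certainly a misuse
    if "forbidden" in statuses:
        return 2  # forbidden in one entry, allowed in another
    return 3      # only listed as preferred term or allowed synonym


def find_matches(text, glossary):
    """Return (position, phrase, category) tuples, sorted by position."""
    index = {}
    known_abbreviations = set()
    for entry in glossary:
        for phrase in [entry["term"], *entry["synonyms"]]:
            index.setdefault(phrase.lower(), set()).add("allowed")
        for phrase in entry["forbidden"]:
            index.setdefault(phrase.lower(), set()).add("forbidden")
        known_abbreviations.update(entry["abbreviations"])

    matches = []
    for phrase, statuses in index.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for hit in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((hit.start(), hit.group(), categorize(statuses)))

    # Any word with two or more capital letters is treated as an abbreviation.
    for hit in re.finditer(r"\b\w*[A-Z]\w*[A-Z]\w*\b", text):
        category = 3 if hit.group() in known_abbreviations else 1
        matches.append((hit.start(), hit.group(), category))

    return sorted(matches)


TEXT = "Our wiki helps every client. The KB supports KM for each customer."
for position, phrase, category in find_matches(TEXT, GLOSSARY):
    print(position, phrase, category)
```

In this sample, "wiki" and the unlisted abbreviation "KM" land in category 1, "client" in category 2, and "KB" and "customer" in category 3.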

If you want to, you can split up the third category, so it differentiates between the use of allowed synonyms and the preferred term.

4. Do the correction

Once again, there is a fancy and a fast way to do this. The fast one is: Let the search script create a list of all matches, containing some additional information:

the position of the matching phrase in the text (e.g. paragraph or sentence number)
some words before and after the phrase
the term that produced the match (might be different from the phrase in the text, e.g. if a synonym was used)
the term’s definition
the match’s category

With that information, the author can replace misused terms with the preferred one and check if the definition corresponds to what they wanted to say. If the context provided in the list is not enough to decide, they can go to the text and check it there.
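The list described above could be assembled roughly like this. The sketch assumes that the search step delivers each hit as a tuple of start offset, end offset, glossary term, and category, and that paragraphs are separated by blank lines. Both are assumptions for this example, as are the sample definitions.

```python
def context_window(text, start, end, words=3):
    """Return a few words before and after the span text[start:end]."""
    before = text[:start].split()[-words:]
    after = text[end:].split()[:words]
    return " ".join(before), " ".join(after)


def paragraph_number(text, position):
    """1-based paragraph number, assuming blank lines separate paragraphs."""
    return text[:position].count("\n\n") + 1


def build_report(text, hits, definitions):
    """Turn raw hits into rows the author can work through."""
    rows = []
    for start, end, term, category in hits:
        before, after = context_window(text, start, end)
        rows.append({
            "paragraph": paragraph_number(text, start),
            "phrase": text[start:end],
            "before": before,
            "after": after,
            "term": term,
            "definition": definitions.get(term, ""),
            "category": category,
        })
    return rows


# Hypothetical sample data.
text = "We store every article in the wiki today.\n\nEach client can search it."
definitions = {
    "knowledge base": "Central store of shared content.",
    "customer": "Party that buys our products.",
}
wiki_at = text.index("wiki")
client_at = text.index("client")
hits = [
    (wiki_at, wiki_at + len("wiki"), "knowledge base", 1),
    (client_at, client_at + len("client"), "customer", 2),
]

rows = build_report(text, hits, definitions)
for row in rows:
    print(row)
```

Each row is a plain dictionary, so writing the list to a CSV file with csv.DictWriter for the author is a small additional step.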

The fancy way is to build a UI or an editor plugin, using the list as input. Let the UI jump from match to match, providing the information listed above and buttons for ignoring the match, replacing it, or requesting a change to the glossary.
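Whatever the UI looks like, the logic behind the three buttons can stay small. The sketch below leaves out the UI itself and takes a decide function in its place; the match dictionaries and the action names ("ignore", "replace", "request_change") are assumptions for this example.

```python
def make_match(text, phrase, term, category):
    """Build a match record for the first occurrence of phrase (demo helper)."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "phrase": phrase,
            "term": term, "category": category}


def apply_decisions(text, matches, decide):
    """Apply one decision per match.

    decide(match) stands in for the buttons and returns "ignore",
    "replace", or "request_change".
    """
    change_requests = []
    # Work from the end of the text backwards, so that a replacement
    # does not shift the offsets of matches that are still to be handled.
    for match in sorted(matches, key=lambda m: m["start"], reverse=True):
        action = decide(match)
        if action == "replace":
            text = text[:match["start"]] + match["term"] + text[match["end"]:]
        elif action == "request_change":
            change_requests.append(match["phrase"])
    return text, change_requests


original = "Open the wiki and ask the client."
matches = [
    make_match(original, "wiki", "knowledge base", 1),
    make_match(original, "client", "customer", 2),
]


def demo_decide(match):
    """Demo policy: replace clear misuses, query the glossary team otherwise."""
    return "replace" if match["category"] == 1 else "request_change"


corrected, requests = apply_decisions(original, matches, demo_decide)
print(corrected)
print(requests)
```

In a real plugin, decide would wait for the author's click instead of following a fixed policy.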

With this technique, you can take a considerable amount of work off your authors and proofreaders while making the use of language in your content more consistent. All it takes is a simple script and a glossary in a format the script can read. Once you have that first version, you can improve it by adding more precise or user-friendly technologies. Your authors will thank you for the support. Your terminologist will thank you for the improved visibility of their work and the fresh input they get from authors. Your management will thank you for the increased efficiency in content creation. And most importantly, your information users will thank you for easily understandable information.

Do you have further questions? We will be happy to advise you: marketing@avato.net