Halve the effort: Automatically detect duplicate content in 4 steps

Drafts, working copies, old versions. Over time, numerous variants of a document can accumulate. For example, in marketing, the same text appears in a flyer, a brochure and a newsletter. Duplicates also occur in documentation, for example when content for very similar products is maintained separately. In addition, sometimes information is copied from one knowledge silo to another so that more users can access it.

Why is duplicate content bad?

At first glance, all these copies seem justified. After all, they serve a purpose. Over time, however, duplicate content leads to problems. If something changes, the information must be updated in several places. If a copy is overlooked or a small mistake creeps in, contradictions arise. If work then continues on several versions, it becomes hard to tell which variant is current, or whether any version is entirely correct. Users are left wondering which one applies to their case.

In other words, duplicate (or even triplicate and quadruplicate) content

- at least doubles the maintenance effort
- leads to errors and follow-up costs because the information becomes less reliable
- impairs the user experience by leaving users confused

What can I do about duplicate content?

Content management prevents problems with duplicate content. It ensures that everything that belongs together can be found easily. To do this, the information is stored centrally and tagged with the necessary metadata. A monitored process with regular checks according to the four-eyes principle, in which a second person reviews every change, is also part of this.

But what can you do if the duplicates are already there? Searching large collections of documents by hand takes a long time and finds little. It is quicker and easier to detect duplicate content automatically. This is done by measuring the similarity of each pair of documents. The result is a list of duplicates. Afterwards, you can discard one version or merge the two documents into one.

How can I automatically detect duplicate content?

This requires the following 4 steps:

Step 1: Collect documents

First you need to know where the information is located and in what format. Wikis, shared drives and data shares are the usual suspects. In terms of formats, you will mainly be dealing with Word, PDF, HTML and, in marketing, InDesign. Note how often each format is used. This will save you work in the next step.
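A quick way to see how often each format occurs is to count file extensions. The following is only a minimal sketch: the function name `count_formats` and the file names are made up for illustration, and the extension is used as a rough stand-in for the actual format.

```python
from collections import Counter
from pathlib import Path


def count_formats(paths):
    """Count how often each file extension occurs (case-insensitive)."""
    counts = Counter()
    for path in paths:
        suffix = Path(path).suffix.lower() or "(no extension)"
        counts[suffix] += 1
    return counts


# Hypothetical file names, used here only to illustrate the idea.
example_files = [
    "flyer_2021.indd",
    "brochure.pdf",
    "newsletter.html",
    "manual_v1.docx",
    "manual_v2.DOCX",
    "manual_final.pdf",
]

# Print the formats, most frequent first.
for suffix, count in count_formats(example_files).most_common():
    print(suffix, count)
```

To scan a real folder, you could pass `(p for p in Path(root).rglob("*") if p.is_file())` instead of the example list, where `root` is whichever directory you want to inspect.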

Step 2: Extract content

This is the most technically complex step. The text of all documents must be brought into a uniform form. Plain text without markup or layout is best. This can be automated, and there are tools that support it. If there is no ready-made solution for one of your formats, you have to make a choice: either you (or the developer of your choice) write a small programme that extracts the text, or you ignore the format. Which is better depends on

- how often the format occurs
- how likely it is that there are duplicates in this format
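To give an impression of what such a small extraction programme can look like, here is a sketch for HTML that uses only Python's standard library. The names `TextExtractor` and `html_to_text` are chosen for this example; other formats such as Word, PDF or InDesign need their own tools and are not covered here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside of skipped elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse all whitespace runs into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the plain text of an HTML string, without markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Product A</h1><p>Press the <b>green</b> button.</p></body></html>"
)
print(html_to_text(sample))  # Product A Press the green button.
```

Because the pieces are joined with spaces, punctuation that directly follows an inline tag may end up separated from its word. For a word-based comparison this does not matter.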

Step 3: Vectorise

This step does the magic but takes minimal effort. Computers cannot work with raw text directly, but there are ready-made solutions for this obstacle. In Python, for example, packages like scikit-learn or gensim provide everything you need. They make it possible to turn your documents into vectors with just a few lines of code. And computers can work very well with vectors.

Put simply, a list of all the words that appear in your documents is created. Then, for each document, the number of times each word occurs is counted. Each document thus becomes a series of numbers. These numbers can be understood as a point or vector in a coordinate system with one axis per word. Similar documents (i.e. those in which the same words occur similarly often) are close to each other.
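The word-counting idea can be sketched in plain Python. This is an illustration of the principle, not the implementation used by any particular package; the function names and example sentences are made up. In practice, a ready-made class such as scikit-learn's `CountVectorizer` does the same job for large collections.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents):
    """Sorted list of all distinct words in the collection."""
    return sorted({word for doc in documents for word in tokenize(doc)})


def vectorize(document, vocabulary):
    """One count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]


documents = [
    "Press the green button to start the pump.",
    "Press the red button to stop the pump.",
]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(doc, vocabulary) for doc in documents]

print(vocabulary)
# ['button', 'green', 'press', 'pump', 'red', 'start', 'stop', 'the', 'to']
for vector in vectors:
    print(vector)
# [1, 1, 1, 1, 0, 1, 0, 2, 1]
# [1, 0, 1, 1, 1, 0, 1, 2, 1]
```

The two vectors differ in only four positions, which is what makes the two sentences "close" to each other.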

Tip: Before you convert the documents, you should

- Remove stopwords. These are words that occur often but carry little meaning, such as articles, conjunctions and auxiliary verbs. There are ready-made lists for most languages that you can use to filter out the stopwords automatically.
- Replace numbers. If two documents are the same except for the date, a phone number or the product version, they are still duplicate content. Therefore, replace each number with a placeholder.
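Both tips can be combined in one small preprocessing function. The sketch below is only an illustration: the stopword list is deliberately tiny, and the placeholder `NUM` and the name `preprocess` are assumptions made for this example.

```python
import re

# Tiny illustrative list; real projects would use a fuller, language-specific list.
STOPWORDS = {"a", "an", "and", "is", "of", "on", "the", "to"}
NUMBER_PLACEHOLDER = "NUM"  # hypothetical placeholder token


def preprocess(text):
    """Lowercase, replace numbers with a placeholder and drop stopwords."""
    text = text.lower()
    # A number may contain separators, e.g. 2.1, 03/05/2021 or 10:30.
    text = re.sub(r"\d+(?:[.,:/-]\d+)*", NUMBER_PLACEHOLDER, text)
    words = re.findall(r"\w+", text)
    return [word for word in words if word not in STOPWORDS]


old = "Version 2.1 of the manual, released on 03/05/2021."
new = "Version 3.0 of the manual, released on 17/11/2022."

print(preprocess(old))  # ['version', 'NUM', 'manual', 'released', 'NUM']
print(preprocess(old) == preprocess(new))  # True
```

After preprocessing, the two sentences are indistinguishable, so the differing version number and date no longer hide the duplicate.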

Step 4: Measure similarity

Now you only have to measure the distance between your documents. Common methods are:

- Euclidean distance (length of the straight line between the points)
- Taxicab distance, also called Manhattan distance (sum of the absolute differences for each coordinate)
- Cosine similarity (cosine of the angle between the vectors)
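The three measures are short enough to write out by hand. The sketch below uses plain Python lists as vectors; the two example vectors are assumed word counts of two short, similar sentences. Returning 0.0 for an empty document is a convention chosen for this example.

```python
import math


def euclidean_distance(a, b):
    """Length of the straight line between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def taxicab_distance(a, b):
    """Sum of the absolute differences for each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for documents without any words
    return dot / (norm_a * norm_b)


a = [1, 1, 1, 1, 0, 1, 0, 2, 1]
b = [1, 0, 1, 1, 1, 0, 1, 2, 1]

print(euclidean_distance(a, b))           # 2.0
print(taxicab_distance(a, b))             # 4
print(round(cosine_similarity(a, b), 3))  # 0.8
```

Note that the two distances grow as documents become less alike, while the cosine similarity shrinks. The distances also depend on document length, whereas the cosine similarity only looks at the proportions of the words.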

Regardless of how you measure, you get a value for each pair of documents. By spot-checking a sample of pairs, you can quickly find out which values indicate duplicates.

In our projects we use the cosine similarity. For word counts, it lies between 0 (documents with no words in common) and 1 (documents with identical word proportions, such as exact copies). In our experience, documents with a similarity of 0.95 or higher are duplicates. In most cases, a couple of short sentences are missing in one version or individual names have been exchanged. With values between 0.9 and 0.95, the documents are still very similar but have important differences, such as an additional work step.

Have the programme you use for measuring create a list for each document that names all documents particularly similar to it. This gives you an overview of all duplicates in your collection.
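Such a list can be produced by comparing every pair of documents against a threshold. The following is a minimal sketch under a few assumptions: the document names and texts are invented, the threshold of 0.95 is the value from our experience and should be checked against your own samples, and the pairwise loop is only practical for small collections.

```python
import math
import re
from collections import Counter

DUPLICATE_THRESHOLD = 0.95  # verify this value on samples from your own collection


def word_counts(text):
    """Count how often each word occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two Counters of word counts."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(documents, threshold=DUPLICATE_THRESHOLD):
    """Map each document name to the names of all documents at or above the threshold."""
    counts = {name: word_counts(text) for name, text in documents.items()}
    names = sorted(counts)
    duplicates = {name: [] for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cosine_similarity(counts[first], counts[second]) >= threshold:
                duplicates[first].append(second)
                duplicates[second].append(first)
    return duplicates


# Hypothetical mini collection.
documents = {
    "flyer.txt": "The pump starts when you press the green button. "
                 "Check the oil level before every start.",
    "brochure.txt": "The pump starts when you press the green button. "
                    "Check the oil level before each start.",
    "newsletter.txt": "Our new newsletter presents the latest product releases.",
}

for name, similar in find_duplicates(documents).items():
    print(name, "->", similar)
```

In this example the flyer and the brochure differ in a single word and are reported as duplicates of each other, while the newsletter has no match.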

What do I do now with the duplicate content?

That depends on the case:

- If the media are different (e.g. flyer and brochure), you probably still need both versions. Make sure that all users know that there are multiple copies. For example, store all variants in one place or use shortcuts. Tip: Automatic duplicate detection will alert you if the copies drift apart unintentionally.
- You should archive or remove old versions. This way you prevent someone from accidentally using outdated information.
- Drafts should be clearly marked as such and possibly kept separately. End users (whether internal or external) should only have access to final released versions. This way, only verified information is circulated.
- You can merge similar documents into one. This applies both to documents that describe similar things or processes and to cases where work has continued on several copies of a document. Merging has several advantages:
  - If anything changes, you only have to change the content in one place. You save time and no contradictions can arise.
  - There is no danger of anyone confusing the cases, as the differences are clearly visible. This prevents mistakes.
  - As you reduce the number of documents, it becomes easier to find the document you need. This saves time.

Conclusion

Duplicate content costs time and leads to errors. It creates additional work and can cause confusion. The only way to identify duplicates reliably and efficiently is to use automation. The most complex step here is extracting the text from the documents. Once this is done, you can quickly and easily create an overview of all duplicates. With this list, it is then easy to identify and eliminate problems and risks.

Imprint:

Date: 2022
Contact: marketing@avato.net