Document Clean-up Part 3

Cleaning up legacy document backlogs

Uncontrolled documents - often on shared drives and in mailboxes - do not have the value they should have for the organization. At worst they do harm, for example when personally identifiable information floats around a drive unlawfully. How do you get to grips with the mass of documents?

Part 3 - Duplicates

When large numbers of documents accumulate over time, for example on a file drive, there will certainly be a lot of duplicates. If you could pinpoint the duplicates during the clean-up, the task would be simpler. This article is about finding duplicates.

However obvious it may seem, looking at file names is not a good way to find duplicates. An identical file name is no guarantee that two files are the same. Of course, it helps to also look at the file date and file size in Windows, but that is time consuming and not robust. Moreover, if someone has taken a copy of a file and given it a new name, you will not recognize the two as duplicates if you rely on file names. In other words, it is not a very reliable method for major clean-up tasks.


You can calculate a kind of fingerprint for a file, called a checksum. The checksum originates in cryptography. Imagine that one person has a file and transfers it to another, and we want to be sure that what is received is also what was sent. Both parties can then calculate the checksum. If the two checksums are the same, the file has not been tampered with during the transfer. There are several different algorithms, such as MD5 and SHA1, and they offer different degrees of assurance that two files really are the same when the checksums say so. But the algorithms developed for this purpose are also quite convenient for us information specialists. With the small uncertainty that comes with algorithms of this type, we can say that if two files have the same checksum, they are duplicates. And the good thing is that only the contents of the file - not its name or other metadata in Windows - are included in the calculation, so two files that are the same, but where the file name has been changed, can be identified by this method. A checksum calculation can therefore be a great help in detecting duplicates.
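As a minimal sketch of the idea, the following Python function calculates a checksum from the contents of a file only. The function name file_checksum and the choice of SHA-256 as the default are our assumptions for the illustration; other algorithms can be passed by name.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex checksum of a file's contents.

    Only the bytes inside the file are read, so the file name and
    the dates shown in Windows have no influence on the result.
    The default algorithm (SHA-256) is an assumption for this sketch.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical contents give the same result from file_checksum even if one of them has been renamed, while a change of a single byte gives a different result.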

Brutal deletion

Let's assume that we have calculated checksums, either ourselves or with a tool, for all files in a folder hierarchy. How do we use that information? The obvious move is to delete all extra copies, leaving only one of each file. That is also what many tools offer to do. But an information specialist may get a little nervous at the prospect, because the fact that a copy of a file sits under Project A but also under Subject B gives some hints about a relevant classification of that document. Therefore, it is often more valuable to use the duplicate information as an ongoing aid that provides an overview during the clean-up. Look carefully at the options for acting on the results if you intend to purchase a tool. The technology provider's idea of a clean-up may not match the information specialist's.
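The alternative to deleting can be sketched roughly as follows: group the files by checksum and list the folders in which each duplicate occurs, without removing anything. The function names and the input format (pairs of path and checksum) are our assumptions for the illustration.

```python
from collections import defaultdict
from pathlib import PurePath


def duplicate_groups(records):
    """Group (path, checksum) pairs and keep checksums seen more than once."""
    by_checksum = defaultdict(list)
    for path, checksum in records:
        by_checksum[checksum].append(PurePath(path))
    return {
        checksum: sorted(paths)
        for checksum, paths in by_checksum.items()
        if len(paths) > 1
    }


def folders_sharing_copies(groups):
    """For each duplicate group, list the distinct folders holding a copy."""
    return {
        checksum: sorted({str(path.parent) for path in paths})
        for checksum, paths in groups.items()
    }


# Hypothetical example data: nothing is deleted, we only get an overview.
records = [
    ("Project A/contract.pdf", "abc"),
    ("Subject B/agreement.pdf", "abc"),
    ("Project A/notes.txt", "def"),
]
overview = folders_sharing_copies(duplicate_groups(records))
```

Here overview tells us that the same document lives under both Project A and Subject B, which is exactly the kind of hint that would be lost if the extra copy were simply deleted.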

Homemade checksum calculation

If you're into that sort of thing, or can persuade IT to do it, you can quite easily put together a few small scripts that iterate through folder structures and find files. You then need to calculate the checksum of each of these files, and for that purpose Microsoft has a small component, "Microsoft File Checksum Integrity Verifier", that does the calculation and can be called from, for example, command line scripts. The component must be downloaded from the Microsoft website (e.g. search for FCIV). We have done this now and then and delivered the result as lists in spreadsheets, which, with pivot tables etc., have then helped us further. But be aware that this does not scale to large quantities. 100 or 1000 files is OK, but that is probably also where it stops. Beyond that it becomes completely unmanageable.
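A home-made script of this kind could look roughly like the sketch below. Note that it is not the FCIV-based approach described above: as an assumption for the illustration it uses Python's built-in hashlib module instead, and the function names and CSV column names are made up by us. The output is a CSV file that can be opened in a spreadsheet and summarized with a pivot table.

```python
import csv
import hashlib
import os


def sha256_of(path):
    """Checksum of the file contents (SHA-256 is our choice for the sketch)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_inventory(root, csv_path):
    """Walk root and write one CSV row per file; return the number of rows."""
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["checksum", "size_bytes", "folder", "file_name"])
        for folder, _subfolders, file_names in os.walk(root):
            for file_name in sorted(file_names):
                full_path = os.path.join(folder, file_name)
                try:
                    checksum = sha256_of(full_path)
                    size = os.path.getsize(full_path)
                except OSError:
                    # Locked or unreadable file: skip it and carry on.
                    continue
                writer.writerow([checksum, size, folder, file_name])
                rows += 1
    return rows
```

Place the CSV file outside the folder being scanned, otherwise the script will try to include its own output in the inventory.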

Tools for deduplication

"Deduplication" is the term most commonly used by software manufacturers to describe that their software can do just what we're looking for here, namely removing duplicates, and often it is simply our well-known checksum thinking that lies behind the tools.
A great many of the tools that support the IT department in daily file server maintenance actually have this type of technical duplicate recognition built in, and they are often geared to handle much more than a home-made script. Often there is also an option to act on the result, such as deleting all duplicates or converting all duplicates to shortcuts to a single instance of the file. It is not certain that this is what the information specialist wants, but you can often just extract lists and work from those instead. Check whether IT already has some file system tools / disk space manager tools that can be made available for a clean-up if there is no budget or need for a dedicated solution.
If you want a tool that can deduplicate but at the same time search for something specific in the files - for example, something in social security number format - it may be worth looking at search tools and especially eDiscovery tools. Many eDiscovery tools are very expensive and can do much more than clean-up; for example, they can place legal holds on files. But some of them have a "little brother", which might be called a file analytics tool or something like that.
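To give an impression of what searching for "something in social security number format" means in practice, here is a minimal sketch using a regular expression. The pattern (three digits, two digits and four digits separated by hyphens) is purely an assumption for the illustration and must be replaced by the format used in your own country; the function name is also ours. The sketch works on text that has already been extracted and says nothing about how a real tool reads Word or PDF files.

```python
import re

# Assumed format for the illustration: 123-45-6789. Replace as needed.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_id_like_strings(text):
    """Return all substrings of text that match the assumed number format."""
    return ID_PATTERN.findall(text)


sample = "Employee 123-45-6789 joined in May; switchboard 555-0100."
hits = find_id_like_strings(sample)
```

A match only says that something looks like such a number. Whether it really is one still has to be assessed by a person or a more specialized tool.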

Almost the same

Where it gets really smart is when a tool recognizes that the contents of two files are the same even though they are in different formats - e.g. Word and PDF. The two files will have different checksums, so tools that rely on checksums will not be able to pair them. Nor can you see from a checksum that something is almost the same, for example versions 0.9 and 1.0 of a document, or a document and a signed, scanned version of the same document. Unlike above, we cannot really point to a category of tools or a search term for tools that can do this, but they exist. We know of a single tool that can, using a kind of fingerprint, but this time of the textual content. Really impressive. However, it was quite expensive the last time we looked at it.
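We do not know how the tool mentioned works internally. One commonly used technique for comparing textual content, shown here only as an assumption-laden sketch, is to split each text into overlapping word sequences ("shingles") and measure how large a share of them two texts have in common (Jaccard similarity). The function names and the shingle size of three words are our choices, and the text is assumed to have been extracted from the files already (a scanned document would first need OCR).

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences in text, ignoring case."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=3):
    """Jaccard similarity of two texts: 0.0 is unrelated, 1.0 is identical."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 0.0  # nothing to compare
    return len(a & b) / len(a | b)


# Hypothetical example: two versions of the same clause.
version_09 = "The supplier shall deliver the goods within thirty days of the order date"
version_10 = "The supplier shall deliver the goods within sixty days of the order date"
score = similarity(version_09, version_10)
```

Unlike a checksum, which changes completely when a single word is changed, the score here stays well above zero for the two versions, and it is 1.0 for the same text regardless of which file format it was extracted from.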

It is the intention that one of the forthcoming articles should deal with artificial intelligence for classification. The above is not artificially intelligent, but still extremely smart.

These posts will, where possible, avoid highlighting specific products, but specific information can be obtained by contacting us.

If you liked this post, you are most welcome to share it.