Document Clean-up – Part 1

Cleaning up legacy document backlogs

Uncontrolled documents, often on shared drives and in mailboxes, do not deliver the value they should to the organization. At worst they do actual harm, for example when personally identifiable information is floating around a drive unlawfully. How do you get to grips with the masses?

It is nothing new that personally identifiable information should not be left floating around on shared drives and in mailboxes, but attention has probably increased in many places because personal data legislation is being renewed. The need for order in documents and data is driven by other things as well; Big Data projects, for example, are prominent at present. If you have built up a large backlog, typically on a shared drive, cleaning it up is a big task. Different types of technology can support the process. In a series of blog posts, we will look at how different technologies can ease the clean-up. We will look at some of the technologies you typically already have and how they can be used, and at some of the advanced technologies designed specifically for the purpose.

Part 1: Tips and tricks with Windows

Windows Search

When organizing documents, it always helps to be able to search them. If your documents are on a Windows system, you have at least Windows Search at your disposal. First and foremost, Windows Search can search metadata such as filenames. Windows can also search the text inside documents, known as full-text search, which is very useful. The illustration here shows Windows finding our file with the filename 'Stuff', as well as another file that contains the word 'stuff' in its text.
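
To make the difference between a filename search and a full-text search concrete, here is a minimal Python sketch. It does not use Windows Search or its index; it simply walks a folder and checks each file itself, which is far slower on a large backlog. The function name `search_folder` and the restriction to `.txt` files are assumptions made for the illustration.

```python
import tempfile
from pathlib import Path


def search_folder(root, term, text_suffixes=(".txt",)):
    """Return (filename_hits, content_hits) for a case-insensitive term."""
    needle = term.lower()
    filename_hits, content_hits = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Metadata search: does the term occur in the file name?
        if needle in path.name.lower():
            filename_hits.append(path)
        # Full-text search: only attempted for plain-text formats here.
        if path.suffix.lower() in text_suffixes:
            try:
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # unreadable file, skip it
            if needle in text.lower():
                content_hits.append(path)
    return filename_hits, content_hits


if __name__ == "__main__":
    # Build a small throwaway folder that mirrors the example in the text.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "Stuff.txt").write_text("Nothing relevant in here.", encoding="utf-8")
        Path(tmp, "Notes.txt").write_text("A list of stuff to tidy up.", encoding="utf-8")
        by_name, by_content = search_folder(tmp, "stuff")
        print("Filename hits:", [p.name for p in by_name])
        print("Content hits: ", [p.name for p in by_content])
```

In the demo folder, 'Stuff.txt' is found through its name only, while 'Notes.txt' is found through its text only, which is the same distinction the Windows Search example illustrates.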

Typically, Windows is set up so that you can run full-text searches in PowerPoint, Excel, Word and TXT files and some other formats, but the feature can be turned on for additional formats.
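
Before deciding which additional formats are worth turning on, it can help to know which file types actually dominate the backlog. The following is a rough Python sketch under that assumption; the function name `count_extensions` is made up for the illustration, and the demo runs on a throwaway folder rather than a real file share.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files per lowercase extension below root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Demo on a temporary folder; point count_extensions at a real
    # folder to get an overview of your own backlog.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("report.PDF", "scan.pdf", "notes.txt", "README"):
            Path(tmp, name).write_text("demo", encoding="utf-8")
        for ext, number in count_extensions(tmp).most_common():
            print(f"{ext:10} {number}")
```

If, for example, PDF files turn out to make up a large share of the backlog, that is a good argument for enabling full-text indexing for that format.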

On your own local PC you can turn it on through Indexing Options, as illustrated in the image below. Here we followed the instructions Windows gives in Indexing Options, downloaded a filter for full-text indexing of PDF files and turned it on.

Indexing of the files on our hard drive then begins, and once the index is complete we can search the contents of PDF files, just as illustrated above for a TXT file. However, this does require that the PDF file actually contains text. For that reason, see the section on OCR processing in Part 2 of this series.

Of course, the document backlog you have to clean up is rarely on your own PC; in many cases the masses are on a shared drive instead. It is therefore on the shared drives / file shares that the filters above must be turned on and the indexing must run. This can be done on file servers, but keep in mind that waiting for the index to be built takes some patience.

Windows Browse

When cleaning up documents manually, one of the most time-consuming tasks is opening documents to see what they really are when you cannot tell from the file name, so the aim is to avoid this as far as possible.

In some cases metadata can help, but by default only Name, Date modified, Type and Size appear in the file list. Right-clicking the column headers at the top of the file browser window brings up a menu that lets you change this. The menu is shown here; we have added Date accessed and Owner, both of which are often useful in a clean-up.
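
The same kind of metadata can also be read with a script, which is handy when a folder is too large to scroll through. Below is a minimal Python sketch using only the standard library; the function names `describe_files` and `not_accessed_since` are assumptions made for the illustration. Owner is left out on purpose, because reading it on Windows needs more than the portable standard-library calls used here, and the access date should be treated with some caution, since whether it is kept up to date depends on how the file system is configured.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path


def describe_files(root):
    """Yield one dict of basic metadata per file below root."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        info = path.stat()
        yield {
            "name": path.name,
            "type": path.suffix.lower(),
            "size_bytes": info.st_size,
            "modified": datetime.fromtimestamp(info.st_mtime),
            "accessed": datetime.fromtimestamp(info.st_atime),
        }


def not_accessed_since(root, cutoff):
    """Return names of files whose last access predates cutoff."""
    return [row["name"] for row in describe_files(root) if row["accessed"] < cutoff]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old_file = Path(tmp, "old.txt")
        old_file.write_text("x", encoding="utf-8")
        Path(tmp, "new.txt").write_text("yy", encoding="utf-8")
        # Backdate the access and modification times of one file for the demo.
        stamp = datetime(2010, 1, 1).timestamp()
        os.utime(old_file, (stamp, stamp))
        print(not_accessed_since(tmp, datetime(2015, 1, 1)))
```

A list of files that nobody has opened for years is often a reasonable starting point when deciding what to review first.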


It is a small thing, but once you know it, it makes life easier. The following is equally basic but useful to know: Preview.

When scrolling through files, Windows can preview a document (for some formats) without you having to open it. You just need to know how to enable it, as follows:

This can sometimes be extremely slow, and some formats simply do not work for preview in Windows. It is therefore often worthwhile to convert the documents. That aspect is covered in Part 2 of this series.

If you liked this post, you are most welcome to share it.