Document Clean-up – Part 2

Cleaning up legacy document backlogs

Uncontrolled documents, often on shared drives and in mailboxes, do not deliver the value they should to the organization. At worst they cause harm, for example when personally identifiable information is left lying around on a drive in violation of the law. How do you get the masses of documents under control?

It is nothing new that personally identifiable information should not be left lying around on shared drives and in mailboxes, but the fact that personal data legislation is being renewed has probably raised attention in many places. The need for order in documents and data is also driven by other things; Big Data projects, for example, are prominent at the moment. If you have built up a big backlog, typically on a shared drive, cleaning it up is a big task. Different types of technology can support the process. In a series of blog posts, we will look at how different technologies can be used to ease the clean-up. We will look at some of the technologies you typically already have and how they can be used, and at some of the advanced technologies that are specifically designed for the purpose.

Part 2 - Document conversion and OCR


In Part 1 of this series, we argued that it pays to be able to search documents when cleaning up. But what do you do with a file that is not searchable? And what was the story with those PDF files: are they searchable or not?

With regard to PDF files, the answer is "it depends". A PDF file may contain searchable text, or it may not. If it does not, it can be run through OCR in the same way as other file formats, as described below. If you do not know whether a PDF file is searchable, an easy way to find out is simply to see whether you can select text in it. If you can, it is searchable.
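For a large backlog, the "can I select text?" test can be approximated in a script. The following is a minimal sketch, not a definitive implementation: it assumes the third-party pypdf library is installed, and the function names and the 20-character threshold are our own illustrative choices. A PDF that yields almost no extractable text is treated as a candidate for OCR.

```python
from pathlib import Path


def has_text_layer(page_texts, min_chars=20):
    """Return True if the extracted page texts look like real content.

    min_chars is an assumed threshold: a few stray characters
    do not make a scanned document searchable in practice.
    """
    total = sum(len(text.strip()) for text in page_texts if text)
    return total >= min_chars


def pdf_is_searchable(pdf_path):
    """Check one PDF file for a text layer (requires pypdf)."""
    # Imported here so has_text_layer() works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() returns the text found on a page, possibly empty.
    return has_text_layer(page.extract_text() or "" for page in reader.pages)


if __name__ == "__main__":
    # Hypothetical folder name; replace with a folder on your own drive.
    for pdf in sorted(Path("shared_drive").rglob("*.pdf")):
        status = "searchable" if pdf_is_searchable(pdf) else "needs OCR"
        print(f"{pdf}: {status}")
```

Note that a PDF can have a text layer on some pages and not on others, so a per-page check may be more useful for mixed documents.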

If you scan a document, you first get an image file created by the scanner. Software, which is often part of the package you get with the scanner, can then search the image for letters, recognize them and build a text layer in the file so that it becomes searchable. The process is called Optical Character Recognition (OCR), and the technology has been available for a very long time. Today's OCR tools can handle remarkable things with very high precision, such as recognizing handwriting, recognizing words written both vertically and horizontally on the same page, and recognizing words in poor-quality scans.
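To make the step from image file to searchable file concrete, here is a minimal sketch using the open-source Tesseract engine through the pytesseract wrapper. This is one possible tool, not the software that ships with any particular scanner. It assumes Tesseract, pytesseract and Pillow are installed; the function names and the default language code are illustrative.

```python
from pathlib import Path


def searchable_pdf_path(image_path):
    """Name of the searchable PDF written next to the scanned image."""
    return Path(image_path).with_suffix(".pdf")


def ocr_image_to_pdf(image_path, language="eng"):
    """Run OCR on a scanned image and save a PDF with a text layer.

    Assumes the Tesseract engine and the given language pack are installed.
    """
    # Imported here so the module loads without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        # Returns the bytes of a PDF containing the image plus recognized text.
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(
            image, extension="pdf", lang=language
        )
    target = searchable_pdf_path(image_path)
    target.write_bytes(pdf_bytes)
    return target
```

Calling ocr_image_to_pdf("scans/invoice.tif") would, under these assumptions, write scans/invoice.pdf, in which text can be selected and searched.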

If you already have a large, expensive multifunction machine (copy / scan / print), you almost certainly have such software at your disposal. It may take a call to the supplier of the machine to get it set up so that you can also run files from your shared drive through OCR processing. The last time a customer of ours wanted this, it cost less than 1,000 euros to get it running. However, OCR software varies greatly in quality, and the software embedded in the copier may not meet your needs. In that case there are both online services and very good solutions for local use. The solutions we use in most places are really document conversion solutions in which OCR is embedded.
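Before calling the supplier or choosing a tool, it helps to know how many files on the shared drive would need OCR at all. The sketch below uses only the Python standard library to list image files (which never have a text layer) and PDF files (which have to be checked individually). The list of image extensions and the folder name are assumptions; adjust them to what your scanners actually produce.

```python
from pathlib import Path

# Assumed set of typical scanner output formats.
IMAGE_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp"}


def find_ocr_candidates(root):
    """Walk a folder tree and return (image_files, pdf_files).

    Image files always need OCR to become searchable;
    PDF files need a separate check for a text layer.
    """
    images, pdfs = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()  # treat .TIF and .tif alike
        if suffix in IMAGE_SUFFIXES:
            images.append(path)
        elif suffix == ".pdf":
            pdfs.append(path)
    return images, pdfs


if __name__ == "__main__":
    # Hypothetical folder name; replace with your own shared drive path.
    images, pdfs = find_ocr_candidates("shared_drive")
    print(f"{len(images)} image files need OCR")
    print(f"{len(pdfs)} PDF files need a text-layer check")
```

The counts give a rough basis for judging whether the software embedded in the copier is enough or whether a dedicated solution is worth considering.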


Document conversion

If documents are not searchable, it is a good idea to convert them into a searchable format. But what else do you gain from converting documents as part of the clean-up?

If you clean up manually, you typically click through the files, and (as described in Part 1 of this article series) you can preview the first page of each document in the Windows preview pane. For some file types this works well and quickly; for others it does not work at all. If all documents are converted to PDF, previewing works well and uniformly, and if you need to open a file to see more, it is always the same application that is needed (for example, Adobe Reader), so it can be left open. It saves a lot of time not having to open Word for one file, Excel for another, and so on.

But most importantly, what you clean up, and later leave in your archive, ends up in a format that can actually be read. We have repeatedly seen customers make a huge archiving effort, only to find 15 years later that files can no longer be opened, either because the software needed to read them is no longer available or because they sit on media that can no longer be read. For some, a good solution is to convert everything to PDF, possibly PDF/A. For others, a good solution is to continually "upconvert" the documents; for example, the conversion engine converts old Word files one version up each time the company changes its Word version.
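The two strategies can be expressed as a simple planning step that decides, per file, what the converted file should be called. The sketch below does not convert anything; the actual conversion is left to whatever conversion engine you use. The function name, the strategy names and the table of legacy formats are illustrative assumptions.

```python
from pathlib import PurePath

# Assumed mapping from older Office formats to their current successors.
LEGACY_TO_CURRENT = {".doc": ".docx", ".xls": ".xlsx", ".ppt": ".pptx"}


def conversion_target(filename, strategy="pdf"):
    """Return the target file name for a conversion, or None if none is needed.

    strategy "pdf":       everything becomes a PDF (or PDF/A) file.
    strategy "upconvert": legacy formats move to the current format,
                          files already in a current format are left alone.
    """
    path = PurePath(filename)
    suffix = path.suffix.lower()
    if strategy == "pdf":
        return None if suffix == ".pdf" else str(path.with_suffix(".pdf"))
    if strategy == "upconvert":
        new_suffix = LEGACY_TO_CURRENT.get(suffix)
        return str(path.with_suffix(new_suffix)) if new_suffix else None
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ["budget.xls", "memo.docx", "minutes.DOC"]:
        print(name, "->", conversion_target(name, "upconvert"))
```

A plan like this also gives a quick count of how many files are in formats that risk becoming unreadable, which is useful when arguing for the conversion effort.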

So our point is twofold: 1) it is easier to clean up manually if the documents are converted to a uniform format, and 2) converting documents has long-term value, so the clean-up is not the only reason we suggest considering document conversion.

