We at Strator have made it a specialty to convert documents, for example for delivery to the Danish National Archives. Converting a document is just opening the file and printing it - not on paper, but to the desired file format. This is familiar from, for example, Word, where choosing Save As PDF converts the document to PDF. So why would this require experienced people like us? The devil is in the detail, and that is a significant part of the explanation. Another part of the explanation is sheer volume: there are typically hundreds of thousands or millions of files involved. What follows is an insight into some of the details that will surprise you if you have not tried conversion before.
Formats - TIFF is not just TIFF
A document may contain only black and white text, but it may just as well be a colorful document with detailed photographs. The archives want the documents handed over so that they reproduce the originals well enough, but at the same time do not take up more space than necessary. It requires balance. Therefore, a black and white text document must be delivered as a 1-bit TIFF file, while a colorful document must be handed over as a 24-bit TIFF file. How colorful a document is, is expressed by what is called its bit depth, so the bit depth of each document has to be determined in order to select the appropriate format.
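As a rough illustration of that choice, here is a minimal sketch in Python. It assumes a page has already been rendered to a list of RGB pixels, and the function name `choose_tiff_bit_depth` is our own invention for this example. Real scans contain noise and soft edges, so a production tool would need tolerances rather than the exact comparison used here.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255

BLACK: Pixel = (0, 0, 0)
WHITE: Pixel = (255, 255, 255)


def choose_tiff_bit_depth(pixels: Iterable[Pixel]) -> int:
    """Return 1 if every pixel is pure black or white, otherwise 24.

    Hypothetical helper: only the two bit depths mentioned in the text
    are considered, and no tolerance for near-black or near-white is applied.
    """
    for pixel in pixels:
        if pixel != BLACK and pixel != WHITE:
            return 24  # at least one gray or colored pixel: keep full color
    return 1  # pure black and white: 1 bit per pixel is enough


text_page = [BLACK, WHITE, WHITE, BLACK]
photo_page = [BLACK, (200, 30, 30), WHITE]
print(choose_tiff_bit_depth(text_page))   # 1
print(choose_tiff_bit_depth(photo_page))  # 24
```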
It may happen that TIFF is not the optimal format. Audio files, videos and technical drawings clearly need something else, and very detailed images must also be reproduced differently if the detail is significant. For these cases, the National Archives has specified what to do.
In addition to being converted to the delivery format, the files must be compressed so that they take up as little space as possible - without losing information. Again you have to find the right balance and choose the right compression.
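What "without losing information" means can be shown with a small round-trip check. The sketch below uses Deflate from Python's standard `zlib` module purely as a stand-in for whichever lossless scheme the delivery format prescribes. The helper name `compress_lossless` is hypothetical.

```python
import zlib


def compress_lossless(data: bytes) -> bytes:
    """Compress data and verify that decompressing restores it exactly."""
    compressed = zlib.compress(data, 9)  # 9 = strongest compression level
    if zlib.decompress(compressed) != data:
        # Should never happen with a lossless codec; kept as a safety net.
        raise ValueError("round trip changed the data")
    return compressed


# A mostly white 1-bit page is highly repetitive and shrinks dramatically.
page = b"\x00" * 10_000
packed = compress_lossless(page)
print(len(page), "->", len(packed), "bytes")
```

The point of the check is the comparison after decompression: a lossy method could produce a smaller file, but it would fail this test.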
When a document is converted, it is a requirement that no data is lost in the process. This does not mean the risk of forgetting page 4, because of course that does not happen. It is a much more subtle kind of data loss you need to be alert to. A PDF document may have comments in the form of yellow sticky notes, which can cover some of the text so that after conversion it is not possible to see what is under the yellow note.
A spreadsheet can contain content in a cell that is cut off because it exceeds the column width. In a presentation, some slides may be hidden, and information may be stored in the notes. For that matter, there may also be hidden text in a document, hidden columns in a spreadsheet or notes in the form of review comments. This kind of content must be discovered, otherwise the delivery is incomplete.
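To give an idea of how such hidden content can be found, here is a minimal sketch for .xlsx files. It relies on the fact that an .xlsx file is a ZIP archive of XML parts, where a hidden sheet carries a `state` attribute in `xl/workbook.xml` and a hidden column carries `hidden="1"` in the worksheet part. The function name `find_hidden_parts` is hypothetical, and the sketch covers only these two cases, not hidden rows, comments or older .xls files.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import BinaryIO, List, Union

# Namespace used by SpreadsheetML elements inside an .xlsx package.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def find_hidden_parts(xlsx: Union[str, BinaryIO]) -> List[str]:
    """Report hidden sheets and hidden columns in an .xlsx file."""
    findings: List[str] = []
    with zipfile.ZipFile(xlsx) as archive:
        # Sheets are listed in the workbook part; "state" defaults to visible.
        workbook = ET.fromstring(archive.read("xl/workbook.xml"))
        for sheet in workbook.iter(NS + "sheet"):
            state = sheet.get("state", "visible")
            if state != "visible":
                findings.append(f"sheet '{sheet.get('name')}' is {state}")
        # Column definitions live in each worksheet part.
        for name in archive.namelist():
            if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                root = ET.fromstring(archive.read(name))
                for col in root.iter(NS + "col"):
                    if col.get("hidden") in ("1", "true"):
                        findings.append(
                            f"{name}: columns {col.get('min')}-{col.get('max')} are hidden"
                        )
    return findings
```

Calling `find_hidden_parts("budget.xlsx")` (the file name is just an example) returns a list of human-readable findings that can be logged before conversion, so that the hidden parts can be made visible first.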
Emails are known to have attachments, but common office documents can also have embedded objects. Of course, these objects must also be unpacked and converted, as they too contain information relevant to the context. Even when we think we have soon seen everything, something still comes up that we have not encountered before, and then we must add another method to our collection of solutions.
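For the newer ZIP-based office formats (.docx, .xlsx, .pptx), embedded objects are stored in an `embeddings` folder inside the package, so a first inventory can be made with a few lines of Python. This is only a sketch with a hypothetical function name. It does not unpack the objects, and it does not handle the older binary formats or email attachments.

```python
import zipfile
from typing import BinaryIO, List, Union


def list_embedded_objects(office_file: Union[str, BinaryIO]) -> List[str]:
    """List the package parts stored under an 'embeddings' folder."""
    with zipfile.ZipFile(office_file) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # e.g. word/embeddings/..., xl/embeddings/..., ppt/embeddings/...
            if "/embeddings/" in name and not name.endswith("/")
        )
```

Each name returned is a file in its own right and has to go through the same conversion process as the document it was found in.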
The big challenge - The Spreadsheet
The spreadsheet is and remains the clear winner when it comes to producing new complex issues. Again, conversion is really just printing a document. How often does a printed spreadsheet come out the way you want it? It may be that a page break cuts something off, or that there are 200 pages with a single column on each, which does not provide the best starting point for reading the document - readability is simply too poor.
In another file, the print is zoomed out so that a huge spreadsheet is squeezed onto a single page - again unreadable. A last example - we could continue - is a formula that has been copied across a whole row: although the spreadsheet is actually only used in the first 10 columns, there is data in all columns, which is therefore printed unless you actively prevent it. For reference, a spreadsheet can have 16,384 columns and 1,048,576 rows - it can produce a very large print.
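A little arithmetic shows how quickly this gets out of hand. The sketch below estimates the number of printed pages from the used range of a sheet. The figures of 50 rows and 8 columns per page are assumptions chosen for illustration only, since the real numbers depend on row heights, column widths and paper size.

```python
import math

MAX_COLUMNS = 16_384
MAX_ROWS = 1_048_576


def estimate_pages(used_rows: int, used_cols: int,
                   rows_per_page: int = 50, cols_per_page: int = 8) -> int:
    """Estimate printed pages for a used range (page capacity is assumed)."""
    pages_down = math.ceil(used_rows / rows_per_page)
    pages_across = math.ceil(used_cols / cols_per_page)
    return pages_down * pages_across


print(estimate_pages(40, 10))                # 2: the sheet as it is really used
print(estimate_pages(40, MAX_COLUMNS))       # 2048: one formula row copied all the way across
print(estimate_pages(MAX_ROWS, MAX_COLUMNS)) # 42950656: a completely filled sheet
```

An estimate like this can be used to flag suspicious sheets before conversion, so that someone can decide on a sensible print area instead of letting thousands of nearly empty pages through.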
In batches of the size discussed here, there will be corrupt files, and there will be files where the extension (.PDF, .doc, .xlsx, etc.) does not match the actual content. There will be password-protected files that cannot be opened by anyone other than the password owner, and there may even be virus-infected files. There may also be files in formats that were discontinued many years ago (e.g. WordPerfect 4.2).
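A mismatch between extension and content can often be caught by looking at the first bytes of the file, the so-called signature. The sketch below knows only a handful of well-known signatures, and the names `sniff` and `extension_matches` are our own for this example. A real tool would need a far larger table and would inspect ZIP-based files more closely.

```python
import os
from typing import Optional

# Well-known leading bytes for a few common container types.
SIGNATURES = {
    b"%PDF": "pdf",
    b"PK\x03\x04": "zip",                         # docx, xlsx, pptx
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",  # doc, xls, ppt
    b"II*\x00": "tiff",                           # little-endian TIFF
    b"MM\x00*": "tiff",                           # big-endian TIFF
}

# Which container type each extension is expected to have.
EXPECTED = {
    ".pdf": "pdf",
    ".docx": "zip", ".xlsx": "zip", ".pptx": "zip",
    ".doc": "ole2", ".xls": "ole2", ".ppt": "ole2",
    ".tif": "tiff", ".tiff": "tiff",
}


def sniff(header: bytes) -> Optional[str]:
    """Return the container type suggested by the first bytes, if known."""
    for signature, kind in SIGNATURES.items():
        if header.startswith(signature):
            return kind
    return None


def extension_matches(filename: str, header: bytes) -> Optional[bool]:
    """True or False if the extension is known, None if we cannot judge."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in EXPECTED:
        return None
    return sniff(header) == EXPECTED[extension]


print(extension_matches("report.PDF", b"%PDF-1.7\n"))  # True
print(extension_matches("budget.xlsx", b"%PDF-1.4\n")) # False
```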
Over time, a great deal can be done about many of these unforeseen challenges, but in the end a corrupt file is unmanageable, and we do not hack into a company's password-protected spreadsheet from the finance department to get it converted. Therefore, it is essential to have a process that ensures that whatever fails is collected and handled.
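The core of such a process can be sketched in a few lines: every file is attempted, nothing stops the batch, and every failure is recorded with a reason so it can be followed up. The `convert` argument is a placeholder for whatever conversion tool is in use. This is an illustration of the principle, not a description of our production setup.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def convert_all(paths: Iterable[str],
                convert: Callable[[str], None]) -> Tuple[List[str], Dict[str, str]]:
    """Convert every file and collect failures instead of stopping."""
    converted: List[str] = []
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            convert(path)
        except Exception as error:
            # Record why the file failed so it can be handled afterwards.
            failures[path] = f"{type(error).__name__}: {error}"
        else:
            converted.append(path)
    return converted, failures


def fake_convert(path: str) -> None:
    """Stand-in converter used only for this example."""
    if path.startswith("broken"):
        raise ValueError("corrupt file")


done, failed = convert_all(["a.doc", "broken.xls", "b.pdf"], fake_convert)
print(done)    # ['a.doc', 'b.pdf']
print(failed)  # {'broken.xls': 'ValueError: corrupt file'}
```

The important design choice is that the list of failures is a result in its own right: it is what gets reviewed with the owner of the files once the batch has run.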
Conversion requires experience
Converting documents is basically a relatively simple process. There are just a lot of practical issues combined with a huge number of files. We have several proven commercial conversion tools that we choose from depending on the situation, evaluating which one will work best as the engine for a specific conversion task. Our tools have been set up to meet the requirements of the National Archives as far as possible. In addition, we have built up extensive experience - both with the tools and with the practical problems in the files. We have as many solutions and tricks as the conversion issues we have encountered.