The US Department of Justice today quietly uploaded a new version of the Mueller report to its website.
You wouldn't notice a difference by looking at the report, which can be downloaded from the department's website. What's new is a layer of data that finally provides access to the underlying text of the document.
When the Mueller report was released on April 18, it was essentially one huge image file. (At 140 MB, the file was more than 300 times larger than an e-book of Crime and Punishment.) The Department of Justice appears to have scanned a paper copy of the report (using a Ricoh MP C6502 Color Laser Multifunction Printer, for what it's worth).
The decision immediately drew groans from people trying to search the report for juicy details. A huge image file has no text to search. It was also condemned by the group involved in defining the technical specifications for the Portable Document Format: "This deliberate and unnecessary act made the document forever more difficult for anyone and everyone," wrote Duff Johnson, managing director of the PDF Association, in a delightful overview of the file's nerdiest details.
News organizations and Mueller fanatics quickly solved this problem by running the PDF through a process known as optical character recognition (OCR) to add searchable text to the document. To recap: the Mueller report was written on a computer, then printed on paper, scanned back into digital images, and finally converted into text using software.
As you might imagine, this round trip had some shortcomings. Redactions in the report confused most OCR software and left some of the text unreadable, Johnson pointed out in a follow-up post.
The Justice Department's image-only PDF also appeared to violate the US government's own guidelines for making documents available to all readers. If PDFs are not equipped with a layer of text and other metadata, "People with disabilities using assistive technologies such as screen readers or voice output tools may find it difficult or impossible to access critical information," said the agency responsible for such things.
The website for special counsel Robert Mueller's investigation acknowledged this shortcoming ("The department acknowledges that these documents may not yet be available in an accessible format") and offered to send a text file of the report to people who would have trouble reading it. It is not clear whether anyone requested or received such a file. In any case, the offer was removed from the special counsel's website today when the new version of the PDF was uploaded.
Johnson was informed today about the updated PDF and was both impressed and disappointed.
"They really tried to do a good job here trying to make it accessible," he said in a telephone interview. "Unfortunately, there are still a lot of mistakes in the tags."
The software the Department of Justice used to OCR the report, Adobe Acrobat, produced some garbled or incomplete text in the new PDF, especially around large redactions and photos. And many of the invisible tags designed to make the document more accessible have been misapplied.
The file still contains the 140 MB collection of images, though a layer of text now sits beneath it. A PDF generated directly from the report's original digital text, and never scanned, would likely be less than 5 MB.
"They made a serious effort to improve the report," said Johnson, who worked in politics in the 1990s before entering the nerdy world of document formats. He was justifiably proud that his analysis of the earlier PDF had received such attention and had apparently prompted the Justice Department to try again, and he promised to write another post analyzing the new effort.
But one of the most important PDFs in American history still has some secrets.
"It remains a scanned file," Johnson noted, pulling the document up on his computer. "Why is it scanned at all? One possibility is that it was received from Mueller on paper. That's weird if it's true. Why wouldn't Mueller send a digital file? Why wouldn't DoJ say, 'Excuse me, can you send a PDF file?'"