I was struck by your tweet during the Legislative Data Transparency Conference from a couple of weeks ago: “If the gods reached down and banished all PDFs from the face of the earth, I’d be fine with that. #ldtc”
Maybe I don’t understand your tweet. If you believe legislative docs shouldn’t be rendered in PDF, how should they be rendered? For legislation, PDF serves a critical function – it is a publicly available rendition of the legal paper document. The concerns I have with the idea of sole use of non-PDF formats (i.e., XML) are: (1) authenticity, (2) missing page and line numbering for navigation and amendment language, (3) access time to load XML+XSL over the web for large files, and (4) dropped or added text due to XSLT errors and updates. I’d prefer that you ask the gods to include the source documents in all government-created PDF files – regardless of source file format.
What’s your beef with PDF?
This came from a guy who has worked with legislative data much longer than I have, knows Congressional documents cold, and generally has an excellent idea of what he is doing. I have a lot of respect for his judgment. I have three practical responses, one impractical one, and a conclusion that actually agrees with his, though somewhat reluctantly:
1) The idea that authenticity is tied to a particular rendering format is simply wrong, though the use of PDF for this purpose is comforting for those who crave a format that is recognizably print-like. That’s been talked about in many places, notably in recommendations to the House of Representatives Bulk Data Task Force a couple of years back.
2) Missing page and line numbering is indeed a problem during a limited, strongly bounded portion of the legislative process. I believe it has been solved elsewhere using XML (at least if my notes from the UN ICT and Parliaments meeting on open document standards, held at the House of Representatives in 2012, are correct, though they may not be). It is certainly possible to represent page and line numbers in XML, though it is awkward, difficult, and perhaps impossible to round-trip XML to and from PDF if the PDF is altered.
But wait, why? Get rid of page and line numbering, because the new media don’t use pages. Judicial opinions are moving, however slowly, to paragraph numbering for purposes of citation, although paragraphs don’t supply the granularity needed for amendment, unless you make all of your amending processes use the paragraph as their minimum chunk. Better still would be to use a point-in-time drafting and publication system for the full lifecycle of the legislative drafting, passage, codification, and publication process. The Australian state of Tasmania has been doing this since 1997. Of course, this is the impractical response I mentioned. Congress is unlikely to change anytime soon (and I have been one of the loudest in saying that application developers are being unreasonable when they expect it to).
3) Access-time issues are a red herring. The data may bulk larger than PDF (I would want to see that tested over a wide range of samples before I’d buy it completely), but for viewing purposes you’d almost certainly pre-render it as HTML.
4) Problems with dropped or added text owing specifically to XSLT transformations are either a red herring or a truly frightening commentary. Anything wrought by human hands, especially when the power of the human hands in question is augmented by a computer, can be done badly. If we banned all technologies that have the power to alter or drop text from electronic publishing systems, we would have no publishing systems. If the argument is that XSLT is more likely to produce bad results than other technologies, I sort of agree, while still finding the risks acceptable. XSLT feels closer to a declarative programming language like Prolog (remember Prolog?) than anything else. It’s hard for procedural programmers to master, and yes, it can make a big mess. But there are lots of people who are quite comfortable using it in all sorts of bet-the-farm business operations, and there’s no reason to think it would do any worse in government.
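To make response 2 a little more concrete, here is a minimal sketch of one way page and line numbers could ride along in XML: as empty "milestone" elements that mark where each printed page and line begins. The element names (`bill`, `section`, `page`, `line`) and the sample text are invented for illustration and are not taken from any actual legislative schema. The point is only that an amendment instruction of the "page 2, line 2" kind can be resolved against such markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <page/> and <line/> are empty milestone elements,
# and the text of a printed line is whatever follows its <line/> marker.
SAMPLE = """<bill>
  <section id="s1">
    <page n="2"/>
    <line n="1"/>The Secretary shall submit
    <line n="2"/>a report to Congress not later
    <line n="3"/>than 90 days after enactment.
  </section>
</bill>"""


def lines_by_reference(xml_text):
    """Return a dict mapping (page, line) to the text of that printed line."""
    root = ET.fromstring(xml_text)
    current_page = None
    result = {}
    # iter() walks the tree in document order, so page markers are seen
    # before the line markers that follow them.
    for elem in root.iter():
        if elem.tag == "page":
            current_page = elem.get("n")
        elif elem.tag == "line":
            # ElementTree stores text after an empty element in .tail
            result[(current_page, elem.get("n"))] = (elem.tail or "").strip()
    return result


refs = lines_by_reference(SAMPLE)
print(refs[("2", "2")])
```

This says nothing about the harder problem of keeping those markers in sync with a PDF that someone has edited by hand, which is where the round-tripping trouble comes in.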
But that’s not really what’s griping me, or him, come to think of it. He asks what my beef is with PDF. Here it is:
This is most certainly not a Congressional document, but it is a perfect example of what is wrong with PDF when it is put in risk-averse hands. It plainly got its start in life as a spreadsheet. And someone thought that publishing it as a spreadsheet, in a way that allowed parsing and repurposing of the data, would be somehow dangerous, so hey presto, PDF. More likely, they didn’t give it a moment’s thought; PDF is just how you do things. In this particular case, it’s all the more painful because the responsible party actually put data in the spreadsheet that would allow you to very usefully link its contents to the relevant parts of the CFR. So close, and yet so far.
So I guess I have two beefs with PDF, not just one:
a) It can’t be easily parsed or otherwise processed by computers. It locks up text in a way that prevents anyone but a human reader from doing anything with it (yes, I am aware that authenticity-scolds see this as a feature). That is not news.
b) It presents an almost irresistible temptation for the risk-averse, who see it as a safe, comforting format that is beautifully like their beloved print products and, best of all, prevents recompilation of data. The ability to repurpose data implies the ability to reconsider it in ways that might lead someone to question the conclusions drawn from it, or otherwise do something you might not like, and that is a very scary possibility.
My correspondent is absolutely right that the solution to a) is to publish in parallel formats — PDF for human readers, original manipulable electronic format for those who need to process, mark up, or otherwise work with the text or data in fluid ways. I wholeheartedly agree, and hereby amend my request to the gods. Fortunately, the gods do not use PDF for their Official Request Forms, so I am able to do so in place with no need to reissue it.
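As a rough illustration of what the manipulable half of a dual-format release buys you, suppose a table like the one I complained about above had also been published as CSV. The sketch below is built entirely on assumptions: the column names (`item`, `description`, `cfr_citation`) and the sample rows are invented, and the citation pattern only handles the simple "title CFR part.section" form. Even so, it shows how little work it takes to turn citation text into structured references that could be linked to the CFR.

```python
import csv
import io
import re

# Invented sample rows; the column names are assumptions for illustration.
SAMPLE_CSV = """item,description,cfr_citation
1,Recordkeeping requirement,40 CFR 60.7
2,Labeling requirement,21 CFR 101.9
3,No citation supplied,
"""

# Matches only the simple form "<title> CFR <part>.<section>".
CFR_PATTERN = re.compile(r"^\s*(\d+)\s+CFR\s+(\d+)\.(\d+)\s*$")


def parse_cfr(citation):
    """Return (title, part, section) as strings, or None if not parseable."""
    match = CFR_PATTERN.match(citation or "")
    return match.groups() if match else None


def rows_with_citations(csv_text):
    """Pair each row's item number with its parsed CFR reference."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["item"], parse_cfr(row["cfr_citation"])) for row in reader]


linked = rows_with_citations(SAMPLE_CSV)
for item, cite in linked:
    print(item, cite)
```

Doing the same thing starting from the PDF would first require recovering the table structure from page layout, which is exactly the work that publishing the source format makes unnecessary.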
But I wonder how likely we are to see a solution to a) given that we have no solution to b), and apparently no hope of one. I would like to believe that a few solid projects and demonstrations would bring people to the point where they would make proper use of PDF, because they had somehow seen enough functionality in non-PDF environments to change their minds about the risks.
I really have no idea whether that is possible without banning PDF altogether. It is an attractive format for those who find print comforting, and for those who worry that somehow allowing any use of data will result in some form of misuse. It is, in fact, too attractive. We have nearly 20 years’ worth of demonstrations that publishing data in formats that allow repurposing is a really, really good idea that saves money and promotes innovation, and there has been no response. That leaves me unsure whether we can get everyone on board with that idea without making it impossible for them to do anything else, because we have seen far too little of the change that would validate a just-build-it-and-they-will-come strategy.
So, sure, I agree. Dual-format release is the way to go, I guess, at least in the areas where PDF can justify its existence. But I would prefer that the burden of proof fall on PDF, not on open data formats. Peter Drucker once famously recommended that a large company begin cleaning up its management practices by banning all reports and only allowing individual reports to return if the author could provide compelling written justification for them. That might not be a bad way to go.