{"id":154,"date":"2013-06-14T06:57:52","date_gmt":"2013-06-14T11:57:52","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/tbruce\/?p=154"},"modified":"2013-06-14T09:13:33","modified_gmt":"2013-06-14T14:13:33","slug":"pdf-re-sewing-the-blanket","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/tbruce\/2013\/06\/14\/pdf-re-sewing-the-blanket\/","title":{"rendered":"PDF: re-stitching the blanket"},"content":{"rendered":"

A few weeks ago I made a characteristically intemperate remark via Twitter, which drew this response from a friend of mine:<\/p>\n

I was struck by your tweet during the Legislative Data Transparency Conference from a couple of weeks ago: \u00a0\u201cIf the gods reached down and banished all PDFs from the face of the earth, I\u2019d be fine with that. #ldtc\u201d<\/p>\n

\u00a0Maybe I don\u2019t understand your tweet. \u00a0If you believe legislative docs shouldn\u2019t be rendered in PDF, how should they be rendered? \u00a0For legislation, PDF serves a critical function \u2013 it is a publicly available rendition of the legal paper document. \u00a0The concerns I have with the idea of sole use of non-PDF formats (i.e., XML) are: (1) authenticity, (2) missing page and line numbering for navigation and amendment language, (3) access time to load XML+XSL over the web for large files, and (4) dropped or added text due to XSLT errors and updates. \u00a0I\u2019d prefer that you ask the gods to include the source documents in all government-created PDF files \u2013 regardless of source file format.<\/p>\n

\u00a0What\u2019s your beef with PDF?<\/p>\n<\/blockquote>\n

This came from a guy who has worked with legislative data much longer than I have, knows Congressional documents cold, and generally has an excellent idea of what he is doing. \u00a0I have a lot of respect for his judgment. \u00a0I have three practical responses, one impractical one, and a conclusion that actually agrees with his, though somewhat reluctantly:<\/p>\n

1) The idea that authenticity is tied to a particular rendering format is simply wrong, though the use of PDF for this purpose is comforting for those who crave a format that is recognizably print-like. \u00a0That\u2019s been talked about in many places, notably in recommendations to the House of Representatives Bulk Data Task Force<\/a> a couple of years back.<\/p>\n

2) Missing page and line numbering is indeed a problem during a limited, strongly bounded portion of the legislative process. I believe it has been solved elsewhere using XML (at least if my notes from the UN ICT and Parliaments meeting on open document standards, held at the House of Representatives in 2012<\/a>, are correct, though they may not be). \u00a0It is certainly possible to represent page and line numbers in XML, though round-tripping between XML and PDF is awkward, difficult, and maybe impossible if the PDF is altered.<\/p>\n

\u00a0But wait, why? Get rid of page and line numbering, because the new media don\u2019t use pages. \u00a0Judicial opinions are moving, however slowly, to paragraph numbering for purposes of citation, although those don\u2019t supply the granularity needed for amendment — unless you make all of your amending processes use the paragraph as their minimum chunk. \u00a0Better still would be to use a point-in-time drafting and publication system for the full lifecycle of the legislative drafting, passage, codification, and publication process. \u00a0The Australian state of Tasmania has been doing this since 1997<\/a>. \u00a0Of course, this is the impractical response I mentioned. Congress is unlikely to change anytime soon (and I have been one of the loudest in saying that application developers are being unreasonable when they expect it to).<\/p>\n

3) Access-time issues are a red herring. \u00a0The data may bulk larger than PDF (I would want to see that tested over a wide range of samples before I’d buy it completely), but for viewing purposes you\u2019d almost certainly pre-render it as HTML.<\/p>\n

4) Problems with dropped or added text owed specifically to XSLT transformations are either a red herring or a truly frightening commentary. \u00a0Anything wrought by human hands, especially when the power of the human hands in question is augmented by a computer, can be done badly. \u00a0If we banned all technologies that have the power to alter or drop text from electronic publishing systems, we would have no publishing systems. \u00a0If the argument is that XSLT is more likely to produce bad results than other technologies, I sort of agree, while still finding the risks acceptable. XSLT feels closer to a declarative programming language like Prolog (remember Prolog?) than anything else, it\u2019s hard for procedural programmers to master, and yes, it can make a big mess. \u00a0But there are lots of people who are quite comfortable using it in all sorts of bet-the-farm business operations, and there\u2019s no reason to think it would do any worse in government.<\/p>\n

But that\u2019s not really what\u2019s griping me, or him, come to think of it. \u00a0He asks what my beef is with PDF. \u00a0Here it is:<\/p>\n

http:\/\/www.aphis.usda.gov\/animal_welfare\/downloads\/violations\/2007violations.pdf<\/a><\/p>\n

This is most certainly not a Congressional document, but it is a perfect example of what is wrong with PDF when it is put in risk-averse hands. \u00a0It plainly got its start in life as a spreadsheet. \u00a0And someone thought that publishing it as a spreadsheet, in a way that allowed parsing and repurposing of the data, would be somehow dangerous, so hey presto, PDF. \u00a0More likely, they didn’t give it a moment’s thought; PDF is just how you do things. \u00a0In this particular case, it\u2019s all the more painful because the responsible party actually put data in the spreadsheet that would allow you to very usefully link its contents to the relevant parts of the CFR. \u00a0So close, and yet so far<\/a>.<\/p>\n

So I guess I have two beefs with PDF, not just one:<\/p>\n

a) It can\u2019t be easily parsed or otherwise processed by computers. \u00a0It locks up text in a way that prevents anyone but a human reader from doing anything with it (yes, I am aware that authenticity-scolds see this as a feature). \u00a0That is not news.<\/p>\n

b) It presents an almost irresistible temptation for the risk-averse, who see it as a safe, comforting format that is beautifully like their beloved print products and, best of all, prevents recompilation of data. \u00a0The ability to repurpose data implies the ability to reconsider it in ways that might lead someone to question the conclusions drawn from it, or otherwise do something you might not like, and that is a very scary possibility.<\/p>\n

My correspondent is absolutely right that the solution to a) is to publish in parallel formats — PDF for human readers, original manipulable electronic format for those who need to process, mark up, or otherwise work with the text or data in fluid ways. \u00a0I wholeheartedly agree, and hereby amend my request to the gods. \u00a0Fortunately, the gods do not use PDF for their Official Request Forms, so I am able to do so in place with no need to reissue it.<\/p>\n

\u00a0But I wonder how likely we are to see a solution to a) given that we have no solution to b), and apparently no hope of one. \u00a0I would like to believe that a few solid projects and demonstrations would bring people to the point where they would make proper use of PDF, because they had somehow seen enough functionality in non-PDF environments to change their minds about the risks.<\/p>\n

I really have no idea whether that is possible without banning PDF altogether. \u00a0It is an attractive format for those who find print comforting, and for those who worry that somehow allowing any use of data will result in some form of misuse. It is, in fact, too attractive. We have nearly 20 years\u2019 worth of demonstrations that publishing data in formats that allow repurposing is a really, really good idea that saves money and promotes innovation, and there has been no response. \u00a0That leaves me unsure whether we can get everyone on board with that idea without making it impossible for them to do anything else, because we have seen far too little of the change that would validate a just-build-it-and-they-will-come strategy.<\/p>\n

So, sure, I agree. \u00a0Dual-format release is the way to go, I guess, at least in the areas where PDF can justify its existence. \u00a0But I would prefer that the burden be placed on PDF to prove its worth, rather than on open data formats. \u00a0Peter Drucker once famously recommended that a large company begin cleaning up its management practices by banning all reports and only allowing individual reports to return if the author could provide compelling written justification for them. \u00a0That might not be a bad way to go.<\/p>\n","protected":false},"excerpt":{"rendered":"

A few weeks ago I made a characteristically intemperate remark via Twitter, which drew this response from a friend of mine: I was struck by your tweet during the Legislative Data Transparency Conference from a couple of weeks ago: \u00a0\u201cIf the gods reached down and banished all PDFs from the face of the earth, I\u2019d […]<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/posts\/154"}],"collection":[{"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/comments?post=154"}],"version-history":[{"count":6,"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/posts\/154\/revisions"}],"predecessor-version":[{"id":161,"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/posts\/154\/revisions\/161"}],"wp:attachment":[{"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/media?parent=154"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/categories?post=154"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tbruce\/wp-json\/wp\/v2\/tags?post=154"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}