It’s time we came to grips with the fact that not every “document” can be a “web page.” Some forms of writing just cannot be expressed in HTML—or they need to be bent and distorted to do so. But for once, XML might actually help.

The creation myth of the web tells us that Tim Berners-Lee invented HTML as a means of publishing physics research papers. True? It doesn’t matter; it’s a founding legend of the web whose legacy continues to this day. You can gin up as many web applications as you want, but the web is mostly still a place to publish documents.

The web is replete with projects to “digitize legacy content”—patent applications, books, photographs, everything. While photographs might survive well as JPEGs or TIFFs (disregarding accessibility issues for a moment), the bulk of this legacy content requires semantic markup for computers to understand it. A sheet of paper provides complete authorial freedom, but that freedom can translate poorly to the coarse semantics of HTML. The digitization craze—that’s what it is—crashes headlong into HTML semantics.

Some documents cannot be published using HTML. In many cases, we shouldn’t even bother trying. In other cases, we have to radically change the appearance and structure of the document. Ideally, we’ll start using custom XML document types—which, finally and at long last, might actually work.

The film screenplay is an example, legendary in some circles, of the conundrum of transferring print documents to the web.

A lot of people want to write a screenplay. The outcomes for most of these writers are the same: Nobody films and releases their movie. And they all go through the same phase—learning the generations-old “style” of screenplay formatting.

Typewritten screenplay from Die Hard 2.

Originating in the typewriter age, screenplay layouts are custom-engineered so that one printed page (in what we now call U.S. letter size) equals almost exactly one minute of onscreen time. Since most commercial movies run about two hours in length, typical Hollywood movie scripts are 118 to 122 pages long.

Typography is lousy; old typewriter fonts of yesteryear were errantly mapped onto today’s spindly Courier type. But as an example of document engineering, scripts are brilliant.

And now people want to transfer the format—intact—to the web. It’s not going to work.

The quest to adapt scripts to the web recalls other “category errors,” to use Martin Amis’s phrase. Electronic commerce, we eventually figured out, does not take the form of “shopping malls” you “walk” through. “Magazines” and “catalogues” do not have discrete pages you flip (complete with sound effects) and dog-ear. “Web sites” do not look like magazine layouts, complete with multicolumn text and callouts.

Tellingly, this quest recalls early television, which, conventional wisdom holds, behaved more like filmed stageplays. Bringing scripts to the web is noticeably worse than filming a stageplay.

Now, people have tried to make web pages look exactly like typewritten screenplays. The star of this show is screenwriter and inveterate blogger John August. Scrippets, August’s plug-in for WordPress, Blogger, and other systems, does everything it can to spin straw into gold. One of August’s use cases is perfect “screenplay” formatting when viewed in an RSS reader, and the only way to make that happen is through presentational HTML and inline styles. These are, of course, outmoded development methods.

August pitches his project thus (emphasis added): “With Scrippets, you can add boxes of nicely-formatted script to your blog.” That’s actually a restatement of the problem—failed reliance on a page metaphor, failed efforts to duplicate typewriter typography, and failed attempts to replicate one-page-per-minute layout. Script formatting is “nice” for print, but it’s wrong for the web—even for “little boxes” of script content.

Worse, Scrippets ignores whatever small contribution HTML semantics can offer in marking up a screenplay. Pretty much everything gets marked up as paragraphs, but not everything is a paragraph. This is a worse sin than loading up H2s with class names in an uphill battle to notate screenplay semantics.
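The difference between marking everything up as a paragraph and giving each kind of script element its own markup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping in `ELEMENT_MAP`, the class names, and the function `render_html` are all hypothetical, the sample lines are invented, and none of it reflects how Scrippets actually works.

```python
from html import escape

# Hypothetical mapping from screenplay element kinds to HTML tags and class names.
# HTML has no native "scene heading" or "dialogue" element, so the class names
# end up carrying most of the meaning.
ELEMENT_MAP = {
    "scene_heading": ("h2", "scene-heading"),
    "action": ("p", "action"),
    "character": ("h3", "character"),
    "parenthetical": ("p", "parenthetical"),
    "dialogue": ("blockquote", "dialogue"),
}


def render_html(elements):
    """Render (kind, text) pairs as HTML, one element per output line."""
    lines = []
    for kind, text in elements:
        tag, cls = ELEMENT_MAP[kind]
        lines.append(f'<{tag} class="{cls}">{escape(text)}</{tag}>')
    return "\n".join(lines)


# Invented sample content, not taken from any real screenplay.
script = [
    ("scene_heading", "INT. AIRPORT TERMINAL - NIGHT"),
    ("action", "A man pushes through the crowd."),
    ("character", "TRAVELLER"),
    ("dialogue", "Which way to the gate?"),
]

print(render_html(script))
```

Even in this small sketch, the tags themselves say little; it is the class names that record which line is a scene heading and which is dialogue, which is exactly the uphill battle described above.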

The screenplay solution

The way to adapt scripts for the web is through cosmetic surgery. And we have a precedent for it. There’s a healthy market for screenplays published in book form. In fact, “the shooting script” is an actual U.S. trademark (from Newmarket Press) for one series of book versions of movie screenplays.

Hence, to adapt this existing printed form to the web, you have to abandon all hope of duplicating original typescript formatting. You have to design something native to the web, with its relatively weak semantics and pageless or single-page architecture.

Other print formats that need transformation

Armed with this knowledge, what are we going to do? Prediction: nothing. People will continue to fake the appearance of scripts and use John August–caliber presentational code. But we do have an alternative.

The case typified by screenplays is merely a new variation of the difficulty of encoding literature in XML. People have tried it time and time again over the years, but barely any DTD has gotten traction. People just want to mark up everything in HTML (which has staying power). Ill-trained authors mark up everything as a paragraph or a DIV.

People seem to have taken the catchphrase “HTML is the lingua franca of the web” a bit too literally. HTML derives from SGML; XHTML is XML in a new pair of shoes. That’s four kinds of markup right there, but everybody acts as though there is only one kind, HTML. (Most of the time, browsers act like XHTML is HTML with trailing slashes.) Even electronic books are marked up as HTML, as the ePub file format is essentially XHTML 1.1 inside a container file, but that makes ePub files simultaneously HTML and XML. If we can spit those out, why can’t we spit out other kinds of XML?

We are well past the stage where browsers could not be expected to display valid, well-formed XML. Browsers can now do exactly that. Variant literary document types could actually work now. But because they withered on the vine for so long, now it seems nobody wants to make them work. After all, isn’t our new future wrapped up in HTML5? Just as our old future was wrapped up in XHTML2?
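As a rough sketch of what a variant document type for scripts might look like, the Python below parses a small well-formed XML fragment with the standard library. The vocabulary (`screenplay`, `scene`, `slugline`, `action`, `speech`, `character`, `dialogue`) is invented for this illustration and is not an existing DTD or standard; the sample content and the function `list_speeches` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical screenplay vocabulary: every element name here is an assumption
# made for illustration, not part of any published document type.
SCREENPLAY_XML = """
<screenplay>
  <scene>
    <slugline>INT. AIRPORT TERMINAL - NIGHT</slugline>
    <action>A man pushes through the crowd.</action>
    <speech>
      <character>TRAVELLER</character>
      <dialogue>Which way to the gate?</dialogue>
    </speech>
  </scene>
</screenplay>
"""


def list_speeches(xml_text):
    """Return (character, dialogue) pairs for every speech in the document."""
    root = ET.fromstring(xml_text)
    return [
        (speech.findtext("character"), speech.findtext("dialogue"))
        for speech in root.iter("speech")
    ]


for character, line in list_speeches(SCREENPLAY_XML):
    print(f"{character}: {line}")
```

Because the element names state what each piece of text is, a program can pull out every speech without guessing from class names or indentation, which is the kind of semantics the typewritten layout only implies.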

The web is, of course, a wondrous thing, but its underlying language lacks the vocabulary to express even the things that humans have already expressed elsewhere. We ought to accept that some documents have to be reformatted for the web, at least if the goal is using plain HTML. To give web documents the rich semantics of print documents, XML is finally a viable option.