It’s time we came to grips with the fact that not every “document” can be a “web page.” Some forms of writing just cannot be expressed in HTML—or they need to be bent and distorted to do so. But for once, XML might actually help.

The creation myth of the web tells us that Tim Berners-Lee invented HTML as a means of publishing physics research papers. True? It doesn’t matter; it’s a founding legend of the web whose legacy continues to this day. You can gin up as many web applications as you want, but the web is mostly still a place to publish documents.

The web is replete with projects to “digitize legacy content”—patent applications, books, photographs, everything. While photographs might survive well as JPEGs or TIFFs (disregarding accessibility issues for a moment), the bulk of this legacy content requires semantic markup for computers to understand it. A sheet of paper provides complete authorial freedom, but that freedom can translate poorly to the coarse semantics of HTML. The digitization craze—that’s what it is—crashes headlong into HTML semantics.

Some documents cannot be published using HTML. In many cases, we shouldn’t even bother trying. In other cases, we have to radically change the appearance and structure of the document. Ideally, we’ll start using custom XML document types—which, finally and at long last, might actually work.

An example of the conundrum of transferring print documents to the web, one that has become legendary in some circles, is the film screenplay.

A lot of people want to write a screenplay. The outcomes for most of these writers are the same: Nobody films and releases their movie. And they all go through the same phase—learning the generations-old “style” of screenplay formatting.

Typewritten screenplay from Die Hard 2.

Originating in the typewriter age, screenplay layouts are custom-engineered so that one printed page (in what we now call U.S. letter size) equals almost exactly one minute of onscreen time. Since most commercial movies run about two hours in length, typical Hollywood movie scripts are 118 to 122 pages long.

Typography is lousy; old typewriter fonts of yesteryear were errantly mapped onto today’s spindly Courier type. But as an example of document engineering, scripts are brilliant.

There’s an entire science involved in text indention. Text is rarely, if ever, “centered”; everything lines up at a tab stop, a concept that CSS expunges from the collective memory. (You could set left margins using the ch unit in CSS3, but nobody does.)
With careful alignments like these, it’s easy to scan down a screenplay page. Semantic use of ALL CAPITALS aids scanning, and clearly does not live up to the purely mechanical name CSS gives it, “text-transform.”

And now people want to transfer the format—intact—to the web. It’s not going to work.

Web “pages” may be called that, but the term is metaphorical. It has nothing to do with sheets of paper that equate to screen time. (Right away that means a shooting script’s many headers and footers would disappear, since we’re dealing with only one “page.”)
Nobody seriously intends screenplays on the web to have the same function they do in real life—getting read, getting optioned or bought, and getting shot. All of that happens on paper, not on Firefox.
HTML (per se) is not extensible. Extensible HTML (XHTML) has really not been extended. Hence the following truism is not going to change: HTML does not have enough tags for the semantics of screenplays, where nearly everything needs its own tag.
- Dialogue seems to be no problem, but dialogue is intermingled with screen and actor instructions, and in HTML both of those would just be placed in paragraph elements—even though the function, and expected appearance, differ drastically.
- What about the myriad headings, including the names of people speaking and notations for the time of day and the manner of speech (often called slugs or sluglines)? We have “a lot” of heading tags in HTML—six of them—but they are arranged hierarchically, not according to function. Would class names really suffice here—that is, H2 class="slugline" versus H2 class="charactername"? Really, the answer is no. Script headings and HTML headings are two different things.
The real movie industry doesn’t need HTML in the first place; it already has viable electronic exchange formats for scripts.
1. One is the proprietary format of Final Draft, the software that dominates the screenplay market the way MS Word dominates in offices. Open-source fanatics may look at this as one more delicious chance to inveigh against a proprietary format, but screenwriters have better things to worry about than open source. Anyway, Final Draft 8’s default document format is now XML.
2. The other is PDF. The movie business doesn’t have to care about accessibility, so even PDFs to which no accessibility features have been added suffice for script exchange. You don’t need tagged PDF, which also doesn’t have enough semantics for screenplays. (You could, in theory, write your own PDF tags, since they’re just XML.)

The quest to adapt scripts to the web recalls other “category errors,” to use Martin Amis’s phrase. Electronic commerce, we eventually figured out, does not take the form of “shopping malls” you “walk” through. “Magazines” and “catalogues” do not have discrete pages you flip (complete with sound effects) and dog-ear. “Web sites” do not look like magazine layouts, complete with multicolumn text and callouts.

Tellingly, this quest recalls early television, which, conventional wisdom holds, behaved more like filmed stageplays. Bringing scripts to the web is noticeably worse than filming a stageplay.

Now, people have tried to make web pages look exactly like typewritten screenplays. The star of this show is screenwriter and inveterate blogger John August. Scrippets, August’s plug-in for WordPress, Blogger, and other systems, does everything it can to spin straw into gold. Among other things, one of August’s use cases is perfect “screenplay” formatting when viewed in an RSS reader, and the only way to make that happen is through presentational HTML and inline styles. These are, of course, outmoded development methods.

August pitches his project thus (emphasis added): “With Scrippets, you can add boxes of nicely-formatted script to your blog.” That’s actually a restatement of the problem—failed reliance on a page metaphor, failed efforts to duplicate typewriter typography, and failed attempts to replicate one-page-per-minute layout. Script formatting is “nice” for print, but it’s wrong for the web—even for “little boxes” of script content.

Worse, Scrippets ignores whatever small contribution HTML semantics can offer in marking up a screenplay. Pretty much everything gets marked up as paragraphs, but not everything is a paragraph. This is a worse sin than loading up H2s with class names in an uphill battle to notate screenplay semantics.

The screenplay solution

The way to adapt scripts for the web is through cosmetic surgery. And we have a precedent for it. There’s a healthy market for screenplays published in book form. In fact, “the shooting script” is an actual U.S. trademark (from Newmarket Press) for one series of book versions of movie screenplays.

Some books just reprint typewritten screenplays at reduced size. This may make you feel like a pro, but what you should feel is cheated: You’re paying good money to read an author’s typewritten manuscript. Spindly Courier looks even worse in reduced size.
Other books completely redesign typewritten screenplays into a design native to book publishing. In a typical layout, speaker names are run inline with dialogue, normal book margins are used, and there’s a huge compaction of vertical whitespace. Typewritten screenplays read quite well in their intended context—but so do screenplay books in their context. (Retypeset scripts have also been used as language-learning aids.)

Hence to adapt this existing printed form to the web, you have to abandon all hope of duplicating original typescript formatting. You have to design something native to the web, with its relatively weak semantics and pageless or single-page architecture.

You could use HTML definition lists to mark up dialogue—explicitly permitted in (W3C-brand) HTML, explicitly banned by Ian Hickson under HTML5. (There, use DIALOG instead, even though the descendants of that tag, DT and DD, are the same descendants DL has.)
You can use PRE to fake indention and line breaks (but you can’t fake the division of a script into pages).
You can disregard text indention and just use CENTERed text.
You could, without too much of a stretch, mark up a script as a table.
You could just not bother too much with semantics, run character names (in bold or STRONG) inline with dialogue, and use HTML headings where feasible.

Other print formats that need transformation

Mastheads: The list of who does what at a magazine or newspaper is actually semantically complex, because each person’s title or the department they work in seems to be a heading. But a masthead marked up with H1 through H6 essentially pollutes the tag stream of the surrounding web page.
Callouts and sidebars: These structures, familiar from magazines, newspapers, and nonfiction books, cause serious confusion in creating a functioning document tree. (At what exact point in the tag stream are you expected to read the callout or sidebar?)
Footnotes: There isn’t a structure for footnotes in HTML (though there is in tagged PDF). Developers have tried all sorts of hacks, including JavaScript show/hide widgets and various rats’ nests of links and reverse links. For literature fans, HTML’s lack of footnotes makes the work of the late David Foster Wallace functionally impossible to render on the web (especially his footnotes within footnotes).
Charticles: With origins commonly attributed to Spy, a charticle is an illustrated featurette with a lot more accompanying text than what a bare illustration has. By way of comparison, a Flickr photo festooned with notes is functionally identical to a charticle, but HTML has no semantics for it.
Math and science: Yes, that old chestnut. Before you exclaim “MathML!” the way a pensioner might yell out “Bingo!,” understand that barely anybody uses MathML on real web pages due to serious authoring difficulty—physicist Jacques Distler remains among the very few who do.

Armed with this knowledge, what are we going to do? Prediction: nothing. People will continue to fake the appearance of scripts and use John August–caliber presentational code. But we do have an alternative.

The case typified by screenplays is merely a new variation of the difficulty of encoding literature in XML. People have tried it time and time again over the years, but barely any DTD has gotten traction. People just want to mark up everything in HTML (which has staying power). Ill-trained authors mark up everything as a paragraph or a DIV.

People seem to have taken the catchphrase “HTML is the lingua franca of the web” a bit too literally. HTML derives from SGML; XHTML is XML in a new pair of shoes. That’s four kinds of markup right there, but everybody acts as though there is only one kind, HTML. (Most of the time, browsers act like XTHML is HTML with trailing slashes.) Even electronic books are marked up as HTML, as the ePub file format is essentially XHTML 1.1 inside a container file—but that makes ePub files simultaneously HTML and XML. If we can spit those out, why can’t we spit out other kinds of XML?

We are well past the stage where browsers could not be expected to display valid, well-formed XML. Browsers can now do exactly that. Variant literary document types could actually work now. But because they languished on the vine for so long, now it seems nobody wants to make them work. After all, isn’t our new future wrapped up in HTML5? Just as our old future was wrapped up in XHTML2?

The web is, of course, a wondrous thing, but its underlying language lacks the vocabulary to express even the things that humans have already expressed elsewhere. We ought to accept that some documents have to be reformatted for the web, at least if the goal is using plain HTML. To give web documents the rich semantics of print documents, XML is finally a viable option.