Subscribe Now!

Structured Authoring and the Evolution of XML Standards - Two Perspectives on XML

Plan Your DITA XML Output Strategy First

By Max Dunn

I have been working with XML since it was a glimmer in the eye of Jon Bosak. In fact, before XML was conceived, there was SGML; going from SGML to XML represented a streamlining for the web, but at its core there was not much functional difference; in fact XML is a subset of SGML. The key concept of semantic markup is central to the core value of SGML/XML.

The two main perspectives I have seen are Document-centric XML and Data-centric XML. SGML initially appeared in support of document-centric work: managing all the technical documents or contracts of IBM or Boeing, for example. Charles Goldfarb has maintained that “SGML literally makes the infrastructure of modern society possible” and I think he’s right – hmm, should we blame him for the lengths to which humans have gone to destroy the earth?

The document-centric XML world is really a direct continuation of SGML. When XML came out as a standard in 1998, those of us working with document-centric XML became giddy with excitement, anticipating that the standards being proposed at the time (notably XML itself, XLink, XML Schema, RDF, XSL and pre-cursors to SVG) would finally facilitate tools that made publishing work for organizations that weren’t quite as big as IBM or the Department of Defense. The vision of a semantic web and ubiquitous XML multi-channel publishing, seemed to be growing a foundation in theories gaining critical mass, with apparent support of software companies. It appeared these vendors might actually adopt the standards of the committees they were sitting on. “Throw away Xyvision!” I told my boss at Bertelsmann, “this XSL-FO will completely revolutionize database publishing!”

We were sorely disappointed over the next five years. In the years before 1998 W3C standards seemed magical; concepts from the standards were implemented relatively quickly, without perfection but with steady progress: browser updates would reflect CSS and HTML advances; even Microsoft was shamed into some level of compliance. But the monopolistic tendencies of those on the standards committees, coupled with the academic approach of some of the standards committees, managed to make it less and less likely that a given standard would find a functional implementation.

And there was that other perspective – the data-centric side of things. For many reasons, XML was at the right place at the right time in terms of data management and information exchange. In fact, the very year that XML became a standard, it also became the dominant way that machines (servers) talked to each other around the world. Highly convenient for exchanging info, as firewalls would tend to block anything but text over http, while XML markup would allow any sort of specification for data structures, and validation tools would ensure no info was lost.

In 1998, when you asked a programming candidate “what do you know about XML?” only the document-centric people would know anything. By 2000, everyone doing any serious programming “knew” about XML. Trouble was, they typically knew about “XML” only in the much easier-to-use, irrelevant-to-publishing, sense.

And the standards now had to accommodate two crowds. The work of the W3C XML Schema Working Group, in particular, showed the disconnect. Should a schema be easily human readable? What was the primary purpose of Schema? Goals were not shared by the document- and data-centric sides, and data-centric won out, as they have tended to dominate the XML space ever since that time. RELAX NG came about as an alternative, and if you contrast RELAX NG with W3C Schema, you will see the contrast between the power of a few brilliant individuals aligned in purity of purpose and the impotence of a committee with questionable motives and conflicting goals. Concurrent with a decline in the altruism of committee participants was the huge advance of data-centric XML and the disproportionate representation of that perspective.

Ten years later, we find in the document-centric world that toolsets related to XML in a data sense – parsing, transforming, exchanging info – have made great leaps forward, but we are in many ways still stuck in the 1990s in terms of core authoring and publishing technologies. It is telling that descendants of the three great SGML authoring tools as of 1995 – FrameMaker+SGML, Arbortext Epic, and SoftQuad’s Author/Editor, are, lo and behold, the leading three XML authoring tools in 2009.

There have been some slow-paced advances in document-centric XML standards and tool chains as well, especially the single bright light out there for us, Darwin Information Typing Architecture (DITA) which came out of IBM like XML itself. Yet standards for rendition, XSL-FO and SVG especially, have not advanced along with core proprietary rendition technologies such as InDesign, Flash, or Silverlight, though all of these enjoy nicely copied underpinnings pillaged from the standards. More important, nothing has stepped in to replace the three core authoring tools: the “XML support” of Microsoft Word and Adobe InDesign, for example, do not approach the capabilities of a true XML authoring application. There are a proliferation of XML “editors” but most of the new ones are appropriate for editing a WSDL file or an XML message (the data-centric forms of XML), not a full-fledged document.

Meanwhile, on the data-centric front, XML has simply permeated every aspect of computing. There are XML data types in database systems, XML features in most programming languages, XML configuration files at the heart of most applications, and XML-based Web Services available in countless flavors.

Document-centric XML is simply a deep challenge that will take more time (and probably more of a commercial incentive) to tackle. For the time being, structured authoring managed the XML way is still implemented mainly by very large organizations: such an approach has “trickled down” from organizations the size of IBM to organizations the size of Adobe (which does, in fact, use DITA now), but there are not tool chains yet available that will bring it down much further. The failure of the W3C XML Schema Working Group to provide a functional specification supporting document-centric XML can hardly be underestimated.

As long as content is not easily authored in a semantically rich, structured fashion, the vision of the semantic web will remain an illusion. When and if document-centric XML gets more attention from standards bodies and software vendors, human communications will become far more efficient and effective.

Max Dunn has been programming typographical output for over 15 years. In 2000, after 4 years directing programming at Bertelsmann Industry Services, Max founded Silicon Publishing, a development company specializing in document generation and dynamic web site automation.

TOP 3 Ph.D. Programs in Technical Communication in the United States

© Ugur Akinci

A full-fledged doctoral program that focuses exclusively on technical writing and communication is still a rare bird out in the market today.

Here are the top 3 Doctor of Philosophy (Ph.D.) programs in the United States on technical communication – the highest academic degree you can get as a technical writer and communicator.

TOP 3 Ph.D. Programs in Technical Communication

1) Illinois Institute of Technology

Name of the degree: “Ph.D. in Technology and Humanities” (used to be called “Ph.D. in Technical Communication”)

This is not a practical hands-on program that prepares technical writers for the real world of technical writing out there. However, if you’d like to be an academic and do research and teaching in a university setting in the field of technical communication then this might be what you are looking for. It is a 72-hour program beyond the bachelor’s degree. Students who have earned a master’s degree in a relevant field may transfer up to 30 credit hours.

For more info:

TOP 3 Ph.D. Programs in Technical Communication

2) University of Minnesota

Name of the degree: “Ph.D. in Rhetoric and Scientific and Technical Communication (RSTC)”

This is also a research degree aimed at preparing technical writing students for faculty positions. It is not designed to prepare technical writers to work in government and private sector.

Ph.D. candidates must earn a minimum of 42 credits in coursework, at least 27 of which must be taken in Writing Studies classes and seminars.

For more info:

TOP 3 Ph.D. Programs in Technical Communication

3) Texas Tech University


The program provides a broad base to conduct research in technical editing, design, rhetorical theory, online documentation, publications management, and usability testing.

60 credit hours are needed to complete the program.

For more info:

STC INDIA Presentation: Working for a Software (SW) Company

Working for a Software Company

Click  below to download the presentation slides in PDF format: