http://www.dmst.aueb.gr/dds/pubs/jrnl/2005-IEEESW-TotT/html/v25n2.html This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:
Tools of the Trade
Using and Abusing XML
Diomidis Spinellis
Words are like leaves; and where they most abound,
Much fruit of sense beneath is rarely found.
— Alexander Pope
I was recently gathering GPS coordinates and phone cell identification data,[1] researching how one could obtain results similar to those of Google’s “My Location” facility. While working on this task,
I witnessed once again the great interoperability
benefits we get from the use of XML. With
a simple 140-line script,
I converted the data I gathered into a de facto standard, the XML-based GPS-exchange format called GPX. Then, using a GPS-format converter, I converted my data into Google Earth’s XML data format. A few mouse clicks later, I had my journeys and the associated cell tower switchovers
beautifully superimposed on satellite pictures and maps.
XML is an extremely nifty format. Computers can easily parse XML data, yet humans can also understand it. For example, a week ago a UMLGraph user complained that pic2plot clipped elements from the scalable vector graphics (SVG, another XML-based format) file it generated. I was able to suggest a workaround that modified the picture’s bounding box, which was clearly visible as two XML tag attributes at
the top of the file.
Furthermore, a simple tool can trivially determine if an XML document is well formed (meaning that it follows XML’s rules). And, if we have at hand the document’s schema (a formal description of the allowed composition of a specific document, such as GPX), we can validate that a given file follows the schema. These properties are a boon to interoperability. With the XML schema at hand, when we stumble across a data transfer problem between two applications, we don’t need to quarrel about whose program’s fault it is. A third party, an XML validator, can judge whether the data follows the schema, and thereby impartially assign the fault to the data’s producer or consumer.
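As a rough illustration of how little code a well-formedness check needs, here is a minimal Python sketch built on the standard library's xml.etree.ElementTree parser. The two sample documents and the is_well_formed helper are made up for this example. The sketch checks only well-formedness; validating against a schema such as GPX's would need a separate validating tool, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, reason)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        # The error message includes the line and column of the first violation.
        return False, str(err)


# Hypothetical sample documents; the second one never closes its <wpt> element.
good = '<gpx version="1.1"><wpt lat="37.99" lon="23.73"/></gpx>'
bad = '<gpx version="1.1"><wpt lat="37.99" lon="23.73"></gpx>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

The same impartial verdict applies no matter which application produced the data, which is the property that settles the producer-versus-consumer quarrel.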
XML also gives our code more robust input handling. Input processing is a notorious source of bugs, because there are literally infinite ways to provide wrong input to a program. Moreover, malicious adversaries deliberately craft input data aiming to crash our program or, worse, gain and exploit its privileges. By using XML, we can solve this problem by relying on the widely available libraries for parsing our input. These libraries are, by design and through their ubiquitous deployment, much more resistant to abuse than any special-purpose code we could concoct on our own.
Finally, by adopting XML, we can take advantage of the scores of tools that work on arbitrary XML documents. Common tasks, like editing, validation, transformations, and queries, are then just a matter of selecting and applying the right tool. Also, we can then apply the experience we gain with these tools to other documents we come across in our work. And if, like me, you’re a devoted user of the Unix toolchest, have a look at XMLgawk.[2] It manages to combine gracefully exactly what its awkward name suggests.
When we use XML, we sacrifice (sometimes significant) processing time and space to gain interoperability. So, it makes sense to actually verify that we’ve achieved our goal. Once you come up with a schema, ensure that you have at least one independently written program to read and write data in that schema. Additionally, have a human edit the file, and verify that its structure is intuitive to someone unfamiliar with the schema, and that the programs can still read and process the edited file. Also, formally document your schema in a schema language, such as RELAX NG or XSD (XML Schema Definition), and then have a third-party tool validate your XML files.
Another way to promote interoperability is to adopt existing
schemas. You
can do that either in a wholesale fashion, by
having your application read and write its data in an already existing schema,
for instance SVG, or piecemeal, by having parts of your XML document follow
widely adopted standards. For example, the schema for GPX uses the XML Schema xsd:dateTime data type for time stamping waypoints. In turn, this data type is precisely defined by reference to ISO 8601, the international standard for date and time representations. This approach lets you reuse large swaths of existing work and avoids troublesome ambiguities. A criticism of the Office Open XML file format is exactly that it doesn’t use existing standards for many of the elements it represents, such as (you probably guessed it) dates, but also math and drawings.
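To see what reusing a standard data type buys the consumer, consider the following Python sketch. It reads waypoint time stamps from a small GPX fragment and hands them to the standard library's ISO 8601 parser, with no custom date-parsing code. The sample data and the waypoint_times helper are invented for illustration. Note that datetime.fromisoformat covers only part of what xsd:dateTime allows, and before Python 3.11 it rejects a trailing "Z", so the sketch normalizes it first.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace of GPX 1.1 documents.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

# Hypothetical sample data, not taken from a real recording.
SAMPLE = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="example">
  <wpt lat="37.9942" lon="23.7321"><time>2008-01-15T09:30:00Z</time></wpt>
  <wpt lat="37.9950" lon="23.7330"><time>2008-01-15T09:31:30Z</time></wpt>
</gpx>"""


def waypoint_times(gpx_text):
    """Return the time stamp of each waypoint as a timezone-aware datetime."""
    root = ET.fromstring(gpx_text)
    times = []
    for node in root.findall("gpx:wpt/gpx:time", GPX_NS):
        # Older Python versions do not accept "Z" as a UTC designator.
        stamp = node.text.strip().replace("Z", "+00:00")
        times.append(datetime.fromisoformat(stamp))
    return times


times = waypoint_times(SAMPLE)
print(times[1] - times[0])  # elapsed time between the two waypoints
```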
Furthermore, try to make your program’s XML output accessible to non-XML tools and humans. Specifically, if your data consists of records up to, say, 80 characters long, fit each one on a single line. This lets many line-oriented tools, like Unix’s wc, awk, sed, and grep, process your data. In
more complex files, use appropriate indentation to make the file’s structure
apparent to its human viewers.
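One way to follow this advice when generating XML is sketched below in Python. The cell records, element names, and the to_line_oriented_xml helper are all hypothetical. Each record is serialized on its own line, so a plain line filter (standing in here for grep or wc -l) can count records, while the result remains a well-formed document for XML tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell tower records, used only for illustration.
cells = [
    {"id": "1021", "lat": "37.9942", "lon": "23.7321"},
    {"id": "1022", "lat": "37.9950", "lon": "23.7330"},
    {"id": "1023", "lat": "37.9961", "lon": "23.7342"},
]


def to_line_oriented_xml(records):
    """Serialize the records as XML, one <cell> element per line."""
    lines = ["<cells>"]
    for rec in records:
        elem = ET.Element("cell", rec)
        lines.append("  " + ET.tostring(elem, encoding="unicode"))
    lines.append("</cells>")
    return "\n".join(lines) + "\n"


doc = to_line_oriented_xml(cells)

# A line-oriented count, equivalent in spirit to grep -c '<cell '.
record_lines = [line for line in doc.splitlines() if "<cell " in line]
print(len(record_lines))

# The same document is still ordinary XML for any XML-aware tool.
print(len(ET.fromstring(doc).findall("cell")))
```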
By far, the worst offence I’ve seen in the take-up of XML is its adoption as a format for human-produced code. Three representative examples are the Apache Ant build files, the XML schema definitions (XSD), and the extensible stylesheet language transformations (XSLT). XML
is an adequate, if verbose, format for data that programs produce and consume,
but a nightmare for humans looking at anything more complex than what can fit
on a screen. In most programming languages, tokens get a large part of their meaning from the context in which they appear. For instance, a word appearing on the left of an open
bracket is a function or method name. Contrast this with XML, where each token is explicitly
assigned its meaning through tags and attributes. For
example, in a make file, we can associate a value with a variable by writing
TESTSRC=test/src
Placement on one side or the other of the equals sign distinguishes the variable from its value. In the corresponding XML-based Ant build file, we write the equivalent as
<property name="testsrc" location="test/src"/>
In this case, we use named attributes to specify what’s assigned to what. XML’s approach simplifies the parsing of
arbitrary files, but the corresponding verbosity hinders comprehension and
comfortable programming.
In computer languages, there seems to be a sweet spot between conciseness and wordiness. Apparently, it’s the place where the means for expressing an idea matches our cognitive ability. Languages occupying this spot are the ones in which we achieve long-term productivity (this includes maintenance). Some languages or programming styles, like APL and Perl one-liners, have strayed to extreme conciseness. Other languages, like Cobol and XML, err toward excessive wordiness. Both extremes hinder the software’s analyzability, changeability, and stability and, therefore, its maintainability. Even with the best editor, expressing yourself in XML is a lot less productive than coding the same ideas in a notation specifically designed for a given problem. To convince yourself, try rewriting a simple make file into its Ant XML equivalent. So, if humans will
typically communicate with your software using a language, invest some effort
in designing it properly, rather than relying on the bland (dis)comfort of XML.
Another popular misuse of XML involves the thin wrapping of arbitrary
data with XML tags. Because XML is so flexible, it’s
easy to take any data format, throw in a few tags in the most convenient
places, and (following the letter of the XML definition) call that an XML
document. Yet such documents are difficult to process effectively with standard XML tools; their validation is a charade, and transformations and queries become all but impossible. Consider, for instance, the XML file format used for storing iTunes libraries. Its generation apparently takes the shortcut of converting Apple’s Core Foundation types into a so-called property list, which looks like XML on the outside. Yet the contents of such files are key/value pairs, such as the following:
<key>Name</key><string>Audiobooks</string>
<key>Playlist ID</key><integer>94</integer>
In a better, tailor-designed XML file format, we would expect this pair to be something like
<name id="94">Audiobooks</name>
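The practical difference shows up as soon as we try to query the two forms. The following Python sketch uses the two fragments above, wrapped in enclosing elements (dict and playlists) that are assumptions made so that each fragment parses as a complete document. The helper function names are likewise invented. In the key/value form, meaning depends on sibling order, so the code must pair elements by hand; in the tailored form, a single path expression suffices.

```python
import xml.etree.ElementTree as ET

# Both wrapper elements are assumptions added for this illustration.
PLIST_STYLE = """<dict>
  <key>Name</key><string>Audiobooks</string>
  <key>Playlist ID</key><integer>94</integer>
</dict>"""

TAILORED = '<playlists><name id="94">Audiobooks</name></playlists>'


def name_from_pairs(xml_text, wanted_id):
    """Key/value form: pair each <key> with the value element that follows it."""
    children = list(ET.fromstring(xml_text))
    pairs = {k.text: v.text for k, v in zip(children[0::2], children[1::2])}
    if pairs.get("Playlist ID") == str(wanted_id):
        return pairs.get("Name")
    return None


def name_from_tailored(xml_text, wanted_id):
    """Tailored form: the structure itself carries the relationship."""
    node = ET.fromstring(xml_text).find(f"name[@id='{wanted_id}']")
    return None if node is None else node.text


print(name_from_pairs(PLIST_STYLE, 94))
print(name_from_tailored(TAILORED, 94))
```

The hand-written pairing logic in the first function is exactly the kind of work that standard XML query and transformation tools cannot do for us when the tags carry no domain meaning.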
A similarly dysfunctional XML file will result if we dump a
relational database in XML as columns, rows, and tables. Again,
we miss the opportunity to express in XML the deeper relationships between our
records, which is really XML’s strength.
So, when you’re designing
an XML document, place yourself in the mindset of its consumer. Think,
what’s the best possible structure you would expect? Then
invest in mapping your data into the schema you’ve designed.
Diomidis
Spinellis is an associate
professor in the Department of Management Science and Technology at the Athens
University of Economics and Business and the author of Code Quality:
The Open Source Perspective
(Addison-Wesley, 2006). Contact him at
dds@aueb.gr.
XML has many strengths: computers and humans can
both process it, special tools can validate it, and it promotes robust input
handling. To achieve
interoperability, we should formally define schemas (adopting existing ones,
when possible), and test XML data with different producers and consumers. Formatting the
data in a way that is accessible to both human readers and popular software
tools is also a good practice. XML is also
easily misused. Its adoption as a format for human-produced code and the thin wrapping of arbitrary data with XML tags are two popular offences.