Was XML Flawed from the Start?
Some of you may have read recently about the latest crisis in the world of XML. (It seems like there's always a crisis in that camp.) Users are discovering that XML is taking huge amounts of bandwidth and is generally very slow compared to other methods. (Here is a recent article on it. Here is another.) Is that supposed to be a surprise?
I should start by mentioning that I've tried to refrain from bashing XML since I first came across it in 1997. Bashing experimental computer languages serves no useful purpose. However, now that many XML users are taking a closer look at the "emperor's new clothes", I think the time has come to ask a few important questions.
Is XML a mistake? Isn't it based on the false assumption that a markup technique is the correct way to exchange semantic information between disparate databases and computer systems? Isn't that backwards? I will explain.
The architects of XML assumed that they could build on the success of HTML (hypertext markup language). As HTML has proven on the web, the markup technique works fairly well for documents. That's because HTML documents normally consist of a higher proportion of human-interpreted content to denotable semantic values (computer interpreted content). In addition, documents tend to be fairly flat in structure compared to other types of data.
It's All About Metadata
The fundamental idea of markup languages is to denote metadata (data that describes other data). These languages assume that the metadata should be quoted ("quote" meaning to delimit the bounds of the data expression). Supposedly the data itself is not quoted, but that's not really true, because the metadata ending tags are used to delimit (quote) the data. Here is a simple example:
<name>Bob Smith</name>
The < and the > are the quotes used to denote metadata (tags). Here they indicate that what follows is a name. However, to indicate where the name ends, another tag is needed. So the tags actually work as quotes themselves.
So XML is perhaps the most inefficient representation for data that you could possibly invent! Is it any surprise that it requires so much more storage and bandwidth?
I said that XML is backwards. Here's why. If the majority of your content is semantic (as is most non-document content), it is better to lexically denote the data, quoting it as necessary, and leave the metadata alone. That's what REBOL does. In REBOL, the above example becomes:
name: "Bob Smith"
Here the word is metadata and the quoted string is data. (Don't be confused by my use of REBOL's assignment notation here. This is data, not code. REBOL blends both equally well.) Normal quotes are used to bound the string data.
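To put rough numbers on the difference, the following Python sketch counts the characters each form spends on anything other than the value itself. It is only an illustration: the helper name markup_overhead is made up for this example, and character count is a crude proxy for storage or bandwidth.

```python
def markup_overhead(encoded: str, payload: str) -> int:
    """Characters spent on delimiters and metadata rather than on the payload."""
    return len(encoded) - len(payload)


payload = "Bob Smith"
xml_form = "<name>Bob Smith</name>"   # tag name appears twice
rebol_form = 'name: "Bob Smith"'      # word appears once, string is quoted

print(markup_overhead(xml_form, payload))    # 13
print(markup_overhead(rebol_form, payload))  # 8
```

The gap comes mostly from the closing tag, which repeats the name only to mark where the value ends.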
The Structuring of Data
XML uses its metadata tags as "grouping" quotes everywhere. This extends to all levels of structure. So, to create a customer record in XML you write:
<customer>
    <name>Bob Smith</name>
    <email>email@example.com</email>
    <site>http://www.example.com/bob</site>
    <age>27</age>
    <phone>555-1212</phone>
    <city>Ukiah</city>
</customer>
The <customer> tags indicate the bounds of the customer data record.
In REBOL, there is a single mechanism for all groups of data: blocks. The above example would be:
customer: [
    name: "Bob Smith"
    email: email@example.com
    site: http://www.example.com/bob
    age: 27
    phone: #555-1212
    city: "Ukiah"
]
Here REBOL's block symbols [ ] are used to indicate the bounds of the customer data record. This method works for all levels of structure.
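For comparison, here is a minimal Python sketch of what a receiving program has to do with the XML form of the customer record. It uses the standard xml.etree.ElementTree module; the function name record_from_xml is an assumption made for this example. Every value comes back as plain text, so the age arrives as the string "27" and any typing has to be added by the receiver, whereas the REBOL block writes 27, the URL, and the quoted strings as lexically distinct values.

```python
import xml.etree.ElementTree as ET

CUSTOMER_XML = """<customer>
    <name>Bob Smith</name>
    <email>email@example.com</email>
    <site>http://www.example.com/bob</site>
    <age>27</age>
    <phone>555-1212</phone>
    <city>Ukiah</city>
</customer>"""


def record_from_xml(text):
    """Return (record name, {field name: text value}) for a flat XML record."""
    root = ET.fromstring(text)
    # Each child tag plays the role of a field name; its text is the value.
    fields = {child.tag: child.text for child in root}
    return root.tag, fields


kind, fields = record_from_xml(CUSTOMER_XML)
print(kind)            # customer
print(fields["age"])   # 27 (a string, not a number)
```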
Redundant Tagging - Duplication of Semantics
The XML situation becomes even worse when a sequential series of values needs to be expressed. Suppose the customer record above is extended to indicate products of customer interest:
<interests>
    <product>cpu</product>
    <product>memory</product>
    <product>disk</product>
</interests>
Even if this record were 1000 products long, the same redundant tags would be applied. That's because XML uses tags as delimiters (quotes).
In REBOL, you would recognize that the products all come from the same semantic domain, so you can imply the semantics in such cases and just write:
interests: [cpu memory disk]
It's OK to imply semantics. In the case above we might happen to know that for interests, all the values are of the same type. We don't need to separately denote each value as a <product>. It can be implied here.
It is natural to use these types of reductions. We do this all the time in our natural languages (as well as in REBOL). The concept of a sequence (or series) within a context of definition (or vocabulary) can be very useful.
The above example is order independent, but that's not a requirement. The expression could also have a grammar, even a very simple one, and that's what REBOL dialects are all about. There is nothing wrong with taking advantage of that feature. We don't need to mark up everything.
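The 1000-product case can be checked with a short Python sketch. It builds the XML form with the standard xml.etree.ElementTree module and the block form with plain string joining, then counts the characters that are not product names. The function names and the generated names item0 through item999 are assumptions for illustration, and the block form assumes each product is a simple word with no spaces.

```python
import xml.etree.ElementTree as ET


def interests_xml(products):
    """Serialize products as <interests><product>...</product>...</interests>."""
    root = ET.Element("interests")
    for product in products:
        ET.SubElement(root, "product").text = product
    return ET.tostring(root, encoding="unicode")


def interests_block(products):
    """Serialize products as a REBOL-style block: interests: [a b c]."""
    return "interests: [" + " ".join(products) + "]"


def overhead(encoded, products):
    """Characters in the encoding that are not part of any product name."""
    return len(encoded) - sum(len(p) for p in products)


products = [f"item{i}" for i in range(1000)]
print(overhead(interests_xml(products), products))    # 19023
print(overhead(interests_block(products), products))  # 1012
```

Each product costs 19 characters of tagging in the XML form and one space in the block form, so the gap grows linearly with the length of the list.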
For example, if a large number of customer records are being transferred, a good REBOL designer would probably pick a hybrid method that blends position dependent values in a series with name denoted values for optional items. It might look like this:
[ "Bob Smith" email@example.com http://www.example.com "Ukiah" [age 27 phone #555-1212] ]
I'll write more about this and other approaches in a separate article.
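As a rough sketch of how a receiver might read the hybrid record above, the following Python function treats the leading values as positional and the trailing block as alternating name/value pairs. The field order (name, email, site, city) is inferred from the example, and the function name and the use of Python lists to stand in for REBOL blocks are assumptions for illustration only.

```python
# Assumed meaning of the positional values, taken from the example record.
POSITIONAL_FIELDS = ("name", "email", "site", "city")


def decode_record(record):
    """Turn [v1, v2, v3, v4, [key, value, ...]] into a dict of named fields."""
    *values, options = record
    if len(values) != len(POSITIONAL_FIELDS):
        raise ValueError("unexpected number of positional values")
    if len(options) % 2 != 0:
        raise ValueError("optional items must come in name/value pairs")
    result = dict(zip(POSITIONAL_FIELDS, values))
    # Optional items alternate: name, value, name, value, ...
    result.update(zip(options[0::2], options[1::2]))
    return result


record = [
    "Bob Smith",
    "email@example.com",
    "http://www.example.com",
    "Ukiah",
    ["age", 27, "phone", "555-1212"],
]
print(decode_record(record)["age"])  # 27
```

Fields that every record carries cost no naming at all, while the optional items stay self-describing.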
Embedding Code - Code as Data
And finally, what ultimately may be the most serious flaw in XML: embedded code was an afterthought. Code representation in XML actually helps illustrate the extreme inefficiency of the XML markup technique.
This problem does not occur in REBOL because the domains of data and code share exactly the same representation. But, that's the subject of a separate article, as is also a discussion of the merits of using a standard for data exchange (XML) as opposed to no standard at all.
Updated 25-May-2023 - Copyright Carl Sassenrath - WWW.REBOL.COM