Using XML and XSLT for managing content

    I’ve been helping these folks out lately with a project that involves managing a huge amount of content with some rather strict formatting requirements. Basically, they run an online bible software site (though with a much grander vision), and they needed a way to take multiple bible translations and index them within a relational database while maintaining formatting such as paragraphs, indents, poetry / prose, red letter text, verse numbering, etc.

    

     Currently they have all the content stored as HTML, with all of their spans, divs and CSS classes, and they are storing all of this in MySQL. This wouldn’t be much of a problem if they were dealing with a small amount of content, as on a small CMS-based website… but with the sheer mass of content that they are managing here, they’ve run into several limitations.

Enter XML and XSLT…

http://en.wikipedia.org/wiki/Xml
http://en.wikipedia.org/wiki/XSLT

    

     By using XML with XSLT, you can build a separation between how your document elements are defined and how they are formatted / styled. XML defines the document and all of its nodes… it says ‘this node is a paragraph, this one is poetry, this other one here contains the words of Jesus’. An XSLT stylesheet (XSLT is the transformation language of the XSL family) then describes how those nodes are transformed into another document, typically XHTML, that carries the formatting for them.

       

     So… the data is saved in the database as pure XML, and you have an XSLT stylesheet available that contains all of your <div>, <span> and associated CSS classes. The browser (or your JavaScript) then transforms the XML into XHTML, which is a form of XML that the browser knows how to render. From there, you can use JavaScript to access the DOM and modify the styling on the fly.
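     To make that flow a bit more concrete, here is a rough Python sketch of the idea. It is not XSLT: Python's standard library has no XSLT processor, so a small rule table stands in for the stylesheet's template rules. The element names (`para`, `poetry`, `wj`) and the CSS class names are made up for illustration and are not taken from any real schema.

```python
import xml.etree.ElementTree as ET

# Stand-in for an XSLT stylesheet: each semantic element (hypothetical names)
# maps to the XHTML tag and CSS class it should be rendered with.
RULES = {
    "chapter": ("div", "chapter"),
    "para": ("p", "prose"),
    "poetry": ("div", "poetry"),
    "wj": ("span", "red-letter"),  # "words of Jesus"
}


def transform(node):
    """Recursively rebuild a semantic node as a styled XHTML element."""
    tag, css_class = RULES[node.tag]
    out = ET.Element(tag, {"class": css_class})
    out.text = node.text
    out.tail = node.tail
    for child in node:
        out.append(transform(child))
    return out


def to_xhtml(xml_text):
    """Parse stored XML and return the XHTML markup as a string."""
    return ET.tostring(transform(ET.fromstring(xml_text)), encoding="unicode")


if __name__ == "__main__":
    stored = "<para>He said <wj>follow me</wj>.</para>"
    print(to_xhtml(stored))
```

     The stored document never mentions a div, span or class. All of that lives in the rule table, which is the role the XSLT stylesheet plays in the setup described above.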

    

     Semantic XML is a much leaner storage format than fully marked-up XHTML, because it carries none of the presentational divs, spans and classes… which means less for the browser to download and less for the database to store and query. Since the XSLT is cached on the browser side, the browser does not need to re-download that style definition every time; from that point forward it is only given data and data definitions as XML. Before you ask ‘doesn’t CSS do that?’… the answer is ‘yes’, and browsers will also cache the CSS that you give them in an XML / XSLT / CSS setup. The difference is that since the browser is doing the job of building the XHTML with all of your divs and spans (based on the XSLT you give it), you don’t need to store and serve all of the XHTML formatting information at every request.

    

     If you pre-parse the XML into XHTML, then store and serve that, you lose the independence and graceful degradation that XML gives you. You’d be back to playing the format wars… battling it out between whatever browsers / clients / technologies you want to allow to view the content (think cell phones, PDAs, Flash, Ajax, Safari, Opera, IE, Firefox, etc.). You’d also have to build additional classes (and probably multiple XHTML versions) in order to offer the content to these different players. If all you’re storing and serving is pure XML with XSLT, you can define a different XSLT stylesheet for each of these formats (if there are major differences) instead of having to store all the data in XHTML for them all. This is especially important in the case of bible document storage because of the massive amount of content and the number of translations that are available. Think of how difficult it would be to manage even a handful of translations, and then have to maintain separate XHTML content for each of the browser types that you wanted to support!
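     Here is a small Python sketch of that "one source, many outputs" idea. Again, this is not XSLT: two dictionaries play the part of two stylesheets, and the element names, class names and the plain-text convention (asterisks around the words of Jesus) are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# One stored document (hypothetical element names).
SOURCE = "<para>He said <wj>follow me</wj>.</para>"

# Two stand-in "stylesheets": element name -> (text before, text after).
BROWSER = {
    "para": ('<p class="prose">', "</p>"),
    "wj": ('<span class="red-letter">', "</span>"),
}
PLAIN_TEXT = {
    "para": ("", "\n"),
    "wj": ("*", "*"),
}


def render(node, sheet):
    """Walk the tree and wrap each node as the chosen sheet dictates."""
    before, after = sheet[node.tag]
    parts = [before, node.text or ""]
    for child in node:
        parts.append(render(child, sheet))
        parts.append(child.tail or "")
    parts.append(after)
    # Sketch only: text is emitted as-is, so real XHTML output would
    # still need character escaping.
    return "".join(parts)


def render_for(xml_text, sheet):
    return render(ET.fromstring(xml_text), sheet)


if __name__ == "__main__":
    print(render_for(SOURCE, BROWSER))
    print(render_for(SOURCE, PLAIN_TEXT))
```

     Supporting another client means adding another rule table (in practice, another XSLT stylesheet), while the stored content stays untouched.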

    

     Another thing to consider is working with existing XML standards. Publishers of specific types of content, like the bible, will often work together to come up with a file format that they can agree on. In the case of bible formatting, there are really only two XML-based formats available: USFX (Unified Scripture Format XML), which is an XML version of the backslash-formatted USFM (Unified Standard Format Markers), and OSIS (Open Scripture Information Standard).

    

     I’ve chosen USFX, but I won’t discuss my reasons here, as that is a bit out of scope for this article. I’ll get into some of the specifics of USFX, along with some details on how to make use of XSLT, in another article. I do think, though, that it is important to at least explain why it is good to adopt an existing document standard if you’re able to. If you’re curious, this is what USFM looks like:

Text - Matthew 1 (GNT)

\io1 The last week in and near Jerusalem (21.1-27.66)

\io1 The resurrection and appearances of the Lord (28.1-20)

\c 1

\s1 The Ancestors of Jesus Christ

\r (Luke 3.23-38)

\p

\v 1 This is the list of the ancestors of Jesus Christ, a descendant of David, who was a descendant of Abraham.

And… this same text translated into USFX:

<?xml version="1.0" encoding="utf-8"?>
<usfx xmlns:ns0="http://ebt.cx/usfx/usfx-2005-09-08.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="usfx-2005-09-08.xsd">
<book>
<p level="1" sfm="io">The last week in and near Jerusalem (21.1-27.66)</p>
<p level="1" sfm="io">The resurrection and appearances of the Lord (28.1-20)</p>
<c id="1"/>
<s level="1">The Ancestors of Jesus Christ</s>
<p sfm="r">(Luke 3.23-38)</p>
<p>
<v id="1"/>This is the list of the ancestors of Jesus Christ, a descendant of David, who was a descendant of Abraham.
</p>
</book>
</usfx>
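     To show how mechanical the mapping between the two is, here is a rough Python sketch that converts just the handful of markers used in the sample above (\io1, \c, \s1, \r, \p, \v) into USFX-style elements. It is only an illustration under those assumptions: the function name `usfm_to_usfx` is made up, it expects one marker per line, and it ignores the rest of the USFM marker set, the `<usfx>` wrapper and its namespaces.

```python
import re
import xml.etree.ElementTree as ET

# A backslash marker, an optional numeric level, then the rest of the line.
MARKER = re.compile(r"\\([a-z]+)(\d*)\s*(.*)")


def usfm_to_usfx(usfm_text):
    """Convert a small USFM fragment into a <book> element (as a string)."""
    book = ET.Element("book")
    para = None  # paragraph currently open for verse text
    for raw in usfm_text.splitlines():
        match = MARKER.fullmatch(raw.strip())
        if match is None:
            continue  # sketch: skip blank or unmarked lines
        name, level, rest = match.groups()
        if name == "c":  # chapter number
            ET.SubElement(book, "c", id=rest)
            para = None
        elif name == "v":  # verse number followed by the verse text
            number, _, text = rest.partition(" ")
            if para is None:
                para = ET.SubElement(book, "p")
            ET.SubElement(para, "v", id=number).tail = text
        elif name == "s":  # section heading
            ET.SubElement(book, "s", level=level or "1").text = rest
            para = None
        elif name == "p":  # plain paragraph; verses that follow go inside it
            para = ET.SubElement(book, "p")
            para.text = rest or None
        else:  # other paragraph-style markers keep their name in sfm=""
            attrs = {"level": level} if level else {}
            attrs["sfm"] = name
            ET.SubElement(book, "p", attrs).text = rest
            para = None
    return ET.tostring(book, encoding="unicode")


if __name__ == "__main__":
    sample = "\\c 1\n\\s1 The Ancestors of Jesus Christ\n\\p\n\\v 1 This is the list..."
    print(usfm_to_usfx(sample))
```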

     The primary benefit of adopting an existing document standard, if there is one available, is that you are able to stand on the shoulders of greater men, saving you months or even years of work trying to come up with a standard on your own. While there might be some things missing from an existing schema, you can always add them in a separate namespace. USFM offers the \z (<z:>) namespace for this very purpose:

USFM officially recommends that any additional user generated markup should begin with \z (e.g. \zMyMarker). Markers in this namespace will not be considered a part of the USFM standard, or be generally supported in USFM aware applications. This will become a kind of “private use area”. It will become the user or tool builder’s responsibility to support specific \z markup in ways which meet a local need. Other USFM processing tools cannot be expected to handle \z markup or associated text, and are free to ignore them when they are encountered in the text.

     This gives you room to use formatting that is otherwise missing from the standard, while still taking advantage of the tremendous amount of effort that these folks have put into creating it to begin with. So why reinvent the wheel? After all, if nothing is ever adopted… how can it really be considered a ‘standard’?
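     As a tiny illustration of the private-use rule quoted above, here is a Python sketch of a tool that simply ignores \z markup. It assumes every marker starts its own line, which is a simplification of real USFM, and the function name is hypothetical.

```python
import re

# A marker is a backslash followed by its name, e.g. \p, \v or \zMyMarker.
MARKER = re.compile(r"\\(\w+)")


def strip_private_markup(usfm_lines):
    # Drop every line whose marker lives in the private-use \z namespace,
    # together with the text on that line; keep all other lines unchanged.
    kept = []
    for line in usfm_lines:
        match = MARKER.match(line.lstrip())
        if match and match.group(1).startswith("z"):
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    lines = ["\\p", "\\zMyMarker some local note", "\\v 1 Verse text"]
    print(strip_private_markup(lines))
```

     A tool that does understand a particular \z marker would handle it before this step; everything else can safely skip it, which is what keeps local extensions from breaking other people's software.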


2 Responses to “Using XML and XSLT for managing content”

  1. Matt Says:

    There are also some very helpful tools in SQL Server 2k/2k5 and Oracle (not sure about other databases) that let you use Transact-SQL to query XML stored in XML data type columns. No need to learn something new; just apply the SQL knowledge you already have to the XML you are storing inside of your database.

    But wait, why store it in a database? Do you really want to return a multi-MB file to the user in order for them to view anything? Wouldn’t having the ability to return chunks of the XML from a database based on parameters be far more effective?

    Taking that notion a bit further, and applying some concepts of what an interface might look like: you could allow the user’s screen real estate (resolution) to drive what the “size” of a page is. That is, if the screen is a 15″ monitor, we can fit roughly X number of words on an 800×600 screen, so let’s make that our “page” and allow paging based on the user’s resolution. It’s like a book that you could stretch to make the pages bigger, in turn reducing the number of pages. Pretty cool idea…

  2. Jim Says:

    MySQL doesn’t have any integrated XML support. PostgreSQL does have an optional XML handling package that is distributed with the core server, but it does not have an XML data type; they recommend storing XML data in text fields. They do support validation, indexing, searching (via XPath), and transforming… but they don’t support XQuery or SQL/XML syntax. Most of what they do support is provided through the /contrib/xml2 functions.

    There is rumor of an XML data type in MySQL at some point in the future, and possibly some form of XQuery support, but no ETA on that. Either way, this is really not that big of an issue, because most of the languages that you’d use are going to have native XML support built in anyhow. The only difference is that you can’t use DBMS-native SQL to query XML nodes for information within table columns; it would have to be done from outside of the DBMS.

    -jim
