Got XML? Thanks to the efforts of standards groups like the W3C, which have diligently created and supported guidelines for data formats used on the Web, XML (Extensible Markup Language) is now the most commonly accepted syntax for creating documents for use online, whether on the public web or internally. Essentially, XML provides a portable way for programs and machines to share data across platforms.
But now that you have an XML repository, what are you going to do with it? How can you deploy it? Simply having large masses of data converted to XML doesn't necessarily mean that the data is useful in this form.
Enter XQuery. XQuery makes it possible to extract data intelligently from both real and virtual documents, and thereby enables interaction between the web world and the database world. Its main benefit is the ability to search and access XML files like databases, and to extract and manipulate data from XML documents or from any data source that can be rendered as XML.
Norm Walsh, Principal Technologist at Mark Logic Corporation, describes how the combination of XQuery and XML will provide the flexibility and functionality required to manage and optimize an XML repository.
The first step in leveraging a company's information assets is to transform them into XML. XML provides the open, uniform platform on which sophisticated applications can be built to deliver dynamic content across multiple systems, enabling better identification and sharing of both external and internal content.
XML allows the full richness of content (or "unstructured data", if you prefer) to be maintained while still providing a description of how the data is structured. Different data types use different structures. Just as books have titles, parts, chapters, and paragraphs, purchase orders contain dates, addresses, items, and prices, while scientific journals have articles, titles, paragraphs, tables, figures, and images. You get to decide what structure best reflects the information that your organization needs in its documents. Graphics, media, and other resources can be stored alongside the XML text.
One important observation is that while most companies have a wealth of information that either is or could be in XML, analysts estimate that as much as 70% of total corporate data is still unstructured. Moreover, since not all content is the same, a solution that calls for all content to fit into a single structure is either impossible to manage or requires a structure so loose that it will contain little useful information. Of course, even if you could fit all of today's information into a single structure, or even a small number of structures, new information would inevitably arrive tomorrow, so flexibility is also a key characteristic for managing data and data structures. XML gives you that flexibility.
But no matter how much the virtues of XML are extolled, at the end of the day a big pile of XML is just that: a big pile of XML. XML isn't going to do the job all by itself. Once all of an organization's content has been made accessible, tools are required to take advantage of that content by enabling intelligent access, identification, and ultimately sharing and reuse. Enter XQuery, one of the best tools around to accomplish this.
So what is XQuery? Well, "XML Query Language", or XQuery for short, is a World Wide Web Consortium (W3C) specification that provides flexible query facilities for XML content. XQuery extracts data from real and virtual documents and collections, both locally and on the Web, providing interaction between the Web world and the database world. It is a standardized way of searching through semi-structured data that is either physically stored as XML or virtualized as XML. The XQuery effort at the W3C is led by the XML Query Working Group, whose purpose is to develop open standards so that XML query evolves in a single direction.
XQuery is part of a family of standards from the W3C designed to address, query, and transform XML documents. XML is so ubiquitous these days that almost any programming language has some support for XML. What separates XQuery from the pack is that it was designed from the ground up to be an XML language. This isn't a general-purpose programming language with XML bolted on; it is a precise, efficient tool designed from the outset to work natively with XML.
The problem with other programming languages isn't that they can't process XML; it's that they can't process XML efficiently. Data has to be converted from XML to the language's native data structures. Once converted, it must be manipulated with functions that don't understand the underlying model and are, consequently, not always a good fit. This "impedance mismatch" causes confusion and can introduce errors. Finally, the programming language structures have to be converted back into XML. Each of these steps is tedious and time-consuming, and each introduces the possibility of errors. In a sophisticated application, this process may have to occur several times for each XML resource.
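To make the round trip concrete, here is a minimal sketch in Python, used purely as a stand-in for "a general-purpose language". The order document, the element names, and the helper functions are all hypothetical and chosen only for illustration. It walks through the three steps described above: convert XML to native structures, manipulate them, and convert back.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document; the element names are assumptions.
ORDER_XML = "<order><item>widget</item><price>10.00</price></order>"


def xml_to_dict(xml_text):
    """Step 1: convert XML into a native structure (a flat dict)."""
    root = ET.fromstring(xml_text)
    # Attributes, repeated tags, and mixed content would be lost by this
    # flat mapping, which is one face of the impedance mismatch.
    return {child.tag: child.text for child in root}


def apply_discount(order, percent):
    """Step 2: manipulate the data with code that knows nothing about XML."""
    price = float(order["price"])  # the value arrives as a plain string
    order["price"] = f"{price * (1 - percent / 100):.2f}"
    return order


def dict_to_xml(order, root_tag="order"):
    """Step 3: convert the native structure back into XML."""
    root = ET.Element(root_tag)
    for tag, value in order.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


result = dict_to_xml(apply_discount(xml_to_dict(ORDER_XML), 10))
print(result)
```

Even in this tiny case, each step is a place where structure can be dropped or a value misread, and a real application would repeat the cycle for every document it touches.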
On the other hand, XQuery's native data model is XML. XQuery's functions are designed to operate on XML. None of this conversion is necessary and impedance mismatches don't occur. What's more, in the context of an XML Server, the fundamental efficiencies of XQuery are augmented by a powerful database and a sophisticated search engine that, like XQuery itself, have been designed from the beginning to operate uniquely on XML.
Implementing rich applications with XQuery is remarkably easy because three extremely important features arrive in one package: the ability to search your content, the ability to select content, and the ability to transform content.
Consider a simple example. Suppose you have a large collection of documents that fall under some regulatory control process. You might have thousands of documents that apply to Alabama, thousands that apply to Alaska, etc. However, you need to use this collection of documents to create a new information product that includes regulations about widgets in Massachusetts. Simply selecting all of the documents that apply to Massachusetts isn't useful because they won't all be about widgets. By the same token, a simple full-text search for widgets isn't useful because it'll find documents for every state. What you need is the ability to apply the search to only those documents relevant to both widgets and Massachusetts. The powerful combination of XML and XQuery lets you do exactly this.
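The logic of that combined query can be sketched as follows. This is an illustration in Python using the standard library's ElementTree, not XQuery itself, and the sample collection, the `regulation` element, and the `state` attribute are assumptions invented for the example. An XQuery engine would typically express the same filter as a single path expression with predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample collection; element and attribute names are assumptions.
DOCS = """
<collection>
  <regulation state="Massachusetts">
    <title>Widget safety</title>
    <body>All widgets must be labeled.</body>
  </regulation>
  <regulation state="Massachusetts">
    <title>Gadget imports</title>
    <body>Gadgets require a permit.</body>
  </regulation>
  <regulation state="Alabama">
    <title>Widget disposal</title>
    <body>Widgets may not be burned.</body>
  </regulation>
</collection>
""".strip()


def find_regulations(root, state, term):
    """Return titles of regulations for `state` whose text mentions `term`."""
    matches = []
    # Structural selection: only regulations marked for the given state.
    for reg in root.findall(f"./regulation[@state='{state}']"):
        # Full-text search, restricted to the documents already selected.
        text = " ".join(reg.itertext()).lower()
        if term.lower() in text:
            matches.append(reg.findtext("title"))
    return matches


root = ET.fromstring(DOCS)
titles = find_regulations(root, "Massachusetts", "widget")
print(titles)
```

Neither filter alone is enough: selecting by state returns the gadget regulation too, and searching for the term alone returns the Alabama document. Only the combination isolates the one relevant regulation.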
Additionally, an emerging category of products called XML Servers is now available. These products index all of the content put into the database, both the markup and the data. That means that selective queries like the one just described can be performed without any a priori knowledge of what sorts of queries you're going to need to perform.
With this power in hand, organizations can build and deploy new applications that take information assets in directions never imagined before. Consider these three examples:
- A large publisher with a huge archive of material decides to make a new publication targeted at a specific market. Luckily, they already have all their content in XML. XQuery allows them to efficiently find all of the relevant content and transform it to suit the look-and-feel of the new publication. It's so easy and successful that they now plan to generate a new product like this every three months.
- A medical center with access to a large collection of medical journals and textbooks uses XQuery to find relevant images in their reference materials. This allows diagnosticians to do a real-time comparison between patient x-rays and likely diagnoses.
- A services company uses the power of XQuery and geospatial searching capability to provide information that not only satisfies their customers' requests, but also provides useful alternatives and services that are known to be geographically near the answer to the customer's primary query.
One prerequisite for all of these success stories is a large volume of XML content from which to draw material. But what if your company doesn't have a lot of XML? What if it isn't yet willing to commit to an XML-centric workflow right away?
The good news is that most organizations today probably have more XML than they realize. As XML continues to grow in ubiquity, we see applications like Microsoft Word and Adobe's InDesign using XML as their native storage format. The best XML Servers, unlike other databases and search engines, can work with any schema. These products are ideally positioned to leverage this new content immediately. What's more, most XML Servers come with built-in conversion tools for other formats such as PDF, which allow content to be found and potentially repurposed without significant intervention.
Organizations seeking an open, uniform platform on top of which to build sophisticated, content-centric applications should consider vendor solutions that have XQuery at their core.
About the Author
Norman Walsh is a Principal Technologist in the Information & Media Solutions team at Mark Logic Corporation (http://www.marklogic.com/). He is also an active participant in a number of standards efforts worldwide. Mr. Walsh is an elected member of the Technical Architecture Group at the W3C, where he is also chair of the XML Processing Model Working Group and co-chair of the XML Core Working Group. With more than a decade of industry experience, Mr. Walsh is well known for his work on DocBook and is the principal author of DocBook: The Definitive Guide as well as numerous papers and presentations. At OASIS, he is chair of the DocBook Technical Committee and a member of the RELAX NG Technical Committees. Mr. Walsh was editor of the XML Catalogs specification for the Entity Resolution Technical Committee and wrote the implementation of that OASIS Standard that is part of the XML Commons project at Apache. He was a specification lead for the Java API for XML Processing (JAXP) and has participated occasionally in other XML-related JSRs. Before joining Mark Logic, Mr. Walsh was an XML Standards Architect at Sun Microsystems, Inc.
DCLnews Editorial
December 2008