;; -*- mode: memo -*-
Tutorial: XML toolkit
V1.02 September 5, 2002
Makoto Onizuka (oni@acm.org)
1. What is XML toolkit?
XML toolkit is a framework for a light-weight and high performance
XML data stream processing. It is comprised from two fundamental
libraries (XML TSAX parser, XPath processor) and a collection of
simple XML processing commands (that correspond to plain text
processing commands like cat, head, tail, sort, grep and so forth)
built upon those libraries.
Section 2 explains the XML toolkit commands
and Section 3 explains the XML toolkit libraries and an example.
2. Let's look into the some details of XML toolkit commands.
2.1 What is merit of XML toolkit commands?
There may be many types of data operation demand for XML data
streams, like concatenation, filtering (selection),
re-structuring, sorting, duplicate elimination, calculating
aggregation, and so forth. XML toolkit commands satisfy
these kinds of demands. You can also combine any of the XML
toolkit commands using pipe.
For example,
cat book.txt article.txt | grep title | sort;
can be replaced as
xcat book.xml article.xml | xsort -c "/bib" -e "book" -k "title/text()";
Of course, as the data is written in XML format, one can devise
more complex data operations like 'choose every first
author in every book or article'.
xcat book.xml article.xml | xhead -c "/bib/book" -e "author" -n 1 | xrun /bib/*;
If the data is written in plain text, we need to do much more work
than this using perl or combination of sed, awk, etc.
How do we then compare XML toolkit commands with XSL syntax or
XQuery? Although each toolkit command has only subset functions of
XSL and XQuery, as described above, we can combine any of the toolkit
commands, existing unix commands, and user-implemented data operation
commands using UNIX pipe to realize complex data operations.
2.2 XML toolkit commands details
2.2.1 xcat
a)usage: xcat [-b] [-t type-expr] xmlfile1 xmlfile2 ...;
The -h option shows the above usage message.
b)description:
This command simply concatenates a given XML files (either file
or stdin, either tokenized XML or plain XML) and outputs it to
stdout as plain XML (default) or as tokenized XML (-b option).
The tokenized XML (or we call "binary XML") is such a format
that all tags (element and attribute) is tokenized as integer.
The output XML can be a forest, meaning that there may be several
root elements.
The -t option enables an advanced tokenizer for an integer value
with some constant strings like "$30" or "30 dollar".
For example, xcat -b -t 'xpathExpr:PREFIX%iSUFFIX' tokenizes all
text value whose prefix is "PREFIX" and suffix is "SUFFIX" that
are matched with the given xpathExpr. %i indicates that the value
except PREFIX and SUFFIX is a integer type.
c)example:
xcat article.xml book.xml > bib.xml;
xcat -b article.xml book.xml > bib.xml;
xcat -b -t '//price:$%i' article.xml book.xml > bib.xml;
2.2.2 xrun
a)usage: xrun xpathExpr xmlfile;
"xrun" without any parameters shows the above usage message.
b)description
This command inputs one XPath expression and one XML file (either
tokenized XML or plain XML) and outputs a sub-part of the input
XML that satisfies the given XPath expression. The result can be
a forest. The restriction of the XPath expression is described in
Section 4.
c)example:
xrun "/" xmlfile;
xrun "/bib/book/title" xmlfile;
xrun "//title/text()" xmlfile;
xrun "//book/*/@year" xmlfile;
2.2.3 xsort
More details are written in the PLANX paper.
http://www.cs.washington.edu/homes/suciu/XMLTK/planx.ps
a)usage: xsort [-binary | -b ]
[(-memory | -m) mem-size]
[(-type | -t) type-expr]
[(-context | -c) xpath-expr]
[(-element | -e) xpath-expr]
[(-key | -k) xpath-expr]]]
[xmlfile];
The -help, -h or -? shows the above usage message.
b)description
This command sorts target element (specified by an element XPath
expression) under the context element (specified by a context XPath
expression) using the key element (specified by a key XPath expression)
for an XML input (stdin or file). All elements under the context
are emitted except the elements satisfied by the -e XPath expression.
The -b and -t options play the same role as that of the xcat. The -m
option determines the memory size for in-memory sorting (beyond which
sorting is conducted externally, i.e. on disk).
c)example:
xsort -c "/bib" -e "book" -k "title/text()" xmlfile;
This sorts all "book" element under "bib" element using
"title/text()" as key value.
xsort -c "/bib" -e "book" -k "title" xmlfile;
When you specify the key parameter as an element (not text()),
then the concatenation of all text() under the key element is
interpreted as a key value.
xsort -c "/bib" -e "*" -k "@year" xmlfile;
This sorts all elements under "bib" element using "@year" as
a key value. The key value is always treated as string value
(not as integer value).
xsort -c "/bib/book" -e "author" -k "lastname/text()"
-e "*" xmlfile;
This sorts all "author" element under "book" element using
"lastname/text()" as key value. In addition, all elements except
"author" under the "book" element also are outputed. In this case,
there are two -e XPath expressions for one context element. The xsort
tries to check the first XPath expression, and if it fails then it
tries to check the second XPath expression. So, the order of the -e
XPath expression means a precedence.
xsort -c "/bib" -e "book" -k "title/text()"
-e "article -k "@year" xmlfile;
This sorts all "book" element under "bib" element using
"title/text()" as key value, and sorts "article" element under
"bib" element using "@year" as key value.
xsort -c "/bib/book" -e "author" -k "lastname/text()"
-c "/bib/article" -e "author" -k "firstname/text()" xmlfile;
This sorts all "author" element under "bib/book" element using
"lastname/text()" as key value, and sorts "author" element under
"/bib/article" element using "firstname/text()" as key value.
In this case, there are two context XPath expressions. The xsort
tries to check the first XPath expression, and if it fails then it
tries to check the second XPath expression. So, the order of the -c
XPath expression also means a precedence.
2.2.4 xtail
a)usage: xtail [-help | -h | -? ]
[-binary | -b ]
[(-context | -c) xpath-expr
[(-element | -e) xpath-expr
[(-number | -n) number]]] [xmlfile];
The -help, -h or -? shows the above usage message.
b)description
This command filters the last specified number of target element
(specified by an element XPath expression) under the context element
(specified by a context XPath expression) for an XML input (stdin or
file). All elements under the context are emitted except the elements
satisfied by the -e XPath expression. The -b option plays the same role
as that of the xcat.
c)example:
xtail -c "/bib" -e "//author" -n 2 xmlfile;
xtail -c "/bib" -e "*" -n +1 xmlfile;
xtail -c "/bib" -e "*" -n -1 xmlfile;
2.2.5 xstat
a)usage: xstat xmlfile;
b)description
This command reports the number of elements, the maximum depth
and the average depth of the input XML file. For example, if the
input XML contains only one root element (like ), then
the xstat outputs:
c)example:
xstat bib.xml;
2.2.6 file2xml
a)usage: file2xml [-s start directory name] [-f file to direct output];
b)description
This command collects file and directory information (name, path,
size, permission, type, userid, gourpid, last access time, last
modified time) under a specified start directory and packs it as
XML format. For example, if "CVS" directory contains one "Root"
file, file2xml outputs:
CVS
Root
/homes/june/makoto/work/xmlbase/xmltoolkit/xmltk/file2xml/CVS/Root
33
-rwrr----
regular file
13750
330
Wed Nov 21 11:22:33 2001
Wed Nov 21 11:22:23 2001
c)example:
file2xml -s . -f currentFile.xml;
2.2.7 createSindex
a)usage: createSindex [-t threshHold] xmlfile;
b)description
This command creates a streaming index (SIX) for the input XML file.
If the XML file's name is "xmlfile.xml", the index is stored at
"xmlfile.six" in the same directory as the XML file.
The "-t" option specifies the threshHold (byte) for the index
entry deletion.
Whenever there is a SIX file for its XML file, the XPath processor
automatically uses the index to skip parsing certain part of XML.
There are two type skips:
[fail skip]
When we receive a startElement SAX event and it is not what we need,
we want to skip the whole element and go to the nest sibling
element. For example, if we want to find a book element (/bib/book)
and receive an article startElement event, then we skip the article
element (Element/attribute based fail skip).
Another example is that we want to find a book that is published at
1999 (/bib/book[@year='1999']) but receive a book start element
event that is published at 2001 (Value based fail skip).
[succeed skip]
When we receive an endElement SAX event and we don't need to look
into the following siblings, we can skip them all. For example, if
we want to find a second book's first author (/bib/book[2]/author[1])
and receive an end element SAX event for the author element, then we
can skip 2 depth to go to bib endElement (Element/attribute based
succeed skip).
Another 1 depth succeed skip example is that we have a key constraints
on isbn of books and we found a book whose isbn matches our condition
(/bib/book[isbn='XX123'], value based succeed skip).
Some future work on the SIX is to enable skip parsing using
key constraints, a schema specifying a order of sibling elements
(e.g. DTD content model like (A, B*, C)), sorted elements, and
existential filter condition.
c)example:
createSindex bib.xml;
This creates a full streaming index (bib.six) for the bib.xml.
createSindex -t 100 bib.xml;
This creates a partial streaming index (bib.six) for the bib.xml
whose entry is only for elements larger than 100 byte.
3. Let's look into the some details of XML toolkit libraries.
3.1 What is functionality?
There are two libraries: one is a tokenized XML parser, and
the other is XPath processor.
The tokenized XML parser converts every element tag or attribute
into an unique integer value (we implement as XTOKEN) and calls
tokenized SAX events. For example, startElement('bib') can
be tokenized as startElement(5) and the value '5' corresponds to
'bib'. We extends the Xmill's XML parser to implement our tokenized
XML parser, as it parses XML data stream around ten times faster
than the xerces XML parser. But it just supports ascii character code
(it doesn't support unicode) and does not have DTD validation.
The XPath processor combined with the above tokenized XML parser
calls only such SAX events that satisfy a given collection of
XPath expressions. In other words, it filters out any SAX events
that does not satisfy any given XPath expressions. This is effective
for such a case that you want to operate some particular part of the
input XML data stream (not whole of the XML data).
For example, if you want to make a title list of bibliography, then
you need to register two XPath expression '/bib/book/title' as
$x1 and '/bib/article/title' as $x2 (or simply one XPath expression
'//title' or '/bib/*/title'). If the input XML data stream is
Data on the web
Serge Abiteboul
Peter Buneman
Dan Suciu
Processing XML Streams with Deterministic ...
Todd J Green
Makoto Onizuka
Dan Suciu
then the XPath processor calls only the following call-back events:
startContext($1) // for '/bib/book/title'
startElement('title')
character('Data on the web')
endElement('title')
endContext($1)
startContext($2) // for '/bib/article/title'
startElement('title')
character('Processing XML Streams with Deterministic ...')
endElement('title')
endContext($2)
You can find there are additional events 'startContext' and
'endContext'. These enable the applications to know which XPath
expression satisfies which SAX events.
3.2 How does one develop a new application using XML toolkit library?
If you would like to build a new application or command using XML
toolkit library, you need to understand:
1) com-lite interface
2) main functions
3) handler functions (including SAX events)
4) library functions
3.2.1 com-lite interface
TJ wrote this part already. See
URL http://www.cs.washington.edu/homes/suciu/XMLTK/comlite.html
The concept is that it utilizes the factory class design pattern
and smart pointer.
3.2.2 main functions
Here, I explain library functions using the xrun example.
#include "xmltk.h"
int main(int argc, char* argv[]){
if (argc != 3){
cout << "Usage: " << argv[0] << " xpathExpr xmlfile" << endl;
return 1;
}
// initialize token table that convert string
// (element, attribute, extented integer) to XTOKEN
InitGlobalTokenTable();
//
// before the parsing
//
try {
IDfaFilter *g_pfilter = NULL;
// create XML parser and XPath processor object
// 1st parameter: Interface ID
// 2nd parameter: result
CreateDfaFilter(&IID_IDfaFilter, (void**)&g_pfilter);
// register a given XPath to the XPath processor.
// 1st parameter: Variable *, a variable for a parent XPath expression
// 2nd parameter: char *, XPath expression
// 3rd parameter: book, echo toggle (described at Section 3.4.1)
// The XPath processor manages the construction/destruction of
// the returned Variable *.
Variable * g_pvRoot = g_pfilter->RegisterQuery(NULL, argv[1], true);
// create an input file stream object
// 1st parameter: char *, file path
IFileStream *pstm = _CreateFileStream(argv[2]);
// create an output file stream object
// 1st parameter: bool, binary format or plain format
ITSAXContentHandler *pchOut = CreateStdoutStream(false);
// create a handler that implements call-back functions
// 1st parameter: ITSAXContentHandler *
myHandler *handler = new myHandler(pchOut);
// register hander to XPath processor
// 1st parameter: IDfaFilter *
// 2nd parameter: user class derived from IFilteredTSAXHandler
IUnknownCL_SetHandler(g_pfilter, handler);
//
// parse (call-back functions are invoked from the parse)
// 1st parameter: IFileStream *
// 2nd parameter: IDfaFilter *
ParseUnknownStream(pstm, g_pfilter);
//
// after the parsing
// ATOMICRELEASE deletes the parameter and set it as NULL;
ATOMICRELEASE(pchOut);
ATOMICRELEASE(pstm);
ATOMICRELEASE(g_pfilter);
}
catch (_Error &err) {
// print an error message
err.perror();
}
// cleanup the token table
CleanupGlobalTokenTable();
return 0;
}
3.2.3 Hanlder functions
Similar to handling the SAX events, you need to implement your
handler to receive events from the XPath processor (or XML parser)
using ITSAXContentHandler or IFilteredTSAXHandler interface.
ITSAXContentHandler interface defines SAX events and
IFilteredTSAXHandler interface defines both SAX events and context
events (startContext and endContext).
class myHandler : public IFilteredTSAXHandler
{
public:
//
// *** IUnknownCL methods ***
// CL_STDMETHOD is a macro to define a bool method
// CL_STDMETHOD_ is a macro to define a method whose return
// type is specified by the first parameter
//
CL_STDMETHOD(QueryInterface) (RCLIID riid, void **ppvObj);
CL_STDMETHOD_(ULONG,AddRef) ();
CL_STDMETHOD_(ULONG,Release) ();
//
// *** ITSAXContentHandler methods ***
//
CL_STDMETHOD(startDocument) ();
CL_STDMETHOD(endDocument) ();
CL_STDMETHOD(startElement) (XTOKEN xtName);
CL_STDMETHOD(endElement) (XTOKEN xtName);
CL_STDMETHOD(attribute) (XTOKEN xtName, char *pszChars, int cchChars);
CL_STDMETHOD(characters) (char *pszChars, int cchChars);
CL_STDMETHOD(cdata) (char *pszChars, int cchChars);
CL_STDMETHOD(extendedint) (XTOKEN xt, int iInt);
//
// *** IFilteredTSAXHandler methods ***
//
CL_STDMETHOD(startContext) (Variable *pv);
CL_STDMETHOD(endContext) (Variable *pv);
myHandler(ITSAXContentHandler *pchOut) {
m_pchOut = pchOut;
m_pchOut->AddRef();
}
virtual ~myHandler(){
ATOMICRELEASE(m_pchOut);
}
ITSAXContentHandler *m_pchOut;
ULONG m_cRef;
};
//
// *** IUnknown methods ***
//
// RCLIID is a static value that indicates a class ID
bool myHandler::QueryInterface(RCLIID riid, void **ppvObj)
{
if (IsEqualCLIID(riid, &IID_IUnknownCL) ||
IsEqualCLIID(riid, &IID_ITSAXContentHandler) ||
IsEqualCLIID(riid, &IID_IFilteredTSAXHandler)){
*ppvObj = static_cast(this);
}
else {
*ppvObj = NULL;
return false;
}
AddRef();
return true;
}
ULONG myHandler::AddRef(){
return ++m_cRef;
}
ULONG myHandler::Release(){
if (--m_cRef > 0)
{
return m_cRef;
}
delete this;
return 0;
}
You need follow the com-lite style because the XML Toolkit class
library is defined in com-lite style and you need to define your
own class as a sub-class of those library classes. Thus, you
need to implement all IUnknownCL methods in the way like above.
You can implement anything you like for every SAX call-back function,
ITSAXContentHandler and IFilteredTSAXHandler methods.
You can also implement anything you like on the constructer and
destructor of your handler. You may want to initialize
your own data member, or may need to free your data member memory
space.
3.2.4 library functions
a) g_ptt: g_ptt is a global variable points to a global token table.
char * StrFromXTOKEN(XTOKEN token);
returns the original string of the "token".
XTOKEN XTOKENFromStr(char * s, XST_TYPE);
XST_TYPE=: (XST_ELEMENT | XST_ATTRIBUTE | XST_EXTENTEDINT)
returns a token of the "s" as type "XST_TYPE".
The XST_TYPE is eigther XST_ELEMENT, XST_ATTRIBUTE, or XST_EXTENTEDINT.
b) Variable class: The Variable instance is returned by
the IDfaFilter::RegisterQuery().
Variable * getParent(void);
returns a variable for the parent XPath expression.
void setUserSpace(void * s);
You can set any pointer to your own memory space.
void* getUserSpace(void);
returns the pointer to the memory space.
3.3 Advanced feature of the tokenized XML processor
3.3.1 Input
The input XML data can be either tokenized XML or ordinary plain
text XML. The tokenized XML (or we call "binary XML") is such a
format that all tags (element and attribute) is tokenized as integer.
3.3.2 Call-back event
In addition to the same as SAX events (startDocument, endDocument,
startElement, endElement, characters, cdata), there are additional
and modified events.
a) characters(char *pszChars, int cchChars);
cdata(char *pszChars, int cchChars);
Not like ordinary XML SAX parser, the pszChars is a pointer to
the character data WITHOUT STRING TERMINATOR. So you need to
know the length of the string. The ccChars tells you the length
of the string.
b) attribute(XTOKEN xtName, char *pszChars, int cchChars);
Not like the ordinally SAX event interface, an attribute is
independent from its element and "attribute" event is invoked for
each attribute.
c) extendedint(XTOKEN xt, int iInt);
This event is a special case for character SAX event. If the user
registers some extended integer using "RegisterType" method with
a parameter in 'xpathExpr:PREFIX%iSUFFIX' format, then an
extendedint event is invoked instead of a character event.
d) context events: startContext(Variable *pv), endContext(Variable *pv)
As decribed in Section 3.1, these enable the applications to know
which XPath expression satisfies which SAX events. The Variable *
is returned when the user register an XPath expression.
3.4 Advanced feature of the XPath processor
3.4.1 Echo toggle (ON/OFF)
When the user register XPath expression, he/she needs to specify
an echo toggle (ON/OFF) by the third parameter of "RegisterQuery"
method.
RegisterQuery(Variable * parentVariable,
const char * xpath,
bool echoToggle);
If the toggle is ON (true), then both the context events and all
the SAX events during the given XPath expression is satisfied are
call-backed (as described in Section 3.1).
If the toggle is OFF (false), then the only context events are
invoked. None of SAX events are invoked.
3.4.2 Echo overwrite
The echo toggle of an XPath expression can be overwritten by
its child XPath expression. For example, if the user want to
output all element except "title" element, then he/she realizes
it as follows:
Variable * v1 = g_pfilter->RegisterQuery(NULL, "/", true);
Variable * v2 = g_pfilter->RegisterQuery(v1, "//title", false);
This directs all SAX events invoked under "/" XPath expression
unless "//title" XPath expression is satisfied.
3.4.3 Precedence
Like xsort examples, the user sometimes need to use some
conditional procedure:
IF ... THEN ...
ELSE IF ... THEN ...
....
ELSE ...
The XPath processor supports this functionality by introducing
a precedence in XPath expressions.
RegisterQuery(Variable * parentVariable,
const char * xpath,
bool echoToggle
float precedence);
The forth parameter of the "RegisterQuery" method indicates the
precedence. When several XPath expressions become satisfied, the
XPath processor only invokes startContext events with the highest
precedence. Precisely to say, the XPath processor calculates the
minimum precedence in the satisfied XPath expressions, and locates
and invokes startContext event for variables with the minimum
precedence.
4. Restrictions
4.1 The predicate expressions
== supported ==
1. The predicate needs to be written with an attribute or value.
If the predicate is binary, it has to be [attribute operator value]
order.
e.g. //book[@year]/title, //book[@year=1999]/title
2. The position predicate can be used at any location step.
e.g. /bib/book[2]/author[1]
3. Predicates can be concatinated.
e.g. //book[2][@year=1999]/title[contains(text(),'XML')/text()
== not supported (restrictions) ==
4. The tail location step can not have the predicate written in
the above, because the lazyDFA does not support output XML
buffering.
n.g. //book[@year]
o.k. //book[@year]/title
5. DoubleSlash with position predicate is not supported.
n.g. /bib/book//name[2]
"book" has to control the increment/reset the
number of "name".
6. Position predicate following another predicate is not supported.
n.g. //book[@year=1998][2]
o.k. //book[2][@year=1998]
"@year" has to control the increment and
"book" has to control the reset.
7. Element or "." is not supported in the predicate.
n.g. //book[contains(.,'XML')]/title
8. Such XPath expressions that uses both "//" and any predicate does
not work correctly. The lazyDFA does not output error message in
this case.
Precisely saying, such XPath expressions does not work correctly if
it contains some predicate and it matches with prural XML fragment
simaltaniously. For example, //a/a[1] matches two XML fragment in
the following XML.
<-- 1st context starts here
<-- 1st context completed here.
simaltaniously, 2nd context starts here.
123
123
456
456