================================================================
XMLTK projects. The description below is incomplete, and some
additional design is necessary for each tool in order to be completely
specified. We assume familiarity with the paper: "XMLTK: An XML
Toolkit for Scalable XML Stream Processing", in PLAN-X 2002.
1. xgrep. This returns only those elements that satisfy a certain
condition:
xgrep -c p1 -e p2 -k p3 oprel cnst
where:
p1 = the context
p2 = the item
p3 = the key
oprel cnst = a comparison operator and a constant (e.g. > 1999)
xgrep copies to the output everything that is not under a context.
For a context, it only copies the items that match an item expression
and that satisfy all operators under -k.
Examples:
% xgrep -c bib -e paper -k year/text() > 1999
returns only papers, and only those published after 1999
(i.e. they HAVE a year of publication after 1999; if a paper has no
year, then it won't be included).
Output looks like this:
<paper> . . . </paper>
<paper> . . . </paper>
. . .
% xgrep -c bib -e paper -k year/text() > 1999 -e *
returns all papers as above, as well as all other elements that are
not papers.
<paper> . . . </paper>
<book> . . . </book>
<paper> . . . </paper>
. . .
(only papers with year > 1999 are included, all other elements are
included with no extra checks)
% xgrep -c bib -e paper -k year/text() > 1999 -e * -k publisher LIKE '%Wesley%'
For every paper element it checks if year > 1999; for every other
element it checks if it has a publisher that contains Wesley.
% xgrep -c /bib/* -e author -e title
for every publication it includes only the author and title. Output:
<book>
<author> . . . </author>
<title> . . . </title>
</book>
. . .
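The filtering rule above can be sketched in Python. This is an
illustrative in-memory version (the function and variable names are
ours, not the tool's); the real xgrep processes the stream with
constant memory, and the -k path's text() step corresponds to
findtext here:

```python
import xml.etree.ElementTree as ET

def xgrep(root, context_tag, item_tag, key_path, pred):
    """Under every context element, keep only the items whose key (the
    first match of key_path) satisfies pred; children with no key, or
    with a non-item tag, are dropped.  Everything outside a context is
    left untouched."""
    for ctx in root.iter(context_tag):
        for child in list(ctx):
            key = child.findtext(key_path) if child.tag == item_tag else None
            if key is None or not pred(key):
                ctx.remove(child)

doc = ET.fromstring(
    "<bib>"
    "<paper><title>A</title><year>1997</year></paper>"
    "<paper><title>B</title><year>2003</year></paper>"
    "<paper><title>C</title></paper>"   # no year: not included
    "</bib>")
xgrep(doc, "bib", "paper", "year", lambda y: int(y) > 1999)
print(ET.tostring(doc, encoding="unicode"))
```

Only the 2003 paper survives; the 1997 paper fails the predicate and
the paper with no year is dropped, matching the rule above.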
2. xtransf makes simple transformations in the XML file, according to a
pattern specified by 'keys' and to a 'template' that shows how the
output has to look.
Command line:
xtransf -c p1 -e p2 -k k1 -k k2 . . . -k kn -t template
where:
p1 = the context
p2 = the item
k1, ..., kn = the 'keys'
template = any XML fragment containing special symbols $1,...$n
xtransf copies to the output everything that is not under a context.
Under the context it looks for an item, and, whenever it finds one, it
matches the keys, and replaces the item with a copy of the template
where it substitutes each $i with the key ki.
Examples:
% xtransf -c bib -e paper -k title/text() -t $1
Replaces each paper element with its title text, as follows:
title1
title2
. . .
what happens if a paper has two titles? I propose that only the
first one will be included. (We have the xnest/xunnest operators.)
% xtransf -c bib -e paper -k title/text() -k author/text()
-t $1 $2
For each paper it returns the instantiated template, containing the
first title and the first author.
% xtransf -c bib -e paper ... -t [some template] -e book -k . -t $1
This applies some transformation to every paper element, and
includes all book elements unchanged.
% xtransf -c bib -e paper ... -t [some template] -e * -k . -t $1
Similarly: now it includes unchanged all elements other than paper
(for which it does some transformation).
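The key/template substitution can be sketched as follows (an
in-memory illustration with names of our choosing; the real tool
streams, and would splice the instantiated template back into the
document rather than yield it as a string):

```python
import xml.etree.ElementTree as ET

def xtransf(root, context_tag, item_tag, key_paths, template):
    """For every item under a context, instantiate the template,
    replacing $i with the first match of key path i (so a repeated key
    contributes only its first occurrence, as proposed above).  $n is
    substituted before $1 so that e.g. $12 is not clobbered by $1."""
    for ctx in root.iter(context_tag):
        for item in ctx.iterfind(item_tag):
            out = template
            for i in range(len(key_paths), 0, -1):
                out = out.replace("$%d" % i,
                                  item.findtext(key_paths[i - 1]) or "")
            yield out

doc = ET.fromstring("<bib><paper><title>T1</title><author>A1</author>"
                    "<author>A2</author></paper></bib>")
print(list(xtransf(doc, "bib", "paper", ["title", "author"], "$1 by $2")))
```

With the repeated author, findtext picks the first one, so the single
result is "T1 by A1".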
3. xinclude This takes an XML file containing 'xlink' elements, and
includes all files pointed to by some of these xlinks. The
pointers can be either to local files or to URLs. The user should
have control over which xlink elements to expand, and which to
leave unchanged. The typical application is in conjunction with
the 'file2xml' tool, which creates an XML file with xlinks to all
files in a directory hierarchy. Then 'xinclude' can be used to
actually include some of those files in the XML document. For
example, assume that directory ~/data contains 500 XML files,
d001.xml, d002.xml, ..., d500.xml. Then one should be able to
concatenate them into a single XML file with a command like:
% file2xml ~/data | xinclude > output.xml
[more switches are probably needed, and perhaps additional usages
of xsort and xgrep]
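The expansion step can be sketched like this (an illustrative
in-memory version; the loader is pluggable, so local files and URLs,
and the user's choice of which xlinks to expand, can all be handled
by the `load` callback, whose name is ours):

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def xinclude(root, load):
    """Replace every element carrying an xlink:href attribute with the
    root element of the document it points to.  `load` maps an href to
    a parsed element, e.g. lambda h: ET.parse(h).getroot() for local
    files."""
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            href = child.get(XLINK_HREF)
            if href is not None:
                parent.remove(child)
                parent.insert(i, load(href))

dir_doc = ET.fromstring(
    '<dir xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<file xlink:href="d001.xml"/><file xlink:href="d002.xml"/></dir>')
fake_load = lambda href: ET.fromstring("<doc>%s</doc>" % href)  # stand-in loader
xinclude(dir_doc, fake_load)
print(ET.tostring(dir_doc, encoding="unicode"))
```

Each xlink element is replaced in place by the (here faked) content of
the file it points to, which is exactly the file2xml | xinclude
pipeline above.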
4. xjoin. This takes two (or more ?) XML files and joins them into a
single file. The granularity of the join, and the key, should be
specified by the user. The join algorithm should make certain
assumptions on the data, e.g. that it is already sorted with
'xsort', and use as little memory as possible. When users want to
join two large XML files, they are expected to run xsort first,
then run xjoin.
For example:
% xjoin -c1 /bib -e1 paper -k1 author/text() bib.xml
-c2 /persons -e2 person -k2 name/text() persons.txt
will produce an output file obtained by joining the paper elements
in the bib.xml file with the person elements of the persons.txt file
(presumably adding more data to the authors in each paper). The
semantics of xjoin needs to be carefully specified: the description
here is clearly incomplete.
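The low-memory, sorted-input join could follow the classical merge
join; here is a sketch over plain Python records standing in for the
items (all names are ours, and the inputs are assumed already sorted
by key, as xsort would produce):

```python
from itertools import groupby

def xjoin(left, right, key1, key2, combine):
    """Merge-join two item streams already sorted by their keys,
    calling combine() for each matching pair.  Apart from buffering
    one key-group of the right input, memory use is constant."""
    it1 = groupby(left, key=key1)
    it2 = groupby(right, key=key2)
    k1, g1 = next(it1, (None, None))
    k2, g2 = next(it2, (None, None))
    out = []
    while g1 is not None and g2 is not None:
        if k1 < k2:
            k1, g1 = next(it1, (None, None))
        elif k1 > k2:
            k2, g2 = next(it2, (None, None))
        else:
            g2 = list(g2)                  # buffer only the matching group
            for a in g1:
                for b in g2:
                    out.append(combine(a, b))
            k1, g1 = next(it1, (None, None))
            k2, g2 = next(it2, (None, None))
    return out

papers = [{"author": "Ann", "title": "P1"}, {"author": "Bob", "title": "P2"}]
people = [{"name": "Ann", "aff": "UW"}, {"name": "Cid", "aff": "MIT"}]
joined = xjoin(papers, people,
               key1=lambda p: p["author"], key2=lambda q: q["name"],
               combine=lambda p, q: {**p, **q})
print(joined)
```

Only Ann appears in both inputs, so the output is her paper enriched
with her person record; Bob and Cid have no partner and are dropped.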
5. xenumerate. This is a very simple tool that enumerates all items
in an XML document, under a given context. For example:
% xenumerate -c /bib -e paper bib.xml
will produce something like this:
1. <paper> . . . </paper>
2. <paper> . . . </paper>
. . .
As another example:
% xenumerate -c /bib/* -e author bib.xml
will enumerate the authors inside each publication:
1. <author> . . . </author>
2. <author> . . . </author>
. . .
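A streaming sketch of the enumeration (names are ours; only a simple
tag context is handled here, not a path context like /bib/*, and the
numbered output format above is only a guess):

```python
import io
import xml.etree.ElementTree as ET

def xenumerate(stream, context_tag, item_tag):
    """Stream through the document and yield (position, element) for
    every item that occurs inside a context element."""
    in_context = 0
    count = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            if elem.tag == context_tag:
                in_context += 1
        elif elem.tag == context_tag:
            in_context -= 1
        elif elem.tag == item_tag and in_context > 0:
            count += 1
            yield count, elem

xml = ("<bib><paper><title>A</title></paper>"
       "<paper><title>B</title></paper></bib>")
for n, paper in xenumerate(io.StringIO(xml), "bib", "paper"):
    print(n, paper.findtext("title"))
```

This prints "1 A" and "2 B": each paper under the bib context is
numbered in document order.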
6. Implement SIX with DIME. SIX is a binary index that allows every
tool in the toolkit to run faster by skipping portions in the XML
document. CUrrently the SIX is a separate file. For example, the
command:
createSindex -t 100 dblp.xml > dblp.six
creates a new file dblp.six, and every tool that operates on
dblp.xml will use dblp.six, if available. This project would
insert the index in the XML document using the DIME standard:
createSindexDime -t 100 dblp.xml > dblpNew.xml
creates a new XML file, dblpNew.xml, which has the SIX embedded
using the DIME standard. DIME allows XML documents to embed binary
data.
This project involves two parts. First is creating the new tool
'createSindexDime' (a better name is needed, of course !). The
second is modifying the XML parser to extract the SIX from the
DIME.
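To make the embedding concrete, here is a sketch of DIME record
framing as we read the DIME draft specification (a 12-byte header
with VERSION/MB/ME/CF and TYPE_T flag bits, 16-bit options/id/type
lengths, a 32-bit data length, and each field padded to a 4-byte
boundary); the encoder/decoder pair is self-consistent, but chunked
records and options are ignored, and this is not the real tool:

```python
import struct

def dime_record(payload, type_str, mb=False, me=False, type_t=1):
    """Encode one DIME record: 12-byte header, then type and data,
    each zero-padded to a multiple of 4 bytes (options and id empty)."""
    pad = lambda b: b + b"\x00" * (-len(b) % 4)
    version = 1
    byte0 = (version << 3) | (int(mb) << 2) | (int(me) << 1)  # VERSION|MB|ME|CF
    byte1 = type_t << 4                                       # TYPE_T|RESERVED
    t = type_str.encode()
    header = struct.pack(">BBHHHI", byte0, byte1, 0, 0, len(t), len(payload))
    return header + pad(t) + pad(payload)

def dime_records(message):
    """Walk a DIME message and yield (type_str, payload) per record."""
    off = 0
    while off < len(message):
        b0, b1, olen, ilen, tlen, dlen = struct.unpack_from(">BBHHHI",
                                                            message, off)
        off += 12
        skip = lambda n: -n % 4
        off += olen + skip(olen) + ilen + skip(ilen)
        t = message[off:off + tlen].decode(); off += tlen + skip(tlen)
        d = message[off:off + dlen]; off += dlen + skip(dlen)
        yield t, d

# Hypothetical layout: the XML document in one record, the SIX in another.
msg = (dime_record(b"<bib>...</bib>", "text/xml", mb=True)
       + dime_record(b"\x00\x01SIXDATA", "application/x-six", me=True))
print(list(dime_records(msg)))
```

The modified XML parser would read the first record as the document
and hand the second (binary) record to the SIX machinery; the media
type "application/x-six" is made up here for illustration.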