================================================================
XMLTK projects. The description below is incomplete, and some
additional design is necessary for each tool in order to be completely
specified. We assume familiarity with the paper: "XMLTK: An XML
Toolkit for Scalable XML Stream Processing", in PLAN-X 2002.
1. xgrep. This returns only those elements that satisfy a certain
condition:
xgrep -c p1 -e p2 -k p3 oprel cnst
where:
p1 = the context
p2 = the item
p3 = the key
oprel cnst = a comparison operator and a constant (e.g. > 1999)
xgrep copies to the output everything that is not under a context.
For a context, it only copies the items that match an item expression
and that satisfy all operators under -k.
Examples:
% xgrep -c bib -e paper -k year/text() > 1999
returns only papers, and only those published after 1999
(i.e. they HAVE a year of publication after 1999; if a paper has no
year, then it won't be included).
Output looks like this:
<paper> . . . </paper>
<paper> . . . </paper>
. . .
% xgrep -c bib -e paper -k year/text() > 1999 -e *
returns all papers as above, as well as all other elements that are
not papers.
<paper> . . . </paper>
<book> . . . </book>
<paper> . . . </paper>
. . .
(only papers with year > 1999 are included, all other elements are
included with no extra checks)
% xgrep -c bib -e paper -k year/text() > 1999 -e * -k publisher LIKE '%Wesley%'
For every paper element it checks if year > 1999; for every other
element it checks if it has a publisher that contains Wesley.
% xgrep -c /bib/* -e author -e title
for every publication it includes only the author and title. Output:
<book>
<author> . . . </author>
<title> . . . </title>
</book>
. . .
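The filtering rule above can be sketched in Python. This is an
illustrative in-memory version (the function and variable names are
ours, not the tool's); the real xgrep processes the stream with
constant memory, and the -k path's text() step corresponds to
findtext here:

```python
import xml.etree.ElementTree as ET

def xgrep(root, context_tag, item_tag, key_path, pred):
    """Under every context element, keep only the items whose key (the
    first match of key_path) satisfies pred; children with no key, or
    with a non-item tag, are dropped.  Everything outside a context is
    left untouched."""
    for ctx in root.iter(context_tag):
        for child in list(ctx):
            key = child.findtext(key_path) if child.tag == item_tag else None
            if key is None or not pred(key):
                ctx.remove(child)

doc = ET.fromstring(
    "<bib>"
    "<paper><title>A</title><year>1997</year></paper>"
    "<paper><title>B</title><year>2003</year></paper>"
    "<paper><title>C</title></paper>"   # no year: not included
    "</bib>")
xgrep(doc, "bib", "paper", "year", lambda y: int(y) > 1999)
print(ET.tostring(doc, encoding="unicode"))
```

Only the 2003 paper survives; the 1997 paper fails the predicate and
the paper with no year is dropped, matching the rule above.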
2. xtransf makes simple transformations in the XML file, according to a
pattern specified by 'keys' and to a 'template' that shows how the
output has to look.
Command line:
xtransf -c p1 -e p2 -k k1 -k k2 . . . -k kn -t template
where:
p1 = the context
p2 = the item
k1, ..., kn = the 'keys'
template = any XML fragment containing special symbols $1,...$n
xtransf copies to the output everything that is not under a context.
Under the context it looks for an item, and, whenever it finds one, it
matches the keys, and replaces the item with a copy of the template
where it substitutes each $i with the key ki.
Examples:
% xtransf -c bib -e paper -k title/text() -t $1
Replaces each paper element with its title text, as follows:
title1
title2
. . .
what happens if a paper has two titles? I propose that only the
first one will be included. (We have the xnest/xunnest operators.)
% xtransf -c bib -e paper -k title/text() -k author/text()
-t $1 $2
For each paper it returns the instantiated template, containing the
first title and the first author.
% xtransf -c bib -e paper ... -t [some template] -e book -k . -t $1
This applies some transformation to every paper element, and
includes all book elements unchanged.
% xtransf -c bib -e paper ... -t [some template] -e * -k . -t $1
Similarly: now it includes unchanged all elements other than paper
(for which it does some transformation).
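The key/template substitution can be sketched as follows (an
in-memory illustration with names of our choosing; the real tool
streams, and would splice the instantiated template back into the
document rather than yield it as a string):

```python
import xml.etree.ElementTree as ET

def xtransf(root, context_tag, item_tag, key_paths, template):
    """For every item under a context, instantiate the template,
    replacing $i with the first match of key path i (so a repeated key
    contributes only its first occurrence, as proposed above).  $n is
    substituted before $1 so that e.g. $12 is not clobbered by $1."""
    for ctx in root.iter(context_tag):
        for item in ctx.iterfind(item_tag):
            out = template
            for i in range(len(key_paths), 0, -1):
                out = out.replace("$%d" % i,
                                  item.findtext(key_paths[i - 1]) or "")
            yield out

doc = ET.fromstring("<bib><paper><title>T1</title><author>A1</author>"
                    "<author>A2</author></paper></bib>")
print(list(xtransf(doc, "bib", "paper", ["title", "author"], "$1 by $2")))
```

With the repeated author, findtext picks the first one, so the single
result is "T1 by A1".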
3. xinclude This takes an XML file containing 'xlink' elements, and
includes all files pointed to by some of these xlinks. The
pointers can be either to local files or to URLs. The user should
have control over which xlink elements to expand, and which to
leave unchanged. The typical application is in conjunction with
the 'file2xml' tool, which creates an XML file with xlinks to all
files in a directory hierarchy. Then 'xinclude' can be used to
actually include some of those files in the XML document. For
example, assume that directory ~/data contains 500 XML files,
d001.xml, d002.xml, ..., d500.xml. Then one should be able to
concatenate them into a single XML file with a command like:
% file2xml ~/data | xinclude > output.xml
[more switches are probably needed, and perhaps additional usages
of xsort and xgrep]
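The expansion step can be sketched like this (an illustrative
in-memory version; the loader is pluggable, so local files and URLs,
and the user's choice of which xlinks to expand, can all be handled
by the `load` callback, whose name is ours):

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def xinclude(root, load):
    """Replace every element carrying an xlink:href attribute with the
    root element of the document it points to.  `load` maps an href to
    a parsed element, e.g. lambda h: ET.parse(h).getroot() for local
    files."""
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            href = child.get(XLINK_HREF)
            if href is not None:
                parent.remove(child)
                parent.insert(i, load(href))

dir_doc = ET.fromstring(
    '<dir xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<file xlink:href="d001.xml"/><file xlink:href="d002.xml"/></dir>')
fake_load = lambda href: ET.fromstring("<doc>%s</doc>" % href)  # stand-in loader
xinclude(dir_doc, fake_load)
print(ET.tostring(dir_doc, encoding="unicode"))
```

Each xlink element is replaced in place by the (here faked) content of
the file it points to, which is exactly the file2xml | xinclude
pipeline above.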
4. xjoin. This takes two (or more ?) XML files and joins them into a
single file. The granularity of the join, and the key, should be
specified by the user. The join algorithm should make certain
assumptions on the data, e.g. that it is already sorted with
'xsort', and use as little memory as possible. When users want to
join two large XML files, they are expected to run xsort first,
then run xjoin.
For example:
% xjoin -c1 /bib -e1 paper -k1 author/text() bib.xml
-c2 /persons -e2 person -k2 name/text() persons.txt
will produce an output file obtained by joining the paper elements
in the bib.xml file with the person elements of the persons.txt file
(presumably adding more data to the authors in each paper). The
semantics of xjoin needs to be carefully specified: the description
here is clearly incomplete.
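The low-memory, sorted-input join could follow the classical merge
join; here is a sketch over plain Python records standing in for the
items (all names are ours, and the inputs are assumed already sorted
by key, as xsort would produce):

```python
from itertools import groupby

def xjoin(left, right, key1, key2, combine):
    """Merge-join two item streams already sorted by their keys,
    calling combine() for each matching pair.  Apart from buffering
    one key-group of the right input, memory use is constant."""
    it1 = groupby(left, key=key1)
    it2 = groupby(right, key=key2)
    k1, g1 = next(it1, (None, None))
    k2, g2 = next(it2, (None, None))
    out = []
    while g1 is not None and g2 is not None:
        if k1 < k2:
            k1, g1 = next(it1, (None, None))
        elif k1 > k2:
            k2, g2 = next(it2, (None, None))
        else:
            g2 = list(g2)                  # buffer only the matching group
            for a in g1:
                for b in g2:
                    out.append(combine(a, b))
            k1, g1 = next(it1, (None, None))
            k2, g2 = next(it2, (None, None))
    return out

papers = [{"author": "Ann", "title": "P1"}, {"author": "Bob", "title": "P2"}]
people = [{"name": "Ann", "aff": "UW"}, {"name": "Cid", "aff": "MIT"}]
joined = xjoin(papers, people,
               key1=lambda p: p["author"], key2=lambda q: q["name"],
               combine=lambda p, q: {**p, **q})
print(joined)
```

Only Ann appears in both inputs, so the output is her paper enriched
with her person record; Bob and Cid have no partner and are dropped.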
5. xenumerate. This is a very simple tool that enumerates all items
in an XML document, under a given context. For example:
% xenumerate -c /bib -e paper bib.xml
will produce something like this:
1. <paper> . . . </paper>
2. <paper> . . . </paper>
. . .
As another example:
% xenumerate -c /bib/* -e author bib.xml
will enumerate the authors inside each publication:
1. <author> . . . </author>
2. <author> . . . </author>
. . .
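A streaming sketch of the enumeration (names are ours; only a simple
tag context is handled here, not a path context like /bib/*, and the
numbered output format above is only a guess):

```python
import io
import xml.etree.ElementTree as ET

def xenumerate(stream, context_tag, item_tag):
    """Stream through the document and yield (position, element) for
    every item that occurs inside a context element."""
    in_context = 0
    count = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            if elem.tag == context_tag:
                in_context += 1
        elif elem.tag == context_tag:
            in_context -= 1
        elif elem.tag == item_tag and in_context > 0:
            count += 1
            yield count, elem

xml = ("<bib><paper><title>A</title></paper>"
       "<paper><title>B</title></paper></bib>")
for n, paper in xenumerate(io.StringIO(xml), "bib", "paper"):
    print(n, paper.findtext("title"))
```

This prints "1 A" and "2 B": each paper under the bib context is
numbered in document order.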
6. Implement SIX with DIME. SIX is a binary index that allows every
tool in the toolkit to run faster by skipping portions in the XML
document. CUrrently the SIX is a separate file. For example, the
command:
createSindex -t 100 dblp.xml > dblp.six
creates a new file dblp.six, and every tool that operates on
dblp.xml will use dblp.six, if available. This project would
insert the index in the XML document using the DIME standard:
createSindexDime -t 100 dblp.xml > dblpNew.xml
creates a new XML file, dblpNew.xml, which has the SIX embedded
using the DIME standard. DIME allows XML documents to embed binary
data.
This project involves two parts. First is creating the new tool
'createSindexDime' (a better name is needed, of course !). The
second is modifying the XML parser to extract the SIX from the
DIME.
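To make the embedding concrete, here is a sketch of DIME record
framing as we read the DIME draft specification (a 12-byte header
with VERSION/MB/ME/CF and TYPE_T flag bits, 16-bit options/id/type
lengths, a 32-bit data length, and each field padded to a 4-byte
boundary); the encoder/decoder pair is self-consistent, but chunked
records and options are ignored, and this is not the real tool:

```python
import struct

def dime_record(payload, type_str, mb=False, me=False, type_t=1):
    """Encode one DIME record: 12-byte header, then type and data,
    each zero-padded to a multiple of 4 bytes (options and id empty)."""
    pad = lambda b: b + b"\x00" * (-len(b) % 4)
    version = 1
    byte0 = (version << 3) | (int(mb) << 2) | (int(me) << 1)  # VERSION|MB|ME|CF
    byte1 = type_t << 4                                       # TYPE_T|RESERVED
    t = type_str.encode()
    header = struct.pack(">BBHHHI", byte0, byte1, 0, 0, len(t), len(payload))
    return header + pad(t) + pad(payload)

def dime_records(message):
    """Walk a DIME message and yield (type_str, payload) per record."""
    off = 0
    while off < len(message):
        b0, b1, olen, ilen, tlen, dlen = struct.unpack_from(">BBHHHI",
                                                            message, off)
        off += 12
        skip = lambda n: -n % 4
        off += olen + skip(olen) + ilen + skip(ilen)
        t = message[off:off + tlen].decode(); off += tlen + skip(tlen)
        d = message[off:off + dlen]; off += dlen + skip(dlen)
        yield t, d

# Hypothetical layout: the XML document in one record, the SIX in another.
msg = (dime_record(b"<bib>...</bib>", "text/xml", mb=True)
       + dime_record(b"\x00\x01SIXDATA", "application/x-six", me=True))
print(list(dime_records(msg)))
```

The modified XML parser would read the first record as the document
and hand the second (binary) record to the SIX machinery; the media
type "application/x-six" is made up here for illustration.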