================================================================
XMLTK projects.

The description below is incomplete, and some additional design is
necessary before each tool is completely specified. We assume
familiarity with the paper "XMLTK: An XML Toolkit for Scalable XML
Stream Processing", in PLAN-X 2002.


1. xgrep.

This returns only those elements that satisfy a certain condition:

    xgrep -c p1 -e p2 -k p3 oprel cnst

where:

    p1    = the context
    p2    = the item
    p3    = the key
    oprel = the comparison operator (e.g. >, LIKE)
    cnst  = the constant the key is compared against

xgrep copies to the output everything that is not under a context.
Under a context, it copies only the items that match an item
expression and that satisfy all conditions given with -k for that
expression.

Examples:

    % xgrep -c bib -e paper -k year/text() > 1999

returns only papers, and only those published after 1999 (i.e. they
HAVE a year of publication after 1999; a paper that has no year is
not included). Output looks like this:

    <bib>
      <paper> . . . </paper>
      <paper> . . . </paper>
      . . .
    </bib>

    % xgrep -c bib -e paper -k year/text() > 1999 -e *

returns all papers as above, as well as all other elements that are
not papers:

    <bib>
      <paper> . . . </paper>
      <book> . . . </book>
      <paper> . . . </paper>
      . . .
    </bib>

(only papers with year > 1999 are included; all other elements are
included with no extra checks)

    % xgrep -c bib -e paper -k year/text() > 1999
            -e * -k publisher LIKE '%Wesley%'

For every paper element it checks if year > 1999; for every other
element it checks if it has a publisher that contains Wesley.

    % xgrep -c /bib/* -e author -e title

For every publication it includes only the author and title. Output:

    <bib>
      <paper> <author> </> . . . </paper>
      <book> <author> </> <title> </> <author> </> . . . </book>
      <paper> . . . </paper>
      . . .
    </bib>


2. xtransf

This makes simple transformations in the XML file, according to a
pattern specified by 'keys' and to a 'template' that shows how the
output has to look.

Command line:

    xtransf -c p1 -e p2 -k k1 -k k2 . . .
            -k kn -t template

where:

    p1          = the context
    p2          = the item
    k1, ..., kn = the 'keys'
    template    = any XML fragment containing special symbols
                  $1, ..., $n

xtransf copies to the output everything that is not under a context.
Under the context it looks for an item and, whenever it finds one, it
matches the keys and replaces the item with a copy of the template in
which each $i is substituted with the value of key ki.

Examples:

    % xtransf -c bib -e paper -k title/text() -t <result> $1 </result>

Replaces each <paper> element with a <result> element, as follows:

    <bib>
      <result> title1 </result>
      <result> title2 </result>
      . . .
    </bib>

What happens if a paper has two <title>s? I propose that only the
first one be included. (We have the xnest/xunnest operators.)

    % xtransf -c bib -e paper -k title/text() -k author/text()
              -t <r> <t> $1 </t> <a> $2 </a> </r>

For each <paper> return an element of the form
<r> <t> .. </t> <a> .. </a> </r>, containing the first <title> and
the first <author>.

    % xtransf -c bib -e paper ... -t [some template] -e book -k . -t $1

This applies some transformation to every <paper> element, and
includes all <book> elements unchanged.

    % xtransf -c bib -e paper ... -t [some template] -e * -k . -t $1

Similarly: now it includes unchanged all elements other than <paper>
(for which it does some transformation).


3. xinclude

This takes an XML file containing 'xlink' elements, and includes all
files pointed to by some of these xlinks. The pointers can be either
to local files or to URLs. The user should have control over which
xlink elements to expand, and which to leave unchanged.

The typical application is in conjunction with the 'file2xml' tool,
which creates an XML file with xlinks to all files in a directory
hierarchy. Then 'xinclude' can be used to actually include some of
those files in the XML document.

For example, assume that directory ~/data contains 500 XML files,
d001.xml, d002.xml, ..., d500.xml.
Then one should be able to concatenate them into a single XML file
with a command like:

    % file2xml ~/data | xinclude > output.xml

[more switches are probably needed, and perhaps additional usages of
xsort and xgrep]


4. xjoin.

This takes two (or more?) XML files and joins them into a single
file. The granularity of the join and the key should be specified by
the user. The join algorithm should make certain assumptions on the
data, e.g. that it is already sorted with 'xsort', and use as little
memory as possible. When users want to join two large XML files, they
are expected to run xsort first, then run xjoin.

For example:

    % xjoin -c1 /bib -e1 paper -k1 author/text() bib.xml
            -c2 /persons -e2 person -k2 name/text() persons.txt

will produce an output file obtained by joining the <author> fields
in the bib.xml file with the <name> field of the persons.txt file
(presumably adding more data to the <paper> and <book> elements in
<bib>).

The semantics of xjoin needs to be carefully specified: the
description here is clearly incomplete.


5. xenumerate.

This is a very simple tool that enumerates all items in an XML
document, under a given context. For example:

    % xenumerate -c /bib -e paper bib.xml

will produce something like this:

    <bib>
      <paper number="0"> . . . </paper>
      <book> . . . </book>
      <paper number="1"> . . . </paper>
      <paper number="2"> . . . </paper>
      . . .
    </bib>

As another example:

    % xenumerate -c /bib/* -e author bib.xml

will enumerate the authors inside each publication:

    <bib>
      <paper>
        <author number="0"> . . . </author>
        <author number="1"> . . . </author>
        . . .
      </paper>
      <book>
        <author number="0"> . . . </author>
        <author number="1"> . . . </author>
        . . .
      </book>
      . . .
    </bib>


6. Implement SIX with DIME.

SIX is a binary index that allows every tool in the toolkit to run
faster by skipping portions of the XML document. Currently the SIX is
a separate file.
For example, the command:

    createSindex -t 100 dblp.xml > dblp.six

creates a new file dblp.six, and every tool that operates on dblp.xml
will use dblp.six, if available.

This project would insert the index in the XML document using the
DIME standard:

    createSindexDime -t 100 dblp.xml > dblpNew.xml

creates a new XML file, dblpNew.xml, which has the SIX embedded using
the DIME standard. DIME allows XML documents to embed binary data.

This project involves two parts. The first is creating the new tool
'createSindexDime' (a better name is needed, of course!). The second
is modifying the XML parser to extract the SIX from the DIME.
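
The two parts of this project can be sketched as a round trip: one
function bundles the binary index with the XML document into a single
byte stream, and another recovers the index before the XML is handed
to the parser. The sketch below is illustrative only. It assumes a
simple length-prefixed record layout and does NOT implement the actual
DIME record format; the function names (pack_records, unpack_records,
split_index_and_xml) and the record labels "six" and "xml" are
hypothetical and do not come from the toolkit.

```python
import struct

# Hypothetical record layout (an assumption, NOT the DIME wire format):
#   2-byte big-endian label length, 4-byte big-endian payload length,
#   then the ASCII label bytes, then the payload bytes.
HEADER = struct.Struct(">HI")


def pack_records(records):
    """Concatenate (label, payload) pairs into one byte string."""
    out = bytearray()
    for label, payload in records:
        name = label.encode("ascii")
        out += HEADER.pack(len(name), len(payload))
        out += name
        out += payload
    return bytes(out)


def unpack_records(blob):
    """Yield (label, payload) pairs from bytes built by pack_records."""
    pos = 0
    while pos < len(blob):
        if pos + HEADER.size > len(blob):
            raise ValueError("truncated record header")
        name_len, data_len = HEADER.unpack_from(blob, pos)
        pos += HEADER.size
        end = pos + name_len + data_len
        if end > len(blob):
            raise ValueError("truncated record body")
        label = blob[pos:pos + name_len].decode("ascii")
        payload = blob[pos + name_len:end]
        yield label, payload
        pos = end


def split_index_and_xml(blob):
    """Return (index bytes or None, XML bytes) from a packed container."""
    parts = dict(unpack_records(blob))
    return parts.get("six"), parts["xml"]
```

In this sketch the creating tool would write the index record first,
e.g. pack_records([("six", index_bytes), ("xml", xml_bytes)]), so that
a reader can obtain the index before it starts scanning the XML. A
container with no "six" record yields None for the index, which
matches the "if available" behavior described for the separate
dblp.six file.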