(list-of xml problems)
These notes covers XML processing using the SSAX libraries.
XML is Scheme! For example compare:
<document><xml>This is XML!</xml></document>
to
(document (xml "This is XML!"))
The Scheme code has the same structure as the XML, but is less verbose. With quoting we can write XML data (in our Scheme format) directly in our code:
(define document '(document (xml "This is XML!")))
Our mapping of XML to Scheme is called SXML. We've shown how to convert tags. To convert attributes we use a list of lists that starts with a @. For example:
<a href="http://schematics.sourceforge.net/">Schematics</a>
in SXML is
(a (@ (href "http://schematics.sourceforge.net/")) "Schematics")
That's all we'll need for now.
We've introduced a new data type, the string, on the sly. A string is an array (or vector) of characters. Basically, it represents written text. We can create strings by enclosing text in double quotes ("). That's about all we'll need to know about strings for now.
We're going to use a lot of xHTML examples. If you aren't familiar with xHTML, it's the XML format used to display web pages. (If you are familiar with xHTML, shut up! I know that isn't entirely true, but it's true enough for our purposes.)
All you need to know about HTML is:
There is some boilerplate text you need to put around an HTML document. It looks like this:
<html> <head><title>Your Title Here</title></head> <body> Your text here </body> </html>
Inside the body you can use the a number of tags:
For headings use
<h1>The Heading</h1>
Replace h1 with any value from h2 to h6 for subsection headings.
For paragraphs use
<p>The text</p>
For hyperlinks use
<a href="the url">The link text</a>
Encode a simple HTML document in SXML. Don't make it too complicated, we'll be using it later. Make sure you have a least one heading, one paragraph, and one hyperlink.
We need a data definition of SXML before we can write functions that use SXML. The true data definition is quite complicated, so let's make some simplifying assumptions:
Assume attributes always occur
Assume we already have a data definition for attributes
Consider three SXML documents:
The simplest possible SXML document: ()
A simple document: (document (@ ()) "Hello")
A slightly more complex document (document (@ ()) (text
"Hello"))
From these three examples you should be able to derive a data definition.
Write down the data definition for SXML. Does it cover the HTML document you created earlier? Compare it to the official SXML definition at http://okmij.org/ftp/Scheme/SXML.html, in particular the grammar (section 3).
Write the skeleton code for the natural recursion over our simplified
version of SXML (that is, the cond
expression you'd use to
process our simplified SXML).
Write a function that Microsoftise's your xHTML document. That is, it
replaces all headings with a font tag of the form <font size="a
number">...</font>
, where a number is a number of your choosing.
Write a function that outputs a list of all the URLs (as given in hyperlinks) in your document.
Write a functions that counts the number of paragraphs in your document.
We've used quote
before to create lists using a convenient notation:
'(1 2 3 4)
The important part of quoting, is that is turns off evaluation. It
reads data ``as is''. So '(1 (+ 1 1) 3 4)
is not equal to
'(1 2 3 4)
. It would be nice if we could interleave evaluation
and quoting. That is, if we could unquote data. There is a special
form of quoting, called quasiquote, that allows us to do exactly this.
We introduce quasiquoting using the backtick `
. Within
quasiquoted data the comma ,
turns on evaluation.
So
`(1 ,(+ 1 1) 3 4)
is equal to
'(1 2 3 4)
Notice that the evaluated data is cons
ed onto the list. If we
wanted to append it instead, we'd write
`(1 ,@(list 2 3) 4)
which is equal to
'(1 2 3 4)
Using quasiquote and unquote, write a function (make-link url
text)
that constructs a hyperlink from the given url and text. The
contract of
make-link
is
;; Contract: ;; make-link: string string -> sxml
and an example is
(make-link "http://lshift.net/" "LShift")
evaluates to
(a (@ (href "http://lshift.net/")) "LShift")
Replace the hyperlink in your HTML document with a call to
make-link
.
We've looked at how we can programmatically create SXML. Unsurprisingly, the other way of creating SXML is to parse an existing XML document. This is where we get into the interesting parts of the SSAX toolkit.
SSAX provides a number of parsers, from low- to high-level. You can create your own parser by combining the functions provided in the SSAX toolkit, but we'll just be looking at the high-level parsers.
To use SSAX in PLT Scheme you need to require
the SSAX library
by putting the following lines in your code:
(require (lib "ssax.ss" "ssax") (lib "myenv.ss" "ssax") (lib "input-parse.ss" "ssax"))
Oleg likes to shout, so many of the SSAX procedures have uppercase names. Scheme code is usually case-insensitive, but the PLT port has been made case-sensitive. You need to change your language settings to be case-sensitive to use SSAX.
The simplest way to get parse XML into SXML is:
;; Contract: ;; SSAX:XML->SXML : input-port (list-of (cons symbol string)) -> sxml (SSAX:XML->SXML input-port namespace-prefix)
Create a file containing some XML (e.g. the HTML representation of the web site we've used). Parse this file into SXML using the SSAX:XML->SXML function. Create an input-port using
(open-input-file file-name)
where filename is a string. Use an empty list for the
namespace-prefix
.
There are some procedures to convert SXML to XML that are not exported in the PLT build. I've ported them and made them available as sxml-to-html.ss. The main procedure here is
;; Contract: ;; SXML->HTML : sxml -> string (SXML->HTML tree)
Convert the HTML page we made right at the beginning into an HTML string. Save it to disk using the command
;; Contract: ;; save-html : string string -> void ;; ;; Purpose: ;; Save a HTML string to a file (define (save-html filename string) (with-output-to-file filename (lambda () (display string)) 'replace))
View your page in a web browser. Does it look correct? If not, alter the source and save it again until you've got it working
We're now going to look at a SSAX parser that allows us to configure what happens at each interesting event (element open, element close etc.) The SSAX parser is based on a powerful tree fold, similar to the folds we developed over lists. The fold is defined as:
(define (foldts fdown fup fhere seed tree) (cond [(leaf? tree) (fhere seed (leaf-value tree))] [(node? tree) (fup seed (foldl (lambda (seed tree) (foldts fdown fup fhere seed tree)) (fdown seed) (node-children tree)))]))
Pretty compilcated I think you'll agree. There are two important points:
The functions fdown, fup and fhere are functions that are called on the way down the tree, up the tree and at leaf nodes respectively.
You can store any state you like in the seed, which is passed to all the user-defined functions (fdown, fup and fhere). All these functions should return a suitable seed.
Let's look at a very simple parser, that counts the number of occurrences of paragraphs. Our seed is the number of occurrences, zero initially. On the way down the tree we increment the counter if we've encountered a paragraph. On the way up and at leaf nodes we simply return the current seed. To create this parser in SSAX we use:
(SSAX:make-parser NEW-LEVEL-SEED (lambda (name attrs namespaces content seed) (if (equal? name 'p) (+ 1 seed) seed)) FINISH-ELEMENT (lambda (name attrs namespaces parent-seed seed) seed) CHAR-DATA-HANDLER (lambda (string1 string2 seed) seed))
The result is a function that takes two inputs: a port to read XML from, and the initial value of the seed.
Note that SSAX:make-parser
uses a form of keyword arguments. We
can specify functions to use for all the rubbish that XML supports:
processing instructions, doctype definitions and so on.
Write a function that counts the number of paragraphs in your xHTML tree.
Write a function that counts the number of headings.
Write a function that accumulates all the hyperlink href attributes
into a list. Hint: use the attrs
parameter to get a list of
attributes and their values.
A lot of tasks involve transforming XML from one form to another. We
can write transformations quite simply using SSAX. We already have a
procedure to convert SXML to XML: SXML->HTML
. All we have to do
is convert the XML to the appropriate SXML format. Let's look at the
simplest transform: the transform that does nothing. Ignoring for now
all the XML rubbish of processing instructions and so on, we'll
return to our basic recursive definition of SXML.
On the way down we do nothing (remember fdown only operates on the seed)
On the way up we cons
the current values onto the existing
SXML tree. This implies the seed is the SXML document so far, and
therefore initial seed is the empty list.
That's all we need to know to write the identity transform. From here we can implement almost any transform we want by changing what we do in fup and fhere.
Implement the identity parser. Show it works.
Implement the Microsoftisation transform using SSAX.
The SSAX parser operates on a stream of characters keeping on the minimum amount of data in memory. This is efficient but it is not always convenient to write transformations in terms of the SSAX parser. There are a number of functions in the file sxml-tree-transform.ss that provide useful functions for in-memory transformations of SXML to SXML.
The first procedure is our old friend foldts
:
foldts : fdown fup fhere seed sxml-tree -> sxml-tree
There are two other procedures post-order
and
pre-post-order
.
post-order : sxml-tree bindings -> sxml-tree pre-post-order : sxml-tree bindings -> sxml-tree
The bindings are ``stylesheets'' of a sort. The bindings are a list of elements that have the form:
(name [bindings] . handler)
name
is the name of the node to match, for example
'p
. There are two special names,
*default*
which is used if no other binding is found, and
*text*
which is used if the node is just text.
bindings
is a list of values to prepend to the
current bindings. handler
is a function of a single
argument, the current SXML tree.
The post-order
transform goes down the tree in a depth-first
manner and transforms the tree on the way up. So the handler sees the
tree after all nodes below it have been transformed. The
pre-post-order
transform allows the transform to be applied
on the way down or the way up. We can specify a pre-order (on the way
down) transform by using a binding with the form:
(name *preorder* . <handler>)
The versions of pre-post-order and post-order in
sxml-tree-transform.ss
are slightly different from those in
the default distribution of SSAX. In the default distribution the
handler
functions take multiple arguments, an annoying and
pointless wrinkle IMHO.
Implement a procedure called strip-tags
that returns a list
of all the strings in your HTML document. Use the function
flatten
below to flatten a tree of strings into a list of
strings. You should only need to use a post-order
transform
and the *default*
and *text*
bindings.
;; CONTRACT: ;; flatten : tree -> list ;; ;; PURPOSE: ;; flatten converts a tree (represented as a list of lists) ;; into a flat list containing on the data elements of the tree ;; ;; EXAMPLE: ;; (flatten '((1 2 3) 4 (5 (6)))) = (1 2 3 4 5 6) (define (flatten tree) (cond ((null? tree) '()) (else (cond ((pair? (car tree)) (flatten (append (car tree) (cdr tree)))) (else (cons (car tree) (flatten (cdr tree))))))))
Implement the Microsoftisation transform using post-order
.
Implement a transform that adds heading tags to the document in
example.sxml
. You can load this file used the command
(define document (read (open-input-file "example.sxml")))
Each section
should be changed to an HTML heading
(h1, h2 etc.
in accordance with the depth of nesting. E.g.:
h1
for the top sections, h2
for the subsections
under them, and so on. Use a pre-order transformation to do this.
The pre-order transform should insert the correct heading and then add
new bindings to the transform with the correct heading for
subsections.
There are many times when one wants to collate information scattered
throughout an XML file. Consider generating a table of contents from
an HTML document. Ideally we want to extract all the heading to a
certain level (say h3
) and generate the table of contents
from them. It is tedious to express these steps using SSAX. It would
be much nicer if we had a concise notation to express queries into XML
documents. SXPath, based on the W3C's XPath, is such a notation.
The basic components of SXPath are converters, which express queries, and combinators, which combine converters to give new converters.
A converter has type:
converter : (union node (list-of node)) -> (list-of node)
The basic procedures to create converters are:
node-typeof? : criterion -> (node -> boolean) node-equal? : node -> (node -> boolean) node-pos : number -> converter filter : converter -> converter take-until: converter -> converter take-after: converter -> converter map-union : converter (list-of node) -> converter node-reverse : converter node-trace : string -> converter
The criterion for node-typeof?
above is a symbol, one of the
following:
id
tests if the Node has the right name (id)
@
tests if the Node is an <attributes-coll>
*
tests if the Node is an <Element>
*text*
tests if the Node is a text node
*PI*
tests if the Node is a PI node
*any*
#t
for any type of Node
The combinators in SXPath are:
select-kids : converter -> converter node-self : converter -> converter node-join : converter ... -> converter node-reduce : converter ... -> converter node-or : converter ... -> converter node-closure : converter -> converter node-parent : node -> converter
Ok, any idiot can copy type definitions out of the documentation. How do we use SXPath? Let's do a few examples to get a feel for it. To get started we need to require the SXPath library:
(require (lib "sxml-sxpath.ss" "sxpath"))
We also need to be aware of a little wrinkle in SXPath: from the
point-of-view of converters, the empty list ()
is equivalent
to #f
. So node-typeof?
etc. can be used where-ever
a converter is expected.
The simplest query is to use node-typeof?
. Using the HTML
document (converted to SXML) try:
((node-typeof? 'body) document) ((node-typeof? 'html) document)
The first should return #f
indicating that the root node of
your document doesn't match. The second should return #t
.
The combinator select-kids
returns a converter that given a
node, returns the (ordered) subset of its children that satisfy the input
converter. Try:
((select-kids (node-typeof? 'html)) document) ((select-kids (node-typeof? 'h1)) document)
Note that select-kids
looks no further than the children of the given
node. To look at all descendents we must use the
node-closure
combinator. Try:
((node-closure (node-typeof? 'html)) document) ((node-closure (node-typeof? 'h1)) document) ((node-closure (node-typeof? 'head)) document)
The node-pos
function creates a converter that returns the
nth member of a node-set (a list of nodes). It is most
useful for getting members from a node-set you've already selected
using another converter. Try:
((node-closure (node-typeof? 'p)) document) ((node-pos 1) ((node-closure (node-typeof? 'p)) document)) ((node-pos -1) ((node-closure (node-typeof? 'p)) document))
There is also a short-cut notation which this document is too small to contain. Suffice to say it's almost equivalent to an s-expression encoding of the XPath short-cut notation. The procedure
sxpath : (list-of symbol) -> converter
converts an expression in the short-cut notation into a converter.
Write an SXPath expression that creates a table of contents from an
HTML document: any h1
, h2
or h3
expression. Hint: use node-or
Prove Fermat's last theorem in a manner similar to my exposition of the SXPath short-cut notation.
XSLT takes a different approach to transforming XML based on pattern matching. This allows you to write out what the input should "look" like and what the corresponding output should look like.
There is no direct support in SSAX for this but see WebIt! http://celtic.benderweb.net/webit/
An example of a WebIt! transformation is:
(define rossetti-collection (book title: "Poems of Christina Rossetti" (poem title: "Bitter for Sweet" poet: "Christina Rossetti" tag: "bitter" (stanza (line "Summer is gone with all its roses,") (line "Its sun and perfumes and sweet flowers,") (line "Its warm air and refreshing showers:") (line "And even Autumn closes.")) (stanza (line "Yea, Autumn's chilly self is going,") (line "And winter comes which is yet colder;") (line "Each day the hoar-frost waxes bolder") (line "And the last buds cease blowing."))) )) (define poetry-stylesheet (stylesheet (xml-macro poem (xml-rules ((poem title: t poet: a tag: m (stanza (line l1) (line l) ...) ...) (xml-template (h4:div (h4:p) (h4:a h4:name: m) (h4:strong t) (h4:br) (h4:em a) (list (h4:p) l1 (list (h4:br) l) ...) ...))))) (xml-macro toc (xml-rules ((toc (poem title: t poet: a tag: m . rest) ...) (xml-template (list (h4:p) "Table of Contents:" (h4:ul (h4:li (h4:a h4:href: (string-append "#" m) t)) ...)))))) (xml-micro book (xml-rules ((_ title: bt p ...) (h4:html (h4:head (h4:title (xml-template bt))) (h4:body (h4:h1 (xml-template bt)) (xml-expand (xml-template (toc p ...))) (xml-expand (xml-template (list p ...)))))))) ))
WebIt! is a full-featured XML framework with implementations of SSAX and SXPath in addition to the transformations above.