(list-of xml problems)

Matt Jadud and Noel Welsh

These notes covers XML processing using the SSAX libraries.

1  The Fundamental Symmetry

XML is Scheme! For example compare:

  <document><xml>This is XML!</xml></document>

to

  (document (xml "This is XML!"))

The Scheme code has the same structure as the XML, but is less verbose. With quoting we can write XML data (in our Scheme format) directly in our code:

  (define document '(document (xml "This is XML!")))

Our mapping of XML to Scheme is called SXML. We've shown how to convert tags. To convert attributes we use a list of lists that starts with a @. For example:

  <a href="http://schematics.sourceforge.net/">Schematics</a>

in SXML is

  (a (@ (href "http://schematics.sourceforge.net/")) "Schematics")

That's all we'll need for now.

1.1  Strings

We've introduced a new data type, the string, on the sly. A string is an array (or vector) of characters. Basically, it represents written text. We can create strings by enclosing text in double quotes ("). That's about all we'll need to know about strings for now.

1.2  xHTML

We're going to use a lot of xHTML examples. If you aren't familiar with xHTML, it's the XML format used to display web pages. (If you are familiar with xHTML, shut up! I know that isn't entirely true, but it's true enough for our purposes.)

All you need to know about HTML is:

1.3  Exercises

Encode a simple HTML document in SXML. Don't make it too complicated, we'll be using it later. Make sure you have a least one heading, one paragraph, and one hyperlink.

2  Processing SXML

We need a data definition of SXML before we can write functions that use SXML. The true data definition is quite complicated, so let's make some simplifying assumptions:

Consider three SXML documents:

From these three examples you should be able to derive a data definition.

2.1  Exercises

Write down the data definition for SXML. Does it cover the HTML document you created earlier? Compare it to the official SXML definition at http://okmij.org/ftp/Scheme/SXML.html, in particular the grammar (section 3).

Write the skeleton code for the natural recursion over our simplified version of SXML (that is, the cond expression you'd use to process our simplified SXML).

Write a function that Microsoftise's your xHTML document. That is, it replaces all headings with a font tag of the form <font size="a number">...</font>, where a number is a number of your choosing.

Write a function that outputs a list of all the URLs (as given in hyperlinks) in your document.

Write a functions that counts the number of paragraphs in your document.

3  Creating SXML

3.1  Quote, Quasiquote and Unquote

We've used quote before to create lists using a convenient notation:

  '(1 2 3 4)

The important part of quoting, is that is turns off evaluation. It reads data ``as is''. So '(1 (+ 1 1) 3 4) is not equal to '(1 2 3 4). It would be nice if we could interleave evaluation and quoting. That is, if we could unquote data. There is a special form of quoting, called quasiquote, that allows us to do exactly this. We introduce quasiquoting using the backtick `. Within quasiquoted data the comma , turns on evaluation.

So

  `(1 ,(+ 1 1) 3 4)

is equal to

  '(1 2 3 4)

Notice that the evaluated data is consed onto the list. If we wanted to append it instead, we'd write

  `(1 ,@(list 2 3) 4)

which is equal to

  '(1 2 3 4)

3.1.1  Exercises

Using quasiquote and unquote, write a function (make-link url text) that constructs a hyperlink from the given url and text. The contract of make-link is

  ;; Contract:
  ;;  make-link: string string -> sxml

and an example is

  (make-link "http://lshift.net/" "LShift")

evaluates to

  (a (@ (href "http://lshift.net/")) "LShift")

Replace the hyperlink in your HTML document with a call to make-link.

3.2  Parsing XML

We've looked at how we can programmatically create SXML. Unsurprisingly, the other way of creating SXML is to parse an existing XML document. This is where we get into the interesting parts of the SSAX toolkit.

SSAX provides a number of parsers, from low- to high-level. You can create your own parser by combining the functions provided in the SSAX toolkit, but we'll just be looking at the high-level parsers.

To use SSAX in PLT Scheme you need to require the SSAX library by putting the following lines in your code:

  (require (lib "ssax.ss" "ssax")
           (lib "myenv.ss" "ssax")
           (lib "input-parse.ss" "ssax"))

Oleg likes to shout, so many of the SSAX procedures have uppercase names. Scheme code is usually case-insensitive, but the PLT port has been made case-sensitive. You need to change your language settings to be case-sensitive to use SSAX.

The simplest way to get parse XML into SXML is:

  ;; Contract:
  ;;  SSAX:XML->SXML : input-port (list-of (cons symbol string)) -> sxml
  (SSAX:XML->SXML input-port namespace-prefix)

3.2.1  Exercises

Create a file containing some XML (e.g. the HTML representation of the web site we've used). Parse this file into SXML using the SSAX:XML->SXML function. Create an input-port using

  (open-input-file file-name)

where filename is a string. Use an empty list for the namespace-prefix.

4  Outputting XML

There are some procedures to convert SXML to XML that are not exported in the PLT build. I've ported them and made them available as sxml-to-html.ss. The main procedure here is

  ;; Contract:
  ;;  SXML->HTML : sxml -> string
  (SXML->HTML tree)

4.1  Exercises

Convert the HTML page we made right at the beginning into an HTML string. Save it to disk using the command

  ;; Contract:
  ;;  save-html : string string -> void
  ;;
  ;; Purpose:
  ;;  Save a HTML string to a file
  (define (save-html filename string)
    (with-output-to-file filename
      (lambda ()
        (display string))
      'replace))

View your page in a web browser. Does it look correct? If not, alter the source and save it again until you've got it working

5  The SSAX Parser

We're now going to look at a SSAX parser that allows us to configure what happens at each interesting event (element open, element close etc.) The SSAX parser is based on a powerful tree fold, similar to the folds we developed over lists. The fold is defined as:

  (define (foldts fdown fup fhere seed tree)
    (cond 
      [(leaf? tree) (fhere seed (leaf-value tree))]
      [(node? tree) (fup seed (foldl (lambda (seed tree)
                                       (foldts fdown fup fhere seed
tree))
                                     (fdown seed)
                                     (node-children tree)))]))

Pretty compilcated I think you'll agree. There are two important points:

  1. The functions fdown, fup and fhere are functions that are called on the way down the tree, up the tree and at leaf nodes respectively.

  2. You can store any state you like in the seed, which is passed to all the user-defined functions (fdown, fup and fhere). All these functions should return a suitable seed.

Let's look at a very simple parser, that counts the number of occurrences of paragraphs. Our seed is the number of occurrences, zero initially. On the way down the tree we increment the counter if we've encountered a paragraph. On the way up and at leaf nodes we simply return the current seed. To create this parser in SSAX we use:

  (SSAX:make-parser
    NEW-LEVEL-SEED
    (lambda (name attrs namespaces content seed)
      (if (equal? name 'p)
          (+ 1 seed)
          seed))
    FINISH-ELEMENT
    (lambda (name attrs namespaces parent-seed seed)
      seed)
    CHAR-DATA-HANDLER
    (lambda (string1 string2 seed)
      seed))

The result is a function that takes two inputs: a port to read XML from, and the initial value of the seed.

Note that SSAX:make-parser uses a form of keyword arguments. We can specify functions to use for all the rubbish that XML supports: processing instructions, doctype definitions and so on.

5.1  Exercises

Write a function that counts the number of paragraphs in your xHTML tree.

Write a function that counts the number of headings.

Write a function that accumulates all the hyperlink href attributes into a list. Hint: use the attrs parameter to get a list of attributes and their values.

6  XML Transformations

6.1  SXML Transformations

A lot of tasks involve transforming XML from one form to another. We can write transformations quite simply using SSAX. We already have a procedure to convert SXML to XML: SXML->HTML. All we have to do is convert the XML to the appropriate SXML format. Let's look at the simplest transform: the transform that does nothing. Ignoring for now all the XML rubbish of processing instructions and so on, we'll return to our basic recursive definition of SXML.

  1. On the way down we do nothing (remember fdown only operates on the seed)

  2. On the way up we cons the current values onto the existing SXML tree. This implies the seed is the SXML document so far, and therefore initial seed is the empty list.

That's all we need to know to write the identity transform. From here we can implement almost any transform we want by changing what we do in fup and fhere.

6.1.1  Exercises

Implement the identity parser. Show it works.

Implement the Microsoftisation transform using SSAX.

6.2  In-Memory Transformations

The SSAX parser operates on a stream of characters keeping on the minimum amount of data in memory. This is efficient but it is not always convenient to write transformations in terms of the SSAX parser. There are a number of functions in the file sxml-tree-transform.ss that provide useful functions for in-memory transformations of SXML to SXML.

The first procedure is our old friend foldts:

  foldts : fdown fup fhere seed sxml-tree -> sxml-tree

There are two other procedures post-order and pre-post-order.

  post-order : sxml-tree bindings -> sxml-tree
  pre-post-order : sxml-tree bindings -> sxml-tree

The bindings are ``stylesheets'' of a sort. The bindings are a list of elements that have the form:

  (name [bindings] . handler)

name is the name of the node to match, for example 'p. There are two special names, *default* which is used if no other binding is found, and *text* which is used if the node is just text. bindings is a list of values to prepend to the current bindings. handler is a function of a single argument, the current SXML tree.

The post-order transform goes down the tree in a depth-first manner and transforms the tree on the way up. So the handler sees the tree after all nodes below it have been transformed. The pre-post-order transform allows the transform to be applied on the way down or the way up. We can specify a pre-order (on the way down) transform by using a binding with the form:

  (name *preorder* . <handler>)

6.2.1  Note

The versions of pre-post-order and post-order in sxml-tree-transform.ss are slightly different from those in the default distribution of SSAX. In the default distribution the handler functions take multiple arguments, an annoying and pointless wrinkle IMHO.

6.2.2  Exercises

Implement a procedure called strip-tags that returns a list of all the strings in your HTML document. Use the function flatten below to flatten a tree of strings into a list of strings. You should only need to use a post-order transform and the *default* and *text* bindings.

  ;; CONTRACT:
  ;;   flatten : tree -> list
  ;;
  ;; PURPOSE:
  ;;   flatten converts a tree (represented as a list of lists)
  ;;   into a flat list containing on the data elements of the tree
  ;;
  ;; EXAMPLE:
  ;;   (flatten '((1 2 3) 4 (5 (6)))) = (1 2 3 4 5 6)
  (define (flatten tree)
    (cond ((null? tree) '())
          (else
           (cond ((pair? (car tree))
                  (flatten (append (car tree) (cdr tree))))
                 (else (cons (car tree)
                             (flatten (cdr tree))))))))

Implement the Microsoftisation transform using post-order.

Implement a transform that adds heading tags to the document in example.sxml. You can load this file used the command

  (define document (read (open-input-file "example.sxml")))

Each section should be changed to an HTML heading (h1, h2 etc. in accordance with the depth of nesting. E.g.: h1 for the top sections, h2 for the subsections under them, and so on. Use a pre-order transformation to do this. The pre-order transform should insert the correct heading and then add new bindings to the transform with the correct heading for subsections.

6.3  SXPath

There are many times when one wants to collate information scattered throughout an XML file. Consider generating a table of contents from an HTML document. Ideally we want to extract all the heading to a certain level (say h3) and generate the table of contents from them. It is tedious to express these steps using SSAX. It would be much nicer if we had a concise notation to express queries into XML documents. SXPath, based on the W3C's XPath, is such a notation.

The basic components of SXPath are converters, which express queries, and combinators, which combine converters to give new converters.

A converter has type:

  converter : (union node (list-of node)) -> (list-of node)

The basic procedures to create converters are:

  node-typeof? : criterion -> (node -> boolean)
  node-equal? : node -> (node -> boolean)
  node-pos : number -> converter
  filter : converter -> converter
  take-until: converter -> converter
  take-after: converter -> converter
  map-union : converter (list-of node) -> converter
  node-reverse : converter
  node-trace : string -> converter

The criterion for node-typeof? above is a symbol, one of the following:

The combinators in SXPath are:

  select-kids : converter -> converter
  node-self : converter -> converter
  node-join : converter ... -> converter
  node-reduce : converter ... -> converter
  node-or : converter ... -> converter
  node-closure : converter -> converter
  node-parent : node -> converter

Ok, any idiot can copy type definitions out of the documentation. How do we use SXPath? Let's do a few examples to get a feel for it. To get started we need to require the SXPath library:

(require (lib "sxml-sxpath.ss" "sxpath"))

We also need to be aware of a little wrinkle in SXPath: from the point-of-view of converters, the empty list () is equivalent to #f. So node-typeof? etc. can be used where-ever a converter is expected.

The simplest query is to use node-typeof?. Using the HTML document (converted to SXML) try:

  ((node-typeof? 'body) document)
  ((node-typeof? 'html) document)

The first should return #f indicating that the root node of your document doesn't match. The second should return #t.

The combinator select-kids returns a converter that given a node, returns the (ordered) subset of its children that satisfy the input converter. Try:

  ((select-kids (node-typeof? 'html)) document)
  ((select-kids (node-typeof? 'h1)) document)

Note that select-kids looks no further than the children of the given node. To look at all descendents we must use the node-closure combinator. Try:

  ((node-closure (node-typeof? 'html)) document)
  ((node-closure (node-typeof? 'h1)) document)
  ((node-closure (node-typeof? 'head)) document)

The node-pos function creates a converter that returns the nth member of a node-set (a list of nodes). It is most useful for getting members from a node-set you've already selected using another converter. Try:

  ((node-closure (node-typeof? 'p)) document)
  ((node-pos 1) ((node-closure (node-typeof? 'p)) document))
  ((node-pos -1) ((node-closure (node-typeof? 'p)) document))

There is also a short-cut notation which this document is too small to contain. Suffice to say it's almost equivalent to an s-expression encoding of the XPath short-cut notation. The procedure

  sxpath : (list-of symbol) -> converter

converts an expression in the short-cut notation into a converter.

6.3.1  Exercises

Write an SXPath expression that creates a table of contents from an HTML document: any h1, h2 or h3 expression. Hint: use node-or

Prove Fermat's last theorem in a manner similar to my exposition of the SXPath short-cut notation.

6.4  Transformations By Pattern Matching

XSLT takes a different approach to transforming XML based on pattern matching. This allows you to write out what the input should "look" like and what the corresponding output should look like.

There is no direct support in SSAX for this but see WebIt! http://celtic.benderweb.net/webit/

An example of a WebIt! transformation is:

(define rossetti-collection
  (book title: "Poems of Christina Rossetti"
        
        (poem title: "Bitter for Sweet"
              poet: "Christina Rossetti"
              tag: "bitter"
              (stanza
               (line "Summer is gone with all its roses,")
               (line "Its sun and perfumes and sweet flowers,")
               (line "Its warm air and refreshing showers:")
               (line "And even Autumn closes."))
              (stanza
               (line "Yea, Autumn's chilly self is going,")
               (line "And winter comes which is yet colder;")
               (line "Each day the hoar-frost waxes bolder")
               (line "And the last buds cease blowing.")))
    ))

(define poetry-stylesheet
  (stylesheet
   
   (xml-macro poem
              (xml-rules
               ((poem title: t poet: a tag: m
                      (stanza (line l1) (line l) ...) ...)
                (xml-template
                 (h4:div (h4:p) (h4:a h4:name: m) (h4:strong t) (h4:br) (h4:em a)
                         (list (h4:p) l1 (list (h4:br) l) ...) ...)))))
   
   (xml-macro toc
              (xml-rules
               ((toc (poem title: t poet: a tag: m . rest) ...)
                (xml-template
                 (list
                  (h4:p) "Table of Contents:"
                  (h4:ul (h4:li (h4:a h4:href: (string-append "#" m) t)) ...))))))
   
   (xml-micro book
              (xml-rules
               ((_ title: bt p ...)
                (h4:html
                 (h4:head (h4:title (xml-template bt)))
                 (h4:body
                  (h4:h1 (xml-template bt))
                  (xml-expand (xml-template (toc p ...)))
                  (xml-expand (xml-template (list p ...))))))))
   
   ))

WebIt! is a full-featured XML framework with implementations of SSAX and SXPath in addition to the transformations above.

Last modified: Wed, Jan 8, 2003, 7:17 pm
HTML conversion by TeX2page 4p4k3