.TL
Navigating Large XML Documents on Small Devices
.AU
Roger Peppe
.AI
Vita Nuova
.br
April 2002
.AB
Browsing eBooks on platforms with limited memory presents an
interesting problem: how can memory usage be bounded despite
the need to view documents that may be much larger than the
available memory? A simple interface to an XML parser enables
this whilst retaining much of the ease of access afforded
by XML parsers that read all of a document into memory at once.
.AE
.SH
Introduction
.LP
The Open eBook Publication Structure was devised by the Open eBook Forum
in order to ``provide a specification for representing the content of electronic
books''. It is based on many existing standards, notably XML and HTML.
An Open eBook publication consists of a set of documents bound together
with an Open eBook package file which enumerates all the documents,
pictures and other items that make up the book.
.LP
The underlying document format is essentially HTML compatible,
which is where the first problem arises: HTML was not designed to
make it easy to view partial sections of a document. Conventionally
an entire HTML document is read in at once and rendered onto
the device. When viewing an eBook on a limited-memory device,
however, this may not be possible; books tend to be fairly large.
For such a device, the ideal format would keep the book itself
in non-volatile storage (e.g. flash or disk) and make it possible
for the reader to seek to an arbitrary position in the book and render
what it finds there.
.LP
This is not possible in an HTML or XML document, as the
arbitrarily nested nature of the format means that every
position in the document has some unknown surrounding context,
which cannot be discovered without reading sequentially through
the document from the beginning.
.SH
SAX and DOM
.LP
There are two conventional programming interfaces to an XML
parser. A SAX parser provides a stream of XML entities, leaving
it up to the application to maintain the context. It is not possible
to rewind the stream, except, perhaps, to the beginning.
Using a SAX parser is
fairly straightforward, but awkward: the stream-like nature
of the interface does not map well to the tree-like structure
that is XML. A DOM parser reads a whole document into an internal
data structure representation, so a program can treat it exactly
as a tree. This also enables a program to access parts of the
document in an arbitrary order.
The DOM approach is all very well for small documents, but for large
documents the memory usage can rapidly grow to exceed
the available memory capacity. For eBook documents, this is unacceptable.
.SH
A different approach
.LP
The XML parser used in the eBook browser is akin to a SAX parser,
in that only a little of the XML structure is held in memory at one time.
The first significant difference is that the XML entities returned are
taken from one level of the tree: if the program does not wish to
see the contents of a particular XML tag, it is trivial to skip over it.
The second significant difference is that random access is possible.
This possibility comes from the observation that if we have visited
a part of the document we can record the context that we found there
and restore it later if necessary. In this scheme, if we wish to return later to
the part of the document that we are currently at, we can create a ``mark'',
a token that holds the current context; at some later time we can use
that mark to return to this position.
.LP
The eBook browser uses this technique to enable random access
to the document on a page-by-page basis. Moreover, a mark
can be written to external storage, thus allowing an external
``index'' into the document, so it is not always necessary to
read the entire document from the start in order to jump to a particular
page in that document.
.SH
The programming interface
.LP
The interface is implemented by a module named
.CW Xml ,
which provides a
.CW Parser
adt which gives access to the contents of an XML document.
XML items are represented by an
.CW Item
pick adt with one branch of the pick corresponding to each
type of item that might be encountered.
.LP
The interface to the parser looks like this:
.P1
open: fn(f: string, warning: chan of (Locator, string)): (ref Parser, string);
Parser: adt {
    next:     fn(p: self ref Parser): ref Item;
    down:     fn(p: self ref Parser);
    up:       fn(p: self ref Parser);
    mark:     fn(p: self ref Parser): ref Mark;
    atmark:   fn(p: self ref Parser, m: ref Mark): int;
    goto:     fn(p: self ref Parser, m: ref Mark);
    str2mark: fn(p: self ref Parser, s: string): ref Mark;
};
.P2
To start parsing an XML document, it must first be
.CW open ed;
.CW warning
is a channel on which non-fatal error messages will be sent
if they are encountered during the parsing of the document.
It can be nil, in which case warnings are ignored.
If the document is opened successfully, a new
.CW Parser
adt, say
.I p ,
is returned.
Calling
.CW \fIp\fP.next
returns the next XML item at the current level of the tree. If there
are no more items in the current branch at the current level, it
returns
.CW nil .
When a
.CW Tag
item is returned,
.CW \fIp\fP.down
can be used to descend ``into'' that tag; subsequent calls of
.CW \fIp\fP.next
will return XML items contained within the tag,
and
.CW \fIp\fP.up
returns to the previous level.
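.LP
As a minimal sketch of how these calls fit together, the following
module prints the name of every tag in a document, indented by depth.
It is an illustration under stated assumptions, not part of the eBook browser:
the module name
.CW Walk ,
the call to
.CW xml->init ,
and the include of
.CW bufio.m
are assumptions about the installed library, and
.CW open
is called with the two arguments shown above, which may differ
from the signature in a given release of
.CW xml.m .

```limbo
implement Walk;

include "sys.m";
	sys: Sys;
include "draw.m";
include "bufio.m";	# assumption: some versions of xml.m refer to Bufio
include "xml.m";
	xml: Xml;
	Parser, Item: import xml;

Walk: module
{
	init: fn(nil: ref Draw->Context, argv: list of string);
};

init(nil: ref Draw->Context, argv: list of string)
{
	sys = load Sys Sys->PATH;
	if (len argv != 2) {
		sys->print("usage: walk file.xml\n");
		return;
	}
	xml = load Xml Xml->PATH;
	if (xml == nil) {
		sys->print("cannot load %s: %r\n", Xml->PATH);
		return;
	}
	xml->init();	# assumption: the module needs initialising before use
	(p, err) := xml->open(hd tl argv, nil);	# nil: ignore warnings
	if (p == nil) {
		sys->print("cannot open %s: %s\n", hd tl argv, err);
		return;
	}
	walk(p, "");
}

# Print each tag at the current level, then recurse into it.
walk(p: ref Parser, indent: string)
{
	while ((gi := p.next()) != nil) {
		pick i := gi {
		Tag =>
			sys->print("%s%s\n", indent, i.name);
			p.down();	# subsequent next calls see the tag's contents
			walk(p, indent + "  ");
			p.up();	# back to the level containing the tag
		}
	}
}
```

Only the chain of tags enclosing the current position is live at any moment,
so the depth of recursion, not the size of the document, governs how much
state the walk itself holds.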
.LP
An
.CW Item
is a pick adt:
.P1
Item: adt {
    fileoffset: int;
    pick {
    Tag =>
        name:   string;
        attrs:  Attributes;
    Text =>
        ch:     string;
        ws1, ws2: int;
    Process =>
        target: string;
        data:   string;
    Doctype =>
        name:   string;
        public: int;
        params: list of string;
    Stylesheet =>
        attrs:  Attributes;
    Error =>
        loc:    Locator;
        msg:    string;
    }
};
.P2
.CW Item.Tag
represents an XML tag, empty or not. The XML
fragments
.CW "<tag></tag>" '' ``
and
.CW "<tag />" '' ``
look identical from the point of view of this interface.
A
.CW Text
item holds text found in between tags, with adjacent whitespace merged
and whitespace at the beginning and end of the text elided.
The fields
.CW ws1
and
.CW ws2
are non-zero if there was originally whitespace at the beginning
or end of the text respectively.
.CW Process
represents an XML processing request, as found between
.CW "<?....?>" '' ``
delimiters.
.CW Doctype
and
.CW Stylesheet
are items found in an XML document's prolog, the
former representing a
.CW "<!DOCTYPE...>" '' ``
document type declaration, and the latter an XML
stylesheet processing request.
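.LP
Because leading and trailing whitespace is elided from
.CW ch ,
a program that joins adjacent pieces of text needs
.CW ws1
and
.CW ws2
to decide where a word break belongs.
The helper below is a small sketch of that idea; the name
.CW spaced
is hypothetical, and it assumes that
.CW Item
has been imported from the
.CW Xml
module.

```limbo
# Return the text of t with a single space standing in for any
# whitespace that the parser elided at either end.
spaced(t: ref Item.Text): string
{
	s := t.ch;
	if (t.ws1 != 0)
		s = " " + s;	# there was whitespace before the text
	if (t.ws2 != 0)
		s += " ";	# there was whitespace after the text
	return s;
}
```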
.LP
Most applications that process documents
will wish to ignore all items other than
.CW Tag
and
.CW Text .
To this end, it is conventional to define a ``front-end'' function
to return desired items, discard others, and take an appropriate
action when an error is encountered. Here's an example:
.P1
nextitem(p: ref Parser): ref Item
{
    while ((gi := p.next()) != nil) {
        pick i := gi {
        Error =>
            sys->print("error at %s:%d: %s\en",
                i.loc.systemid, i.loc.line, i.msg);
            exit;
        Process =>
            ;   # ignore
        Stylesheet =>
            ;   # ignore
        Doctype =>
            ;   # ignore
        * =>
            return gi;
        }
    }
    return nil;
}
.P2
When
.CW nextitem
encounters an error, it exits; it might instead handle the
error another way, say by raising an exception to be caught at the
outermost level of the parsing code.
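.LP
The exception-raising alternative might look something like the sketch below.
It uses the
.CW raise
statement and
.CW exception
handler of current Limbo; older releases of the language handled
exceptions differently.
The
.CW xml:
prefix on the exception string and the function names
.CW parse
and
.CW document
are assumptions made for illustration, with
.CW document
standing for whatever function walks the tree by calling
.CW nextitem .

```limbo
# Variant of nextitem that raises an exception on error instead of exiting.
nextitem(p: ref Parser): ref Item
{
	while ((gi := p.next()) != nil) {
		pick i := gi {
		Error =>
			raise sys->sprint("xml:%s:%d: %s",
				i.loc.systemid, i.loc.line, i.msg);
		Process or Stylesheet or Doctype =>
			;	# ignore
		* =>
			return gi;
		}
	}
	return nil;
}

# Outermost level of the parsing code: one handler catches an error
# raised at any depth of the tree walk.
parse(p: ref Parser)
{
	{
		document(p);	# hypothetical: any function built on nextitem
	} exception e {
	"xml:*" =>
		sys->print("parse failed: %s\n", e);
	}
}
```

The intermediate functions need no error checks of their own, which keeps
them as short as the versions that simply exit.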
.SH
A small example
.LP
Suppose we have an XML document that contains some data that we would
like to extract, ignoring the rest of the document. For this example we will
assume that the data is held within
.CW <data>
tags, which contain zero or more
.CW <item>
tags, holding the actual data as text within them.
Tags that we do not recognize we choose to ignore.
So for example, given the following XML document:
.P1
<metadata>
    <a>hello</a>
    <b>goodbye</b>
</metadata>
<data>
    <item>one</item>
    <item>two</item>
    <item>three</item>
</data>
<data>
    <item>four</item>
</data>
.P2
we wish to extract all the data items, but ignore everything inside
the
.CW <metadata>
tag. First, let us define another little convenience function to get
the next XML tag, ignoring extraneous items:
.P1
nexttag(p: ref Parser): ref Item.Tag
{
    while ((gi := nextitem(p)) != nil) {
        pick i := gi {
        Tag =>
            return i;
        }
    }
    return nil;
}
.P2
Assuming that the document has already been opened,
the following function scans through the document, looking
for top-level
.CW <data>
tags, and ignoring others:
.P1
document(p: ref Parser)
{
    while ((i := nexttag(p)) != nil) {
        if (i.name == "data") {
            p.down();
            data(p);
            p.up();
        }
    }
}
.P2
The function to parse a
.CW <data>
tag is almost as straightforward; it scans for
.CW <item>
tags and extracts any textual data contained therein:
.P1
data(p: ref Parser)
{
    while ((i := nexttag(p)) != nil) {
        if (i.name == "item") {
            p.down();
            # only the first item inside <item> is examined
            if ((gni := p.next()) != nil) {
                pick ni := gni {
                Text =>
                    sys->print("item data: %s\en", ni.ch);
                }
            }
            p.up();
        }
    }
}
.P2
The above program works well enough, but
suppose that the document that we are parsing is very
large, with data items scattered through its length, and that
we wish to access those items in an order that is not necessarily
that in which they appear in the document.
This is quite straightforward: every time we see a
.CW <data>
tag, we record the position just inside it with a mark.
Assuming the global declaration:
.P1
marks: list of ref Mark;
.P2
the
.CW document
function might become:
.P1
document(p: ref Parser)
{
    while ((i := nexttag(p)) != nil) {
        if (i.name == "data") {
            p.down();
            marks = p.mark() :: marks;
            p.up();
        }
    }
}
.P2
At some later time, we can access the data items arbitrarily,
for instance:
.P1
    for (m := marks; m != nil; m = tl m) {
        p.goto(hd m);
        data(p);
    }
.P2
If we wish to store the data item marks in some external index
(in a file, perhaps), the
.CW Mark
adt provides a
.CW str
function which returns a string representation of the mark.
.CW Parser 's
.CW str2mark
function can later be used to recover the mark. Care must
be taken that the document it refers to has not been changed;
otherwise the mark is likely to be invalid.
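.LP
A minimal sketch of such an external index follows, writing one mark per line
and reading the lines back with
.CW str2mark .
The function names
.CW savemarks
and
.CW loadmarks
are hypothetical.
The sketch assumes that
.CW sys
and
.CW bufio
have been loaded, that
.CW Parser
and
.CW Mark
are imported from the
.CW Xml
module, that the string returned by
.CW str
contains no newline, and that
.CW str2mark
returns nil for a string it cannot interpret.

```limbo
# Write the string form of each mark to path, one per line.
# Returns 0 on success, -1 if the file cannot be created.
savemarks(path: string, ml: list of ref Mark): int
{
	fd := sys->create(path, Sys->OWRITE, 8r644);
	if (fd == nil)
		return -1;
	for (; ml != nil; ml = tl ml)
		sys->fprint(fd, "%s\n", (hd ml).str());
	return 0;
}

# Read the index back for use with parser p.
# Marks are returned in the order in which they were written.
loadmarks(p: ref Parser, path: string): list of ref Mark
{
	f := bufio->open(path, Bufio->OREAD);
	if (f == nil)
		return nil;
	rev: list of ref Mark;
	while ((s := f.gets('\n')) != nil) {
		# gets keeps the line terminator; strip it
		if (len s > 0 && s[len s - 1] == '\n')
			s = s[0:len s - 1];
		if ((m := p.str2mark(s)) != nil)	# assumption: nil if malformed
			rev = m :: rev;
	}
	# the loop built the list backwards; restore file order
	out: list of ref Mark;
	for (; rev != nil; rev = tl rev)
		out = hd rev :: out;
	return out;
}
```

Nothing in the sketch detects that the document has changed since the index
was written; a real index would need some check of that kind before its marks
are trusted.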
.SH
The eBook implementation
.LP
The Open eBook reader software uses the primitives described above
to maintain display-page-based access to arbitrarily large documents
while trying to bound memory usage.
Unfortunately it is difficult to unconditionally bound memory usage,
given that any element in an XML document may be arbitrarily
large. For instance a perfectly legal document might have 100MB
of continuous text containing no tags whatsoever. The described
interface would attempt to put all this text in one single item, rapidly
running out of memory! Similar problems can occur when
gathering the items necessary to format a particular tag.
For instance, to format the first row of a table, it is necessary to lay out
the entire table to determine the column widths.
.LP
I chose to make the simplifying assumption that top-level items within
the document would be small enough to fit into memory.
From the point of view of the display module, the document
looks like a simple sequence of items, one after another.
One item might cover more than one page, in which case a different
part of it will be displayed on each of those pages.
.LP
One difficulty is that the displayed size of an item depends on many
factors, such as stylesheet parameters, size of installed fonts, etc.
When a document is read, the page index must have been created
from the same document with the same parameters. It is difficult in
general to enumerate all the relevant parameters; they would need
to be stored inside, or alongside, the index, and any change would invalidate
the index. Instead of doing this, as the document is being displayed,
the eBook display program constantly checks whether the results
it is getting from the index match the results it is getting
when actually laying out the document. If the results differ, the
index is remade; the discrepancy will hopefully not be noticed by
the user!