xref: /inferno-os/doc/ebookimp.ms (revision 46439007cf417cbd9ac8049bb4122c890097a0fa)
1.TL
2Navigating Large XML Documents on Small Devices
3.AU
4Roger Peppe
5.AI
6Vita Nuova
7.br
8April 2002
9.AB
10Browsing eBooks on platforms with limited memory presents an
11interesting problem: how can memory usage be bounded despite
12the need to view documents that may be much larger than the
13available memory. A simple interface to an XML parser enables
14this whilst retaining much of the ease of access afforded
15by XML parsers that read all of a document into memory at once.
16.AE
17.SH
18Introduction
19.LP
20The Open Ebook Publication Structure was devised by the Open Ebook Forum
21in order to ``provide a specification for representing the content of electronic
22books''. It is based on many existing standards, notably XML and HTML.
23An Open eBook publication consists of a set of documents bound together
24with an Open eBook package file which enumerates all the documents,
25pictures and other items that make up the book
26.LP
27The underlying document format is essentially HTML compatible,
28which is where the first problem arises: HTML was not designed to
29make it easy to view partial sections of a document. Conventionally
30an entire HTML document is read in at once and rendered onto
31the device. When viewing an eBook on a limited-memory device,
32however, this may not be possible; books tend to be fairly large.
33For such a device, the ideal format would keep the book itself
34in non-volatile storage (e.g. flash or disk) and make it possible
35for reader to seek to an arbitrary position in the book and render
36what it finds there.
37.LP
38This is not possible in an HTML or XML document, as the
39arbitrarily nested nature of the format means that every
40position in the document has some unknown surrounding context,
41which cannot be discovered without reading sequentially through
42the document from the beginning.
43.SH
44SAX and DOM
45.LP
46There are two conventional programming interfaces to an XML
47parser. A SAX parser provides a stream of XML entities, leaving
48it up to the application to maintain the context. It is not possible
49to rewind the stream, except, perhaps, to the beginning.
50Using a SAX parser is
51fairly straightforward, but awkward: the stream-like nature
52of the interface does not map well to the tree-like structure
53that is XML. A DOM parser reads a whole document into an internal
54data structure representation, so a program can treat it exactly
55as a tree. This also enables a program to access parts of the
56document in an arbitrary order.
57The DOM approach is all very well for small documents, but for large
58documents the memory usage can rapidly grow to exceed
59the available memory capacity. For eBook documents, this is unacceptable.
60.SH
61A different approach
62.LP
63The XML parser used in the eBook browser is akin to a SAX parser,
64in that only a little of the XML structure is held in memory at one time.
65The first significant difference is that the XML entities returned are
66taken from one level of the tree - if the program does not wish to
67see the contents of a particular XML tag, it is trivial to skip over.
68The second significant difference is that random access is possible.
69This possibility comes from the observation that if we have visited
70a part of the document we can record the context that we found there
71and restore it later if necessary. In this scheme, if we wish to return later to
72a part of a document that we are currently at, we can create a ``mark'',
73a token that holds the current context; at some later time we can use
74that mark to return to this position.
75.LP
76The eBook browser uses this technique to enable random access
77to the document on a page-by-page basis. Moreover a mark
78can be written to external storage, thus allowing an external
79``index'' into the document so it is not always necessary to
80read the entire document from the start in order to jump to a particular
81page in that document.
82.SH
83The programming interface
84.LP
85The interface is implemented by a module named
86.CW Xml ,
87which provides a
88.CW Parser
89adt which gives access to the contents of an XML document.
90Xml items are represented by an
91.CW Item
92pick adt with one branch of the pick corresponding to each
93type of item that might be encountered.
94.LP
95The interface to the parser looks like this:
96.P1
97open: fn(f: string, warning: chan of (Locator, string)): (ref Parser, string);
98Parser: adt {
99    next:       fn(p: self ref Parser): ref Item;
100    down:   fn(p: self ref Parser);
101    up:     fn(p: self ref Parser);
102    mark:   fn(p: self ref Parser): ref Mark;
103    atmark: fn(p: self ref Parser, m: ref Mark): int;
104    goto:   fn(p: self ref Parser, m: ref Mark);
105    str2mark:   fn(p: self ref Parser, s: string): ref Mark;
106};
107.P2
108To start parsing an XML document, it must first be
109.CW open ed;
110.CW warning
111is a channel on which non-fatal error messages will be sent
112if they are encountered during the parsing of the document.
113It can be nil, in which case warnings are ignored.
114If the document is opened successfully, a new
115.CW Parser
116adt, say
117.I p ,
118is returned.
119Calling
120.CW \fIp\fP.next
121returns the next XML item at the current level of the tree. If there
122are no more items in the current branch at the current level, it
123returns
124.CW nil .
125When a
126.CW Tag
127item is returned,
128.CW \fIp\fP.down
129can be used to descend ``into'' that tag; subsequent calls of
130.CW \fIp\fP.next
131will return XML items contained within the tag,
132and
133.CW \fIp\fP.up
134returns to the previous level.
135.LP
136An
137.CW Item
138is a pick adt:
139.P1
140Item: adt {
141    fileoffset: int;
142    pick {
143    Tag =>
144        name:   string;
145        attrs:      Attributes;
146    Text =>
147        ch:     string;
148        ws1, ws2: int;
149    Process =>
150        target: string;
151        data:       string;
152    Doctype =>
153        name:   string;
154        public: int;
155        params: list of string;
156    Stylesheet =>
157        attrs:      Attributes;
158    Error =>
159        loc:        Locator;
160        msg:        string;
161    }
162};
163.P2
164.CW Item.Tag
165represents a XML tag, empty or not. The XML
166fragments
167.CW "<tag></tag>" '' ``
168and
169.CW "<tag />" '' ``
170look identical from the point of view of this interface.
171A
172.CW Text
173item holds text found in between tags, with adjacent whitespaces merged
174and whitespace at the beginning and end of the text elided.
175.CW Ws1
176and
177.CW ws2
178are non-zero if there was originally whitespace at the beginning
179or end of the text respectively.
180.CW Process
181represents an XML processing request, as found between
182.CW "<?....?>" '' ``
183delimiters.
184.CW Doctype
185and
186.CW Stylesheet
187are items found in an XML document's prolog, the
188former representing a
189.CW "<!DOCTYPE...>" '' ``
190document type declaration, and the latter an XML
191stylesheet processing request.
192.LP
193When most applications are processing documents, they
194will wish to ignore all items other than
195.CW Tag
196and
197.CW Text .
198To this end, it is conventional to define a ``front-end'' function
199to return desired items, discard others, and take an appropriate
200action when an error is encountered. Here's an example:
201.P1
202nextitem(p: ref Parser): ref Item
203{
204    while ((gi := p.next()) != nil) {
205        pick i := gi {
206        Error =>
207            sys->print("error at %s:%d: %s\n",
208                i.loc.systemid, i.loc.line, i.msg);
209            exit;
210        Process =>
211            ;   # ignore
212        Stylesheet  =>
213            ;   # ignore
214        Doctype =>
215            ;   # ignore
216        * =>
217            return gi;
218        }
219    }
220    return nil;
221}
222.P2
223When
224.CW nextitem
225encounters an error, it exits; it might instead handle the
226error another way, say by raising an exception to be caught at the
227outermost level of the parsing code.
228.SH
229A small example
230.LP
231Suppose we have an XML document that contains some data that we would
232like to extract, ignoring the rest of the document. For this example we will
233assume that the data is held within
234.CW <data>
235tags, which contain zero or more
236.CW <item>
237tags, holding the actual data as text within them.
238Tags that we do not recognize we choose to ignore.
239So for example, given the following XML document:
240.P1
241<metadata>
242    <a>hello</a>
243    <b>goodbye</b>
244</metadata>
245<data>
246    <item>one</item>
247    <item>two</item>
248    <item>three</item>
249</data>
250<data>
251    <item>four</item>
252</data>
253.P2
254we wish to extract all the data items, but ignore everything inside
255the
256.CW <metadata>
257tag. First, let us define another little convenience function to get
258the next XML tag, ignoring extraneous items:
259.P1
260nexttag(p: ref Parser): ref Item.Tag
261{
262    while ((gi := nextitem(p)) != nil) {
263        pick i := gi {
264        Tag =>
265            return i;
266        }
267    }
268    return nil;
269}
270.P2
271Assuming that the document has already been opened,
272the following function scans through the document, looking
273for top level
274.CW <data>
275tags, and ignoring others:
276.P1
277document(p: ref Parser)
278{
279    while ((i := nexttag(p)) != nil) {
280        if (i.name == "data") {
281            p.down();
282            data(p);
283            p.up();
284        }
285    }
286}
287.P2
288The function to parse a
289.CW <data>
290tag is almost as straightforward; it scans for
291.CW <item>
292tags and extracts any textual data contained therein:
293.P1
294data(p: ref Parser)
295{
296    while ((i := nexttag(p)) != nil) {
297        if (i.name == "item") {
298            p.down();
299            if ((gni := p.next()) != nil) {
300                pick ni := gni {
301                Text =>
302                    sys->print("item data: %s\n", ni.ch);
303                }
304            }
305            p.up();
306        }
307    }
308}
309.P2
310The above program is all very well and works fine, but
311suppose that the document that we are parsing is very
312large, with data items scattered through its length, and that
313we wish to access those items in an order that is not necessarily
314that in which they appear in the document.
315This is quite straightforward; every time we see a
316data item, we record the current position with a mark.
317Assuming the global declaration:
318.P1
319marks: list of ref Mark;
320.P2
321the
322.CW document
323function might become:
324.P1
325document(p: ref Parser)
326{
327    while ((i := nexttag(p)) != nil) {
328        if (i.name == "data") {
329            p.down();
330            marks = p.mark() :: marks;
331            p.up();
332        }
333    }
334}
335.P2
336At some later time, we can access the data items arbitrarily,
337for instance:
338.P1
339    for (m := marks; m != nil; m = tl m) {
340        p.goto(hd m);
341        data(p);
342    }
343.P2
344If we wish to store the data item marks in some external index
345(in a file, perhaps), the
346.CW Mark
347adt provides a
348.CW str
349function which returns a string representation of the mark.
350.CW Parser 's
351.CW str2mark
352function can later be used to recover the mark. Care must
353be taken that the document it refers to has not been changed,
354otherwise it is likely that the mark will be invalid.
355.SH
356The eBook implementation
357.LP
358The Open eBook reader software uses the primitives described above
359to maintain display-page-based access to arbitrarily large documents
360while trying to bound memory usage.
361Unfortunately it is difficult to unconditionally bound memory usage,
362given that any element in an XML document may be arbitrarily
363large. For instance a perfectly legal document might have 100MB
364of continuous text containing no tags whatsoever. The described
365interface would attempt to put all this text in one single item, rapidly
366running out of memory! Similar types of problems can occur when
367gathering the items necessary to format a particular tag.
368For instance, to format the first row of a table, it is necessary to lay out
369the entire table to determine the column widths.
370.LP
371I chose to make the simplifying assumption that top-level items within
372the document would be small enough to fit into memory.
373From the point of view of the display module, the document
374looks like a simple sequence of items, one after another.
375One item might cover more than one page, in which case a different
376part of it will be displayed on each of those pages.
377.LP
378One difficulty is that the displayed size of an item depends on many
379factors, such as stylesheet parameters, size of installed fonts, etc.
380When a document is read, the page index must have been created
381from the same document with the same parameters. It is difficult in
382general to enumerate all the relevant parameters; they would need
383to be stored inside, or alongside the index; any change would invalidate
384the index. Instead of doing this, as the document is being displayed,
385the eBook display program constantly checks to see if the results
386it is getting from the index match with the results it is getting
387when actually laying out the document. If the results differ, the
388index is remade; the discrepancy will hopefully not be noticed by
389the user!
390