xref: /inferno-os/doc/ebookimp.ms (revision 46439007cf417cbd9ac8049bb4122c890097a0fa)
1*46439007SCharles.Forsyth.TL
2*46439007SCharles.ForsythNavigating Large XML Documents on Small Devices
3*46439007SCharles.Forsyth.AU
4*46439007SCharles.ForsythRoger Peppe
5*46439007SCharles.Forsyth.AI
6*46439007SCharles.ForsythVita Nuova
7*46439007SCharles.Forsyth.br
8*46439007SCharles.ForsythApril 2002
9*46439007SCharles.Forsyth.AB
10*46439007SCharles.ForsythBrowsing eBooks on platforms with limited memory presents an
11*46439007SCharles.Forsythinteresting problem: how can memory usage be bounded despite
12*46439007SCharles.Forsyththe need to view documents that may be much larger than the
13*46439007SCharles.Forsythavailable memory. A simple interface to an XML parser enables
14*46439007SCharles.Forsyththis whilst retaining much of the ease of access afforded
15*46439007SCharles.Forsythby XML parsers that read all of a document into memory at once.
16*46439007SCharles.Forsyth.AE
17*46439007SCharles.Forsyth.SH
18*46439007SCharles.ForsythIntroduction
19*46439007SCharles.Forsyth.LP
20*46439007SCharles.ForsythThe Open Ebook Publication Structure was devised by the Open Ebook Forum
21*46439007SCharles.Forsythin order to ``provide a specification for representing the content of electronic
22*46439007SCharles.Forsythbooks''. It is based on many existing standards, notably XML and HTML.
23*46439007SCharles.ForsythAn Open eBook publication consists of a set of documents bound together
24*46439007SCharles.Forsythwith an Open eBook package file which enumerates all the documents,
25*46439007SCharles.Forsythpictures and other items that make up the book
26*46439007SCharles.Forsyth.LP
27*46439007SCharles.ForsythThe underlying document format is essentially HTML compatible,
28*46439007SCharles.Forsythwhich is where the first problem arises: HTML was not designed to
29*46439007SCharles.Forsythmake it easy to view partial sections of a document. Conventionally
30*46439007SCharles.Forsythan entire HTML document is read in at once and rendered onto
31*46439007SCharles.Forsyththe device. When viewing an eBook on a limited-memory device,
32*46439007SCharles.Forsythhowever, this may not be possible; books tend to be fairly large.
33*46439007SCharles.ForsythFor such a device, the ideal format would keep the book itself
34*46439007SCharles.Forsythin non-volatile storage (e.g. flash or disk) and make it possible
35*46439007SCharles.Forsythfor reader to seek to an arbitrary position in the book and render
36*46439007SCharles.Forsythwhat it finds there.
37*46439007SCharles.Forsyth.LP
38*46439007SCharles.ForsythThis is not possible in an HTML or XML document, as the
39*46439007SCharles.Forsytharbitrarily nested nature of the format means that every
40*46439007SCharles.Forsythposition in the document has some unknown surrounding context,
41*46439007SCharles.Forsythwhich cannot be discovered without reading sequentially through
42*46439007SCharles.Forsyththe document from the beginning.
43*46439007SCharles.Forsyth.SH
44*46439007SCharles.ForsythSAX and DOM
45*46439007SCharles.Forsyth.LP
46*46439007SCharles.ForsythThere are two conventional programming interfaces to an XML
47*46439007SCharles.Forsythparser. A SAX parser provides a stream of XML entities, leaving
48*46439007SCharles.Forsythit up to the application to maintain the context. It is not possible
49*46439007SCharles.Forsythto rewind the stream, except, perhaps, to the beginning.
50*46439007SCharles.ForsythUsing a SAX parser is
51*46439007SCharles.Forsythfairly straightforward, but awkward: the stream-like nature
52*46439007SCharles.Forsythof the interface does not map well to the tree-like structure
53*46439007SCharles.Forsyththat is XML. A DOM parser reads a whole document into an internal
54*46439007SCharles.Forsythdata structure representation, so a program can treat it exactly
55*46439007SCharles.Forsythas a tree. This also enables a program to access parts of the
56*46439007SCharles.Forsythdocument in an arbitrary order.
57*46439007SCharles.ForsythThe DOM approach is all very well for small documents, but for large
58*46439007SCharles.Forsythdocuments the memory usage can rapidly grow to exceed
59*46439007SCharles.Forsyththe available memory capacity. For eBook documents, this is unacceptable.
60*46439007SCharles.Forsyth.SH
61*46439007SCharles.ForsythA different approach
62*46439007SCharles.Forsyth.LP
63*46439007SCharles.ForsythThe XML parser used in the eBook browser is akin to a SAX parser,
64*46439007SCharles.Forsythin that only a little of the XML structure is held in memory at one time.
65*46439007SCharles.ForsythThe first significant difference is that the XML entities returned are
66*46439007SCharles.Forsythtaken from one level of the tree - if the program does not wish to
67*46439007SCharles.Forsythsee the contents of a particular XML tag, it is trivial to skip over.
68*46439007SCharles.ForsythThe second significant difference is that random access is possible.
69*46439007SCharles.ForsythThis possibility comes from the observation that if we have visited
70*46439007SCharles.Forsytha part of the document we can record the context that we found there
71*46439007SCharles.Forsythand restore it later if necessary. In this scheme, if we wish to return later to
72*46439007SCharles.Forsytha part of a document that we are currently at, we can create a ``mark'',
73*46439007SCharles.Forsytha token that holds the current context; at some later time we can use
74*46439007SCharles.Forsyththat mark to return to this position.
75*46439007SCharles.Forsyth.LP
76*46439007SCharles.ForsythThe eBook browser uses this technique to enable random access
77*46439007SCharles.Forsythto the document on a page-by-page basis. Moreover a mark
78*46439007SCharles.Forsythcan be written to external storage, thus allowing an external
79*46439007SCharles.Forsyth``index'' into the document so it is not always necessary to
80*46439007SCharles.Forsythread the entire document from the start in order to jump to a particular
81*46439007SCharles.Forsythpage in that document.
82*46439007SCharles.Forsyth.SH
83*46439007SCharles.ForsythThe programming interface
84*46439007SCharles.Forsyth.LP
85*46439007SCharles.ForsythThe interface is implemented by a module named
86*46439007SCharles.Forsyth.CW Xml ,
87*46439007SCharles.Forsythwhich provides a
88*46439007SCharles.Forsyth.CW Parser
89*46439007SCharles.Forsythadt which gives access to the contents of an XML document.
90*46439007SCharles.ForsythXml items are represented by an
91*46439007SCharles.Forsyth.CW Item
92*46439007SCharles.Forsythpick adt with one branch of the pick corresponding to each
93*46439007SCharles.Forsythtype of item that might be encountered.
94*46439007SCharles.Forsyth.LP
95*46439007SCharles.ForsythThe interface to the parser looks like this:
96*46439007SCharles.Forsyth.P1
97*46439007SCharles.Forsythopen: fn(f: string, warning: chan of (Locator, string)): (ref Parser, string);
98*46439007SCharles.ForsythParser: adt {
99*46439007SCharles.Forsyth    next:       fn(p: self ref Parser): ref Item;
100*46439007SCharles.Forsyth    down:   fn(p: self ref Parser);
101*46439007SCharles.Forsyth    up:     fn(p: self ref Parser);
102*46439007SCharles.Forsyth    mark:   fn(p: self ref Parser): ref Mark;
103*46439007SCharles.Forsyth    atmark: fn(p: self ref Parser, m: ref Mark): int;
104*46439007SCharles.Forsyth    goto:   fn(p: self ref Parser, m: ref Mark);
105*46439007SCharles.Forsyth    str2mark:   fn(p: self ref Parser, s: string): ref Mark;
106*46439007SCharles.Forsyth};
107*46439007SCharles.Forsyth.P2
108*46439007SCharles.ForsythTo start parsing an XML document, it must first be
109*46439007SCharles.Forsyth.CW open ed;
110*46439007SCharles.Forsyth.CW warning
111*46439007SCharles.Forsythis a channel on which non-fatal error messages will be sent
112*46439007SCharles.Forsythif they are encountered during the parsing of the document.
113*46439007SCharles.ForsythIt can be nil, in which case warnings are ignored.
114*46439007SCharles.ForsythIf the document is opened successfully, a new
115*46439007SCharles.Forsyth.CW Parser
116*46439007SCharles.Forsythadt, say
117*46439007SCharles.Forsyth.I p ,
118*46439007SCharles.Forsythis returned.
119*46439007SCharles.ForsythCalling
120*46439007SCharles.Forsyth.CW \fIp\fP.next
121*46439007SCharles.Forsythreturns the next XML item at the current level of the tree. If there
122*46439007SCharles.Forsythare no more items in the current branch at the current level, it
123*46439007SCharles.Forsythreturns
124*46439007SCharles.Forsyth.CW nil .
125*46439007SCharles.ForsythWhen a
126*46439007SCharles.Forsyth.CW Tag
127*46439007SCharles.Forsythitem is returned,
128*46439007SCharles.Forsyth.CW \fIp\fP.down
129*46439007SCharles.Forsythcan be used to descend ``into'' that tag; subsequent calls of
130*46439007SCharles.Forsyth.CW \fIp\fP.next
131*46439007SCharles.Forsythwill return XML items contained within the tag,
132*46439007SCharles.Forsythand
133*46439007SCharles.Forsyth.CW \fIp\fP.up
134*46439007SCharles.Forsythreturns to the previous level.
135*46439007SCharles.Forsyth.LP
136*46439007SCharles.ForsythAn
137*46439007SCharles.Forsyth.CW Item
138*46439007SCharles.Forsythis a pick adt:
139*46439007SCharles.Forsyth.P1
140*46439007SCharles.ForsythItem: adt {
141*46439007SCharles.Forsyth    fileoffset: int;
142*46439007SCharles.Forsyth    pick {
143*46439007SCharles.Forsyth    Tag =>
144*46439007SCharles.Forsyth        name:   string;
145*46439007SCharles.Forsyth        attrs:      Attributes;
146*46439007SCharles.Forsyth    Text =>
147*46439007SCharles.Forsyth        ch:     string;
148*46439007SCharles.Forsyth        ws1, ws2: int;
149*46439007SCharles.Forsyth    Process =>
150*46439007SCharles.Forsyth        target: string;
151*46439007SCharles.Forsyth        data:       string;
152*46439007SCharles.Forsyth    Doctype =>
153*46439007SCharles.Forsyth        name:   string;
154*46439007SCharles.Forsyth        public: int;
155*46439007SCharles.Forsyth        params: list of string;
156*46439007SCharles.Forsyth    Stylesheet =>
157*46439007SCharles.Forsyth        attrs:      Attributes;
158*46439007SCharles.Forsyth    Error =>
159*46439007SCharles.Forsyth        loc:        Locator;
160*46439007SCharles.Forsyth        msg:        string;
161*46439007SCharles.Forsyth    }
162*46439007SCharles.Forsyth};
163*46439007SCharles.Forsyth.P2
164*46439007SCharles.Forsyth.CW Item.Tag
165*46439007SCharles.Forsythrepresents a XML tag, empty or not. The XML
166*46439007SCharles.Forsythfragments
167*46439007SCharles.Forsyth.CW "<tag></tag>" '' ``
168*46439007SCharles.Forsythand
169*46439007SCharles.Forsyth.CW "<tag />" '' ``
170*46439007SCharles.Forsythlook identical from the point of view of this interface.
171*46439007SCharles.ForsythA
172*46439007SCharles.Forsyth.CW Text
173*46439007SCharles.Forsythitem holds text found in between tags, with adjacent whitespaces merged
174*46439007SCharles.Forsythand whitespace at the beginning and end of the text elided.
175*46439007SCharles.Forsyth.CW Ws1
176*46439007SCharles.Forsythand
177*46439007SCharles.Forsyth.CW ws2
178*46439007SCharles.Forsythare non-zero if there was originally whitespace at the beginning
179*46439007SCharles.Forsythor end of the text respectively.
180*46439007SCharles.Forsyth.CW Process
181*46439007SCharles.Forsythrepresents an XML processing request, as found between
182*46439007SCharles.Forsyth.CW "<?....?>" '' ``
183*46439007SCharles.Forsythdelimiters.
184*46439007SCharles.Forsyth.CW Doctype
185*46439007SCharles.Forsythand
186*46439007SCharles.Forsyth.CW Stylesheet
187*46439007SCharles.Forsythare items found in an XML document's prolog, the
188*46439007SCharles.Forsythformer representing a
189*46439007SCharles.Forsyth.CW "<!DOCTYPE...>" '' ``
190*46439007SCharles.Forsythdocument type declaration, and the latter an XML
191*46439007SCharles.Forsythstylesheet processing request.
192*46439007SCharles.Forsyth.LP
193*46439007SCharles.ForsythWhen most applications are processing documents, they
194*46439007SCharles.Forsythwill wish to ignore all items other than
195*46439007SCharles.Forsyth.CW Tag
196*46439007SCharles.Forsythand
197*46439007SCharles.Forsyth.CW Text .
198*46439007SCharles.ForsythTo this end, it is conventional to define a ``front-end'' function
199*46439007SCharles.Forsythto return desired items, discard others, and take an appropriate
200*46439007SCharles.Forsythaction when an error is encountered. Here's an example:
201*46439007SCharles.Forsyth.P1
202*46439007SCharles.Forsythnextitem(p: ref Parser): ref Item
203*46439007SCharles.Forsyth{
204*46439007SCharles.Forsyth    while ((gi := p.next()) != nil) {
205*46439007SCharles.Forsyth        pick i := gi {
206*46439007SCharles.Forsyth        Error =>
207*46439007SCharles.Forsyth            sys->print("error at %s:%d: %s\n",
208*46439007SCharles.Forsyth                i.loc.systemid, i.loc.line, i.msg);
209*46439007SCharles.Forsyth            exit;
210*46439007SCharles.Forsyth        Process =>
211*46439007SCharles.Forsyth            ;   # ignore
212*46439007SCharles.Forsyth        Stylesheet  =>
213*46439007SCharles.Forsyth            ;   # ignore
214*46439007SCharles.Forsyth        Doctype =>
215*46439007SCharles.Forsyth            ;   # ignore
216*46439007SCharles.Forsyth        * =>
217*46439007SCharles.Forsyth            return gi;
218*46439007SCharles.Forsyth        }
219*46439007SCharles.Forsyth    }
220*46439007SCharles.Forsyth    return nil;
221*46439007SCharles.Forsyth}
222*46439007SCharles.Forsyth.P2
223*46439007SCharles.ForsythWhen
224*46439007SCharles.Forsyth.CW nextitem
225*46439007SCharles.Forsythencounters an error, it exits; it might instead handle the
226*46439007SCharles.Forsytherror another way, say by raising an exception to be caught at the
227*46439007SCharles.Forsythoutermost level of the parsing code.
228*46439007SCharles.Forsyth.SH
229*46439007SCharles.ForsythA small example
230*46439007SCharles.Forsyth.LP
231*46439007SCharles.ForsythSuppose we have an XML document that contains some data that we would
232*46439007SCharles.Forsythlike to extract, ignoring the rest of the document. For this example we will
233*46439007SCharles.Forsythassume that the data is held within
234*46439007SCharles.Forsyth.CW <data>
235*46439007SCharles.Forsythtags, which contain zero or more
236*46439007SCharles.Forsyth.CW <item>
237*46439007SCharles.Forsythtags, holding the actual data as text within them.
238*46439007SCharles.ForsythTags that we do not recognize we choose to ignore.
239*46439007SCharles.ForsythSo for example, given the following XML document:
240*46439007SCharles.Forsyth.P1
241*46439007SCharles.Forsyth<metadata>
242*46439007SCharles.Forsyth    <a>hello</a>
243*46439007SCharles.Forsyth    <b>goodbye</b>
244*46439007SCharles.Forsyth</metadata>
245*46439007SCharles.Forsyth<data>
246*46439007SCharles.Forsyth    <item>one</item>
247*46439007SCharles.Forsyth    <item>two</item>
248*46439007SCharles.Forsyth    <item>three</item>
249*46439007SCharles.Forsyth</data>
250*46439007SCharles.Forsyth<data>
251*46439007SCharles.Forsyth    <item>four</item>
252*46439007SCharles.Forsyth</data>
253*46439007SCharles.Forsyth.P2
254*46439007SCharles.Forsythwe wish to extract all the data items, but ignore everything inside
255*46439007SCharles.Forsyththe
256*46439007SCharles.Forsyth.CW <metadata>
257*46439007SCharles.Forsythtag. First, let us define another little convenience function to get
258*46439007SCharles.Forsyththe next XML tag, ignoring extraneous items:
259*46439007SCharles.Forsyth.P1
260*46439007SCharles.Forsythnexttag(p: ref Parser): ref Item.Tag
261*46439007SCharles.Forsyth{
262*46439007SCharles.Forsyth    while ((gi := nextitem(p)) != nil) {
263*46439007SCharles.Forsyth        pick i := gi {
264*46439007SCharles.Forsyth        Tag =>
265*46439007SCharles.Forsyth            return i;
266*46439007SCharles.Forsyth        }
267*46439007SCharles.Forsyth    }
268*46439007SCharles.Forsyth    return nil;
269*46439007SCharles.Forsyth}
270*46439007SCharles.Forsyth.P2
271*46439007SCharles.ForsythAssuming that the document has already been opened,
272*46439007SCharles.Forsyththe following function scans through the document, looking
273*46439007SCharles.Forsythfor top level
274*46439007SCharles.Forsyth.CW <data>
275*46439007SCharles.Forsythtags, and ignoring others:
276*46439007SCharles.Forsyth.P1
277*46439007SCharles.Forsythdocument(p: ref Parser)
278*46439007SCharles.Forsyth{
279*46439007SCharles.Forsyth    while ((i := nexttag(p)) != nil) {
280*46439007SCharles.Forsyth        if (i.name == "data") {
281*46439007SCharles.Forsyth            p.down();
282*46439007SCharles.Forsyth            data(p);
283*46439007SCharles.Forsyth            p.up();
284*46439007SCharles.Forsyth        }
285*46439007SCharles.Forsyth    }
286*46439007SCharles.Forsyth}
287*46439007SCharles.Forsyth.P2
288*46439007SCharles.ForsythThe function to parse a
289*46439007SCharles.Forsyth.CW <data>
290*46439007SCharles.Forsythtag is almost as straightforward; it scans for
291*46439007SCharles.Forsyth.CW <item>
292*46439007SCharles.Forsythtags and extracts any textual data contained therein:
293*46439007SCharles.Forsyth.P1
294*46439007SCharles.Forsythdata(p: ref Parser)
295*46439007SCharles.Forsyth{
296*46439007SCharles.Forsyth    while ((i := nexttag(p)) != nil) {
297*46439007SCharles.Forsyth        if (i.name == "item") {
298*46439007SCharles.Forsyth            p.down();
299*46439007SCharles.Forsyth            if ((gni := p.next()) != nil) {
300*46439007SCharles.Forsyth                pick ni := gni {
301*46439007SCharles.Forsyth                Text =>
302*46439007SCharles.Forsyth                    sys->print("item data: %s\n", ni.ch);
303*46439007SCharles.Forsyth                }
304*46439007SCharles.Forsyth            }
305*46439007SCharles.Forsyth            p.up();
306*46439007SCharles.Forsyth        }
307*46439007SCharles.Forsyth    }
308*46439007SCharles.Forsyth}
309*46439007SCharles.Forsyth.P2
310*46439007SCharles.ForsythThe above program is all very well and works fine, but
311*46439007SCharles.Forsythsuppose that the document that we are parsing is very
312*46439007SCharles.Forsythlarge, with data items scattered through its length, and that
313*46439007SCharles.Forsythwe wish to access those items in an order that is not necessarily
314*46439007SCharles.Forsyththat in which they appear in the document.
315*46439007SCharles.ForsythThis is quite straightforward; every time we see a
316*46439007SCharles.Forsythdata item, we record the current position with a mark.
317*46439007SCharles.ForsythAssuming the global declaration:
318*46439007SCharles.Forsyth.P1
319*46439007SCharles.Forsythmarks: list of ref Mark;
320*46439007SCharles.Forsyth.P2
321*46439007SCharles.Forsyththe
322*46439007SCharles.Forsyth.CW document
323*46439007SCharles.Forsythfunction might become:
324*46439007SCharles.Forsyth.P1
325*46439007SCharles.Forsythdocument(p: ref Parser)
326*46439007SCharles.Forsyth{
327*46439007SCharles.Forsyth    while ((i := nexttag(p)) != nil) {
328*46439007SCharles.Forsyth        if (i.name == "data") {
329*46439007SCharles.Forsyth            p.down();
330*46439007SCharles.Forsyth            marks = p.mark() :: marks;
331*46439007SCharles.Forsyth            p.up();
332*46439007SCharles.Forsyth        }
333*46439007SCharles.Forsyth    }
334*46439007SCharles.Forsyth}
335*46439007SCharles.Forsyth.P2
336*46439007SCharles.ForsythAt some later time, we can access the data items arbitrarily,
337*46439007SCharles.Forsythfor instance:
338*46439007SCharles.Forsyth.P1
339*46439007SCharles.Forsyth    for (m := marks; m != nil; m = tl m) {
340*46439007SCharles.Forsyth        p.goto(hd m);
341*46439007SCharles.Forsyth        data(p);
342*46439007SCharles.Forsyth    }
343*46439007SCharles.Forsyth.P2
344*46439007SCharles.ForsythIf we wish to store the data item marks in some external index
345*46439007SCharles.Forsyth(in a file, perhaps), the
346*46439007SCharles.Forsyth.CW Mark
347*46439007SCharles.Forsythadt provides a
348*46439007SCharles.Forsyth.CW str
349*46439007SCharles.Forsythfunction which returns a string representation of the mark.
350*46439007SCharles.Forsyth.CW Parser 's
351*46439007SCharles.Forsyth.CW str2mark
352*46439007SCharles.Forsythfunction can later be used to recover the mark. Care must
353*46439007SCharles.Forsythbe taken that the document it refers to has not been changed,
354*46439007SCharles.Forsythotherwise it is likely that the mark will be invalid.
355*46439007SCharles.Forsyth.SH
356*46439007SCharles.ForsythThe eBook implementation
357*46439007SCharles.Forsyth.LP
358*46439007SCharles.ForsythThe Open eBook reader software uses the primitives described above
359*46439007SCharles.Forsythto maintain display-page-based access to arbitrarily large documents
360*46439007SCharles.Forsythwhile trying to bound memory usage.
361*46439007SCharles.ForsythUnfortunately it is difficult to unconditionally bound memory usage,
362*46439007SCharles.Forsythgiven that any element in an XML document may be arbitrarily
363*46439007SCharles.Forsythlarge. For instance a perfectly legal document might have 100MB
364*46439007SCharles.Forsythof continuous text containing no tags whatsoever. The described
365*46439007SCharles.Forsythinterface would attempt to put all this text in one single item, rapidly
366*46439007SCharles.Forsythrunning out of memory! Similar types of problems can occur when
367*46439007SCharles.Forsythgathering the items necessary to format a particular tag.
368*46439007SCharles.ForsythFor instance, to format the first row of a table, it is necessary to lay out
369*46439007SCharles.Forsyththe entire table to determine the column widths.
370*46439007SCharles.Forsyth.LP
371*46439007SCharles.ForsythI chose to make the simplifying assumption that top-level items within
372*46439007SCharles.Forsyththe document would be small enough to fit into memory.
373*46439007SCharles.ForsythFrom the point of view of the display module, the document
374*46439007SCharles.Forsythlooks like a simple sequence of items, one after another.
375*46439007SCharles.ForsythOne item might cover more than one page, in which case a different
376*46439007SCharles.Forsythpart of it will be displayed on each of those pages.
377*46439007SCharles.Forsyth.LP
378*46439007SCharles.ForsythOne difficulty is that the displayed size of an item depends on many
379*46439007SCharles.Forsythfactors, such as stylesheet parameters, size of installed fonts, etc.
380*46439007SCharles.ForsythWhen a document is read, the page index must have been created
381*46439007SCharles.Forsythfrom the same document with the same parameters. It is difficult in
382*46439007SCharles.Forsythgeneral to enumerate all the relevant parameters; they would need
383*46439007SCharles.Forsythto be stored inside, or alongside the index; any change would invalidate
384*46439007SCharles.Forsyththe index. Instead of doing this, as the document is being displayed,
385*46439007SCharles.Forsyththe eBook display program constantly checks to see if the results
386*46439007SCharles.Forsythit is getting from the index match with the results it is getting
387*46439007SCharles.Forsythwhen actually laying out the document. If the results differ, the
388*46439007SCharles.Forsythindex is remade; the discrepancy will hopefully not be noticed by
389*46439007SCharles.Forsyththe user!
390