1*46439007SCharles.Forsyth.TL 2*46439007SCharles.ForsythNavigating Large XML Documents on Small Devices 3*46439007SCharles.Forsyth.AU 4*46439007SCharles.ForsythRoger Peppe 5*46439007SCharles.Forsyth.AI 6*46439007SCharles.ForsythVita Nuova 7*46439007SCharles.Forsyth.br 8*46439007SCharles.ForsythApril 2002 9*46439007SCharles.Forsyth.AB 10*46439007SCharles.ForsythBrowsing eBooks on platforms with limited memory presents an 11*46439007SCharles.Forsythinteresting problem: how can memory usage be bounded despite 12*46439007SCharles.Forsyththe need to view documents that may be much larger than the 13*46439007SCharles.Forsythavailable memory. A simple interface to an XML parser enables 14*46439007SCharles.Forsyththis whilst retaining much of the ease of access afforded 15*46439007SCharles.Forsythby XML parsers that read all of a document into memory at once. 16*46439007SCharles.Forsyth.AE 17*46439007SCharles.Forsyth.SH 18*46439007SCharles.ForsythIntroduction 19*46439007SCharles.Forsyth.LP 20*46439007SCharles.ForsythThe Open Ebook Publication Structure was devised by the Open Ebook Forum 21*46439007SCharles.Forsythin order to ``provide a specification for representing the content of electronic 22*46439007SCharles.Forsythbooks''. It is based on many existing standards, notably XML and HTML. 23*46439007SCharles.ForsythAn Open eBook publication consists of a set of documents bound together 24*46439007SCharles.Forsythwith an Open eBook package file which enumerates all the documents, 25*46439007SCharles.Forsythpictures and other items that make up the book 26*46439007SCharles.Forsyth.LP 27*46439007SCharles.ForsythThe underlying document format is essentially HTML compatible, 28*46439007SCharles.Forsythwhich is where the first problem arises: HTML was not designed to 29*46439007SCharles.Forsythmake it easy to view partial sections of a document. Conventionally 30*46439007SCharles.Forsythan entire HTML document is read in at once and rendered onto 31*46439007SCharles.Forsyththe device. When viewing an eBook on a limited-memory device, 32*46439007SCharles.Forsythhowever, this may not be possible; books tend to be fairly large. 33*46439007SCharles.ForsythFor such a device, the ideal format would keep the book itself 34*46439007SCharles.Forsythin non-volatile storage (e.g. flash or disk) and make it possible 35*46439007SCharles.Forsythfor reader to seek to an arbitrary position in the book and render 36*46439007SCharles.Forsythwhat it finds there. 37*46439007SCharles.Forsyth.LP 38*46439007SCharles.ForsythThis is not possible in an HTML or XML document, as the 39*46439007SCharles.Forsytharbitrarily nested nature of the format means that every 40*46439007SCharles.Forsythposition in the document has some unknown surrounding context, 41*46439007SCharles.Forsythwhich cannot be discovered without reading sequentially through 42*46439007SCharles.Forsyththe document from the beginning. 43*46439007SCharles.Forsyth.SH 44*46439007SCharles.ForsythSAX and DOM 45*46439007SCharles.Forsyth.LP 46*46439007SCharles.ForsythThere are two conventional programming interfaces to an XML 47*46439007SCharles.Forsythparser. A SAX parser provides a stream of XML entities, leaving 48*46439007SCharles.Forsythit up to the application to maintain the context. It is not possible 49*46439007SCharles.Forsythto rewind the stream, except, perhaps, to the beginning. 50*46439007SCharles.ForsythUsing a SAX parser is 51*46439007SCharles.Forsythfairly straightforward, but awkward: the stream-like nature 52*46439007SCharles.Forsythof the interface does not map well to the tree-like structure 53*46439007SCharles.Forsyththat is XML. A DOM parser reads a whole document into an internal 54*46439007SCharles.Forsythdata structure representation, so a program can treat it exactly 55*46439007SCharles.Forsythas a tree. This also enables a program to access parts of the 56*46439007SCharles.Forsythdocument in an arbitrary order. 57*46439007SCharles.ForsythThe DOM approach is all very well for small documents, but for large 58*46439007SCharles.Forsythdocuments the memory usage can rapidly grow to exceed 59*46439007SCharles.Forsyththe available memory capacity. For eBook documents, this is unacceptable. 60*46439007SCharles.Forsyth.SH 61*46439007SCharles.ForsythA different approach 62*46439007SCharles.Forsyth.LP 63*46439007SCharles.ForsythThe XML parser used in the eBook browser is akin to a SAX parser, 64*46439007SCharles.Forsythin that only a little of the XML structure is held in memory at one time. 65*46439007SCharles.ForsythThe first significant difference is that the XML entities returned are 66*46439007SCharles.Forsythtaken from one level of the tree - if the program does not wish to 67*46439007SCharles.Forsythsee the contents of a particular XML tag, it is trivial to skip over. 68*46439007SCharles.ForsythThe second significant difference is that random access is possible. 69*46439007SCharles.ForsythThis possibility comes from the observation that if we have visited 70*46439007SCharles.Forsytha part of the document we can record the context that we found there 71*46439007SCharles.Forsythand restore it later if necessary. In this scheme, if we wish to return later to 72*46439007SCharles.Forsytha part of a document that we are currently at, we can create a ``mark'', 73*46439007SCharles.Forsytha token that holds the current context; at some later time we can use 74*46439007SCharles.Forsyththat mark to return to this position. 75*46439007SCharles.Forsyth.LP 76*46439007SCharles.ForsythThe eBook browser uses this technique to enable random access 77*46439007SCharles.Forsythto the document on a page-by-page basis. Moreover a mark 78*46439007SCharles.Forsythcan be written to external storage, thus allowing an external 79*46439007SCharles.Forsyth``index'' into the document so it is not always necessary to 80*46439007SCharles.Forsythread the entire document from the start in order to jump to a particular 81*46439007SCharles.Forsythpage in that document. 82*46439007SCharles.Forsyth.SH 83*46439007SCharles.ForsythThe programming interface 84*46439007SCharles.Forsyth.LP 85*46439007SCharles.ForsythThe interface is implemented by a module named 86*46439007SCharles.Forsyth.CW Xml , 87*46439007SCharles.Forsythwhich provides a 88*46439007SCharles.Forsyth.CW Parser 89*46439007SCharles.Forsythadt which gives access to the contents of an XML document. 90*46439007SCharles.ForsythXml items are represented by an 91*46439007SCharles.Forsyth.CW Item 92*46439007SCharles.Forsythpick adt with one branch of the pick corresponding to each 93*46439007SCharles.Forsythtype of item that might be encountered. 94*46439007SCharles.Forsyth.LP 95*46439007SCharles.ForsythThe interface to the parser looks like this: 96*46439007SCharles.Forsyth.P1 97*46439007SCharles.Forsythopen: fn(f: string, warning: chan of (Locator, string)): (ref Parser, string); 98*46439007SCharles.ForsythParser: adt { 99*46439007SCharles.Forsyth next: fn(p: self ref Parser): ref Item; 100*46439007SCharles.Forsyth down: fn(p: self ref Parser); 101*46439007SCharles.Forsyth up: fn(p: self ref Parser); 102*46439007SCharles.Forsyth mark: fn(p: self ref Parser): ref Mark; 103*46439007SCharles.Forsyth atmark: fn(p: self ref Parser, m: ref Mark): int; 104*46439007SCharles.Forsyth goto: fn(p: self ref Parser, m: ref Mark); 105*46439007SCharles.Forsyth str2mark: fn(p: self ref Parser, s: string): ref Mark; 106*46439007SCharles.Forsyth}; 107*46439007SCharles.Forsyth.P2 108*46439007SCharles.ForsythTo start parsing an XML document, it must first be 109*46439007SCharles.Forsyth.CW open ed; 110*46439007SCharles.Forsyth.CW warning 111*46439007SCharles.Forsythis a channel on which non-fatal error messages will be sent 112*46439007SCharles.Forsythif they are encountered during the parsing of the document. 113*46439007SCharles.ForsythIt can be nil, in which case warnings are ignored. 114*46439007SCharles.ForsythIf the document is opened successfully, a new 115*46439007SCharles.Forsyth.CW Parser 116*46439007SCharles.Forsythadt, say 117*46439007SCharles.Forsyth.I p , 118*46439007SCharles.Forsythis returned. 119*46439007SCharles.ForsythCalling 120*46439007SCharles.Forsyth.CW \fIp\fP.next 121*46439007SCharles.Forsythreturns the next XML item at the current level of the tree. If there 122*46439007SCharles.Forsythare no more items in the current branch at the current level, it 123*46439007SCharles.Forsythreturns 124*46439007SCharles.Forsyth.CW nil . 125*46439007SCharles.ForsythWhen a 126*46439007SCharles.Forsyth.CW Tag 127*46439007SCharles.Forsythitem is returned, 128*46439007SCharles.Forsyth.CW \fIp\fP.down 129*46439007SCharles.Forsythcan be used to descend ``into'' that tag; subsequent calls of 130*46439007SCharles.Forsyth.CW \fIp\fP.next 131*46439007SCharles.Forsythwill return XML items contained within the tag, 132*46439007SCharles.Forsythand 133*46439007SCharles.Forsyth.CW \fIp\fP.up 134*46439007SCharles.Forsythreturns to the previous level. 135*46439007SCharles.Forsyth.LP 136*46439007SCharles.ForsythAn 137*46439007SCharles.Forsyth.CW Item 138*46439007SCharles.Forsythis a pick adt: 139*46439007SCharles.Forsyth.P1 140*46439007SCharles.ForsythItem: adt { 141*46439007SCharles.Forsyth fileoffset: int; 142*46439007SCharles.Forsyth pick { 143*46439007SCharles.Forsyth Tag => 144*46439007SCharles.Forsyth name: string; 145*46439007SCharles.Forsyth attrs: Attributes; 146*46439007SCharles.Forsyth Text => 147*46439007SCharles.Forsyth ch: string; 148*46439007SCharles.Forsyth ws1, ws2: int; 149*46439007SCharles.Forsyth Process => 150*46439007SCharles.Forsyth target: string; 151*46439007SCharles.Forsyth data: string; 152*46439007SCharles.Forsyth Doctype => 153*46439007SCharles.Forsyth name: string; 154*46439007SCharles.Forsyth public: int; 155*46439007SCharles.Forsyth params: list of string; 156*46439007SCharles.Forsyth Stylesheet => 157*46439007SCharles.Forsyth attrs: Attributes; 158*46439007SCharles.Forsyth Error => 159*46439007SCharles.Forsyth loc: Locator; 160*46439007SCharles.Forsyth msg: string; 161*46439007SCharles.Forsyth } 162*46439007SCharles.Forsyth}; 163*46439007SCharles.Forsyth.P2 164*46439007SCharles.Forsyth.CW Item.Tag 165*46439007SCharles.Forsythrepresents a XML tag, empty or not. The XML 166*46439007SCharles.Forsythfragments 167*46439007SCharles.Forsyth.CW "<tag></tag>" '' `` 168*46439007SCharles.Forsythand 169*46439007SCharles.Forsyth.CW "<tag />" '' `` 170*46439007SCharles.Forsythlook identical from the point of view of this interface. 171*46439007SCharles.ForsythA 172*46439007SCharles.Forsyth.CW Text 173*46439007SCharles.Forsythitem holds text found in between tags, with adjacent whitespaces merged 174*46439007SCharles.Forsythand whitespace at the beginning and end of the text elided. 175*46439007SCharles.Forsyth.CW Ws1 176*46439007SCharles.Forsythand 177*46439007SCharles.Forsyth.CW ws2 178*46439007SCharles.Forsythare non-zero if there was originally whitespace at the beginning 179*46439007SCharles.Forsythor end of the text respectively. 180*46439007SCharles.Forsyth.CW Process 181*46439007SCharles.Forsythrepresents an XML processing request, as found between 182*46439007SCharles.Forsyth.CW "<?....?>" '' `` 183*46439007SCharles.Forsythdelimiters. 184*46439007SCharles.Forsyth.CW Doctype 185*46439007SCharles.Forsythand 186*46439007SCharles.Forsyth.CW Stylesheet 187*46439007SCharles.Forsythare items found in an XML document's prolog, the 188*46439007SCharles.Forsythformer representing a 189*46439007SCharles.Forsyth.CW "<!DOCTYPE...>" '' `` 190*46439007SCharles.Forsythdocument type declaration, and the latter an XML 191*46439007SCharles.Forsythstylesheet processing request. 192*46439007SCharles.Forsyth.LP 193*46439007SCharles.ForsythWhen most applications are processing documents, they 194*46439007SCharles.Forsythwill wish to ignore all items other than 195*46439007SCharles.Forsyth.CW Tag 196*46439007SCharles.Forsythand 197*46439007SCharles.Forsyth.CW Text . 198*46439007SCharles.ForsythTo this end, it is conventional to define a ``front-end'' function 199*46439007SCharles.Forsythto return desired items, discard others, and take an appropriate 200*46439007SCharles.Forsythaction when an error is encountered. Here's an example: 201*46439007SCharles.Forsyth.P1 202*46439007SCharles.Forsythnextitem(p: ref Parser): ref Item 203*46439007SCharles.Forsyth{ 204*46439007SCharles.Forsyth while ((gi := p.next()) != nil) { 205*46439007SCharles.Forsyth pick i := gi { 206*46439007SCharles.Forsyth Error => 207*46439007SCharles.Forsyth sys->print("error at %s:%d: %s\n", 208*46439007SCharles.Forsyth i.loc.systemid, i.loc.line, i.msg); 209*46439007SCharles.Forsyth exit; 210*46439007SCharles.Forsyth Process => 211*46439007SCharles.Forsyth ; # ignore 212*46439007SCharles.Forsyth Stylesheet => 213*46439007SCharles.Forsyth ; # ignore 214*46439007SCharles.Forsyth Doctype => 215*46439007SCharles.Forsyth ; # ignore 216*46439007SCharles.Forsyth * => 217*46439007SCharles.Forsyth return gi; 218*46439007SCharles.Forsyth } 219*46439007SCharles.Forsyth } 220*46439007SCharles.Forsyth return nil; 221*46439007SCharles.Forsyth} 222*46439007SCharles.Forsyth.P2 223*46439007SCharles.ForsythWhen 224*46439007SCharles.Forsyth.CW nextitem 225*46439007SCharles.Forsythencounters an error, it exits; it might instead handle the 226*46439007SCharles.Forsytherror another way, say by raising an exception to be caught at the 227*46439007SCharles.Forsythoutermost level of the parsing code. 228*46439007SCharles.Forsyth.SH 229*46439007SCharles.ForsythA small example 230*46439007SCharles.Forsyth.LP 231*46439007SCharles.ForsythSuppose we have an XML document that contains some data that we would 232*46439007SCharles.Forsythlike to extract, ignoring the rest of the document. For this example we will 233*46439007SCharles.Forsythassume that the data is held within 234*46439007SCharles.Forsyth.CW <data> 235*46439007SCharles.Forsythtags, which contain zero or more 236*46439007SCharles.Forsyth.CW <item> 237*46439007SCharles.Forsythtags, holding the actual data as text within them. 238*46439007SCharles.ForsythTags that we do not recognize we choose to ignore. 239*46439007SCharles.ForsythSo for example, given the following XML document: 240*46439007SCharles.Forsyth.P1 241*46439007SCharles.Forsyth<metadata> 242*46439007SCharles.Forsyth <a>hello</a> 243*46439007SCharles.Forsyth <b>goodbye</b> 244*46439007SCharles.Forsyth</metadata> 245*46439007SCharles.Forsyth<data> 246*46439007SCharles.Forsyth <item>one</item> 247*46439007SCharles.Forsyth <item>two</item> 248*46439007SCharles.Forsyth <item>three</item> 249*46439007SCharles.Forsyth</data> 250*46439007SCharles.Forsyth<data> 251*46439007SCharles.Forsyth <item>four</item> 252*46439007SCharles.Forsyth</data> 253*46439007SCharles.Forsyth.P2 254*46439007SCharles.Forsythwe wish to extract all the data items, but ignore everything inside 255*46439007SCharles.Forsyththe 256*46439007SCharles.Forsyth.CW <metadata> 257*46439007SCharles.Forsythtag. First, let us define another little convenience function to get 258*46439007SCharles.Forsyththe next XML tag, ignoring extraneous items: 259*46439007SCharles.Forsyth.P1 260*46439007SCharles.Forsythnexttag(p: ref Parser): ref Item.Tag 261*46439007SCharles.Forsyth{ 262*46439007SCharles.Forsyth while ((gi := nextitem(p)) != nil) { 263*46439007SCharles.Forsyth pick i := gi { 264*46439007SCharles.Forsyth Tag => 265*46439007SCharles.Forsyth return i; 266*46439007SCharles.Forsyth } 267*46439007SCharles.Forsyth } 268*46439007SCharles.Forsyth return nil; 269*46439007SCharles.Forsyth} 270*46439007SCharles.Forsyth.P2 271*46439007SCharles.ForsythAssuming that the document has already been opened, 272*46439007SCharles.Forsyththe following function scans through the document, looking 273*46439007SCharles.Forsythfor top level 274*46439007SCharles.Forsyth.CW <data> 275*46439007SCharles.Forsythtags, and ignoring others: 276*46439007SCharles.Forsyth.P1 277*46439007SCharles.Forsythdocument(p: ref Parser) 278*46439007SCharles.Forsyth{ 279*46439007SCharles.Forsyth while ((i := nexttag(p)) != nil) { 280*46439007SCharles.Forsyth if (i.name == "data") { 281*46439007SCharles.Forsyth p.down(); 282*46439007SCharles.Forsyth data(p); 283*46439007SCharles.Forsyth p.up(); 284*46439007SCharles.Forsyth } 285*46439007SCharles.Forsyth } 286*46439007SCharles.Forsyth} 287*46439007SCharles.Forsyth.P2 288*46439007SCharles.ForsythThe function to parse a 289*46439007SCharles.Forsyth.CW <data> 290*46439007SCharles.Forsythtag is almost as straightforward; it scans for 291*46439007SCharles.Forsyth.CW <item> 292*46439007SCharles.Forsythtags and extracts any textual data contained therein: 293*46439007SCharles.Forsyth.P1 294*46439007SCharles.Forsythdata(p: ref Parser) 295*46439007SCharles.Forsyth{ 296*46439007SCharles.Forsyth while ((i := nexttag(p)) != nil) { 297*46439007SCharles.Forsyth if (i.name == "item") { 298*46439007SCharles.Forsyth p.down(); 299*46439007SCharles.Forsyth if ((gni := p.next()) != nil) { 300*46439007SCharles.Forsyth pick ni := gni { 301*46439007SCharles.Forsyth Text => 302*46439007SCharles.Forsyth sys->print("item data: %s\n", ni.ch); 303*46439007SCharles.Forsyth } 304*46439007SCharles.Forsyth } 305*46439007SCharles.Forsyth p.up(); 306*46439007SCharles.Forsyth } 307*46439007SCharles.Forsyth } 308*46439007SCharles.Forsyth} 309*46439007SCharles.Forsyth.P2 310*46439007SCharles.ForsythThe above program is all very well and works fine, but 311*46439007SCharles.Forsythsuppose that the document that we are parsing is very 312*46439007SCharles.Forsythlarge, with data items scattered through its length, and that 313*46439007SCharles.Forsythwe wish to access those items in an order that is not necessarily 314*46439007SCharles.Forsyththat in which they appear in the document. 315*46439007SCharles.ForsythThis is quite straightforward; every time we see a 316*46439007SCharles.Forsythdata item, we record the current position with a mark. 317*46439007SCharles.ForsythAssuming the global declaration: 318*46439007SCharles.Forsyth.P1 319*46439007SCharles.Forsythmarks: list of ref Mark; 320*46439007SCharles.Forsyth.P2 321*46439007SCharles.Forsyththe 322*46439007SCharles.Forsyth.CW document 323*46439007SCharles.Forsythfunction might become: 324*46439007SCharles.Forsyth.P1 325*46439007SCharles.Forsythdocument(p: ref Parser) 326*46439007SCharles.Forsyth{ 327*46439007SCharles.Forsyth while ((i := nexttag(p)) != nil) { 328*46439007SCharles.Forsyth if (i.name == "data") { 329*46439007SCharles.Forsyth p.down(); 330*46439007SCharles.Forsyth marks = p.mark() :: marks; 331*46439007SCharles.Forsyth p.up(); 332*46439007SCharles.Forsyth } 333*46439007SCharles.Forsyth } 334*46439007SCharles.Forsyth} 335*46439007SCharles.Forsyth.P2 336*46439007SCharles.ForsythAt some later time, we can access the data items arbitrarily, 337*46439007SCharles.Forsythfor instance: 338*46439007SCharles.Forsyth.P1 339*46439007SCharles.Forsyth for (m := marks; m != nil; m = tl m) { 340*46439007SCharles.Forsyth p.goto(hd m); 341*46439007SCharles.Forsyth data(p); 342*46439007SCharles.Forsyth } 343*46439007SCharles.Forsyth.P2 344*46439007SCharles.ForsythIf we wish to store the data item marks in some external index 345*46439007SCharles.Forsyth(in a file, perhaps), the 346*46439007SCharles.Forsyth.CW Mark 347*46439007SCharles.Forsythadt provides a 348*46439007SCharles.Forsyth.CW str 349*46439007SCharles.Forsythfunction which returns a string representation of the mark. 350*46439007SCharles.Forsyth.CW Parser 's 351*46439007SCharles.Forsyth.CW str2mark 352*46439007SCharles.Forsythfunction can later be used to recover the mark. Care must 353*46439007SCharles.Forsythbe taken that the document it refers to has not been changed, 354*46439007SCharles.Forsythotherwise it is likely that the mark will be invalid. 355*46439007SCharles.Forsyth.SH 356*46439007SCharles.ForsythThe eBook implementation 357*46439007SCharles.Forsyth.LP 358*46439007SCharles.ForsythThe Open eBook reader software uses the primitives described above 359*46439007SCharles.Forsythto maintain display-page-based access to arbitrarily large documents 360*46439007SCharles.Forsythwhile trying to bound memory usage. 361*46439007SCharles.ForsythUnfortunately it is difficult to unconditionally bound memory usage, 362*46439007SCharles.Forsythgiven that any element in an XML document may be arbitrarily 363*46439007SCharles.Forsythlarge. For instance a perfectly legal document might have 100MB 364*46439007SCharles.Forsythof continuous text containing no tags whatsoever. The described 365*46439007SCharles.Forsythinterface would attempt to put all this text in one single item, rapidly 366*46439007SCharles.Forsythrunning out of memory! Similar types of problems can occur when 367*46439007SCharles.Forsythgathering the items necessary to format a particular tag. 368*46439007SCharles.ForsythFor instance, to format the first row of a table, it is necessary to lay out 369*46439007SCharles.Forsyththe entire table to determine the column widths. 370*46439007SCharles.Forsyth.LP 371*46439007SCharles.ForsythI chose to make the simplifying assumption that top-level items within 372*46439007SCharles.Forsyththe document would be small enough to fit into memory. 373*46439007SCharles.ForsythFrom the point of view of the display module, the document 374*46439007SCharles.Forsythlooks like a simple sequence of items, one after another. 375*46439007SCharles.ForsythOne item might cover more than one page, in which case a different 376*46439007SCharles.Forsythpart of it will be displayed on each of those pages. 377*46439007SCharles.Forsyth.LP 378*46439007SCharles.ForsythOne difficulty is that the displayed size of an item depends on many 379*46439007SCharles.Forsythfactors, such as stylesheet parameters, size of installed fonts, etc. 380*46439007SCharles.ForsythWhen a document is read, the page index must have been created 381*46439007SCharles.Forsythfrom the same document with the same parameters. It is difficult in 382*46439007SCharles.Forsythgeneral to enumerate all the relevant parameters; they would need 383*46439007SCharles.Forsythto be stored inside, or alongside the index; any change would invalidate 384*46439007SCharles.Forsyththe index. Instead of doing this, as the document is being displayed, 385*46439007SCharles.Forsyththe eBook display program constantly checks to see if the results 386*46439007SCharles.Forsythit is getting from the index match with the results it is getting 387*46439007SCharles.Forsythwhen actually laying out the document. If the results differ, the 388*46439007SCharles.Forsythindex is remade; the discrepancy will hopefully not be noticed by 389*46439007SCharles.Forsyththe user! 390