1.TL 2Navigating Large XML Documents on Small Devices 3.AU 4Roger Peppe 5.AI 6Vita Nuova 7.br 8April 2002 9.AB 10Browsing eBooks on platforms with limited memory presents an 11interesting problem: how can memory usage be bounded despite 12the need to view documents that may be much larger than the 13available memory. A simple interface to an XML parser enables 14this whilst retaining much of the ease of access afforded 15by XML parsers that read all of a document into memory at once. 16.AE 17.SH 18Introduction 19.LP 20The Open Ebook Publication Structure was devised by the Open Ebook Forum 21in order to ``provide a specification for representing the content of electronic 22books''. It is based on many existing standards, notably XML and HTML. 23An Open eBook publication consists of a set of documents bound together 24with an Open eBook package file which enumerates all the documents, 25pictures and other items that make up the book 26.LP 27The underlying document format is essentially HTML compatible, 28which is where the first problem arises: HTML was not designed to 29make it easy to view partial sections of a document. Conventionally 30an entire HTML document is read in at once and rendered onto 31the device. When viewing an eBook on a limited-memory device, 32however, this may not be possible; books tend to be fairly large. 33For such a device, the ideal format would keep the book itself 34in non-volatile storage (e.g. flash or disk) and make it possible 35for reader to seek to an arbitrary position in the book and render 36what it finds there. 37.LP 38This is not possible in an HTML or XML document, as the 39arbitrarily nested nature of the format means that every 40position in the document has some unknown surrounding context, 41which cannot be discovered without reading sequentially through 42the document from the beginning. 43.SH 44SAX and DOM 45.LP 46There are two conventional programming interfaces to an XML 47parser. A SAX parser provides a stream of XML entities, leaving 48it up to the application to maintain the context. It is not possible 49to rewind the stream, except, perhaps, to the beginning. 50Using a SAX parser is 51fairly straightforward, but awkward: the stream-like nature 52of the interface does not map well to the tree-like structure 53that is XML. A DOM parser reads a whole document into an internal 54data structure representation, so a program can treat it exactly 55as a tree. This also enables a program to access parts of the 56document in an arbitrary order. 57The DOM approach is all very well for small documents, but for large 58documents the memory usage can rapidly grow to exceed 59the available memory capacity. For eBook documents, this is unacceptable. 60.SH 61A different approach 62.LP 63The XML parser used in the eBook browser is akin to a SAX parser, 64in that only a little of the XML structure is held in memory at one time. 65The first significant difference is that the XML entities returned are 66taken from one level of the tree - if the program does not wish to 67see the contents of a particular XML tag, it is trivial to skip over. 68The second significant difference is that random access is possible. 69This possibility comes from the observation that if we have visited 70a part of the document we can record the context that we found there 71and restore it later if necessary. In this scheme, if we wish to return later to 72a part of a document that we are currently at, we can create a ``mark'', 73a token that holds the current context; at some later time we can use 74that mark to return to this position. 75.LP 76The eBook browser uses this technique to enable random access 77to the document on a page-by-page basis. Moreover a mark 78can be written to external storage, thus allowing an external 79``index'' into the document so it is not always necessary to 80read the entire document from the start in order to jump to a particular 81page in that document. 82.SH 83The programming interface 84.LP 85The interface is implemented by a module named 86.CW Xml , 87which provides a 88.CW Parser 89adt which gives access to the contents of an XML document. 90Xml items are represented by an 91.CW Item 92pick adt with one branch of the pick corresponding to each 93type of item that might be encountered. 94.LP 95The interface to the parser looks like this: 96.P1 97open: fn(f: string, warning: chan of (Locator, string)): (ref Parser, string); 98Parser: adt { 99 next: fn(p: self ref Parser): ref Item; 100 down: fn(p: self ref Parser); 101 up: fn(p: self ref Parser); 102 mark: fn(p: self ref Parser): ref Mark; 103 atmark: fn(p: self ref Parser, m: ref Mark): int; 104 goto: fn(p: self ref Parser, m: ref Mark); 105 str2mark: fn(p: self ref Parser, s: string): ref Mark; 106}; 107.P2 108To start parsing an XML document, it must first be 109.CW open ed; 110.CW warning 111is a channel on which non-fatal error messages will be sent 112if they are encountered during the parsing of the document. 113It can be nil, in which case warnings are ignored. 114If the document is opened successfully, a new 115.CW Parser 116adt, say 117.I p , 118is returned. 119Calling 120.CW \fIp\fP.next 121returns the next XML item at the current level of the tree. If there 122are no more items in the current branch at the current level, it 123returns 124.CW nil . 125When a 126.CW Tag 127item is returned, 128.CW \fIp\fP.down 129can be used to descend ``into'' that tag; subsequent calls of 130.CW \fIp\fP.next 131will return XML items contained within the tag, 132and 133.CW \fIp\fP.up 134returns to the previous level. 135.LP 136An 137.CW Item 138is a pick adt: 139.P1 140Item: adt { 141 fileoffset: int; 142 pick { 143 Tag => 144 name: string; 145 attrs: Attributes; 146 Text => 147 ch: string; 148 ws1, ws2: int; 149 Process => 150 target: string; 151 data: string; 152 Doctype => 153 name: string; 154 public: int; 155 params: list of string; 156 Stylesheet => 157 attrs: Attributes; 158 Error => 159 loc: Locator; 160 msg: string; 161 } 162}; 163.P2 164.CW Item.Tag 165represents a XML tag, empty or not. The XML 166fragments 167.CW "<tag></tag>" '' `` 168and 169.CW "<tag />" '' `` 170look identical from the point of view of this interface. 171A 172.CW Text 173item holds text found in between tags, with adjacent whitespaces merged 174and whitespace at the beginning and end of the text elided. 175.CW Ws1 176and 177.CW ws2 178are non-zero if there was originally whitespace at the beginning 179or end of the text respectively. 180.CW Process 181represents an XML processing request, as found between 182.CW "<?....?>" '' `` 183delimiters. 184.CW Doctype 185and 186.CW Stylesheet 187are items found in an XML document's prolog, the 188former representing a 189.CW "<!DOCTYPE...>" '' `` 190document type declaration, and the latter an XML 191stylesheet processing request. 192.LP 193When most applications are processing documents, they 194will wish to ignore all items other than 195.CW Tag 196and 197.CW Text . 198To this end, it is conventional to define a ``front-end'' function 199to return desired items, discard others, and take an appropriate 200action when an error is encountered. Here's an example: 201.P1 202nextitem(p: ref Parser): ref Item 203{ 204 while ((gi := p.next()) != nil) { 205 pick i := gi { 206 Error => 207 sys->print("error at %s:%d: %s\n", 208 i.loc.systemid, i.loc.line, i.msg); 209 exit; 210 Process => 211 ; # ignore 212 Stylesheet => 213 ; # ignore 214 Doctype => 215 ; # ignore 216 * => 217 return gi; 218 } 219 } 220 return nil; 221} 222.P2 223When 224.CW nextitem 225encounters an error, it exits; it might instead handle the 226error another way, say by raising an exception to be caught at the 227outermost level of the parsing code. 228.SH 229A small example 230.LP 231Suppose we have an XML document that contains some data that we would 232like to extract, ignoring the rest of the document. For this example we will 233assume that the data is held within 234.CW <data> 235tags, which contain zero or more 236.CW <item> 237tags, holding the actual data as text within them. 238Tags that we do not recognize we choose to ignore. 239So for example, given the following XML document: 240.P1 241<metadata> 242 <a>hello</a> 243 <b>goodbye</b> 244</metadata> 245<data> 246 <item>one</item> 247 <item>two</item> 248 <item>three</item> 249</data> 250<data> 251 <item>four</item> 252</data> 253.P2 254we wish to extract all the data items, but ignore everything inside 255the 256.CW <metadata> 257tag. First, let us define another little convenience function to get 258the next XML tag, ignoring extraneous items: 259.P1 260nexttag(p: ref Parser): ref Item.Tag 261{ 262 while ((gi := nextitem(p)) != nil) { 263 pick i := gi { 264 Tag => 265 return i; 266 } 267 } 268 return nil; 269} 270.P2 271Assuming that the document has already been opened, 272the following function scans through the document, looking 273for top level 274.CW <data> 275tags, and ignoring others: 276.P1 277document(p: ref Parser) 278{ 279 while ((i := nexttag(p)) != nil) { 280 if (i.name == "data") { 281 p.down(); 282 data(p); 283 p.up(); 284 } 285 } 286} 287.P2 288The function to parse a 289.CW <data> 290tag is almost as straightforward; it scans for 291.CW <item> 292tags and extracts any textual data contained therein: 293.P1 294data(p: ref Parser) 295{ 296 while ((i := nexttag(p)) != nil) { 297 if (i.name == "item") { 298 p.down(); 299 if ((gni := p.next()) != nil) { 300 pick ni := gni { 301 Text => 302 sys->print("item data: %s\n", ni.ch); 303 } 304 } 305 p.up(); 306 } 307 } 308} 309.P2 310The above program is all very well and works fine, but 311suppose that the document that we are parsing is very 312large, with data items scattered through its length, and that 313we wish to access those items in an order that is not necessarily 314that in which they appear in the document. 315This is quite straightforward; every time we see a 316data item, we record the current position with a mark. 317Assuming the global declaration: 318.P1 319marks: list of ref Mark; 320.P2 321the 322.CW document 323function might become: 324.P1 325document(p: ref Parser) 326{ 327 while ((i := nexttag(p)) != nil) { 328 if (i.name == "data") { 329 p.down(); 330 marks = p.mark() :: marks; 331 p.up(); 332 } 333 } 334} 335.P2 336At some later time, we can access the data items arbitrarily, 337for instance: 338.P1 339 for (m := marks; m != nil; m = tl m) { 340 p.goto(hd m); 341 data(p); 342 } 343.P2 344If we wish to store the data item marks in some external index 345(in a file, perhaps), the 346.CW Mark 347adt provides a 348.CW str 349function which returns a string representation of the mark. 350.CW Parser 's 351.CW str2mark 352function can later be used to recover the mark. Care must 353be taken that the document it refers to has not been changed, 354otherwise it is likely that the mark will be invalid. 355.SH 356The eBook implementation 357.LP 358The Open eBook reader software uses the primitives described above 359to maintain display-page-based access to arbitrarily large documents 360while trying to bound memory usage. 361Unfortunately it is difficult to unconditionally bound memory usage, 362given that any element in an XML document may be arbitrarily 363large. For instance a perfectly legal document might have 100MB 364of continuous text containing no tags whatsoever. The described 365interface would attempt to put all this text in one single item, rapidly 366running out of memory! Similar types of problems can occur when 367gathering the items necessary to format a particular tag. 368For instance, to format the first row of a table, it is necessary to lay out 369the entire table to determine the column widths. 370.LP 371I chose to make the simplifying assumption that top-level items within 372the document would be small enough to fit into memory. 373From the point of view of the display module, the document 374looks like a simple sequence of items, one after another. 375One item might cover more than one page, in which case a different 376part of it will be displayed on each of those pages. 377.LP 378One difficulty is that the displayed size of an item depends on many 379factors, such as stylesheet parameters, size of installed fonts, etc. 380When a document is read, the page index must have been created 381from the same document with the same parameters. It is difficult in 382general to enumerate all the relevant parameters; they would need 383to be stored inside, or alongside the index; any change would invalidate 384the index. Instead of doing this, as the document is being displayed, 385the eBook display program constantly checks to see if the results 386it is getting from the index match with the results it is getting 387when actually laying out the document. If the results differ, the 388index is remade; the discrepancy will hopefully not be noticed by 389the user! 390