xref: /plan9/sys/man/1/doc2txt (revision 8a5942f0d176a59d1ea46380ae95879ebf31a9f6)
DOC2TXT 1
NAME
doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables - extract printable text from Microsoft documents
SYNOPSIS
doc2txt [ file.doc ]

doc2ps [ file.doc ]

wdoc2txt [ file.doc ]

xls2txt [ file.xls ]

aux/olefs [ -m mtpt ] file.doc

aux/mswordstrings mtpt/WordDocument

aux/msexceltables [ -qaDnt ] [ -d delim ] [ -c column-range ] [ -w worksheet-range ] mtpt/Workbook

DESCRIPTION
Doc2txt is an rc(1) script that uses olefs and mswordstrings to extract the printable text from the body of a Microsoft Word document and write it on the standard output. Doc2ps is similar, but emits PostScript corresponding to the document. Wdoc2txt is similar to doc2txt, but uses plumb(1) to send the output to a new acme(1) window instead. Xls2txt performs a similar function for Microsoft Excel documents.

Microsoft Office documents are stored in OLE (Object Linking and Embedding) format, which is a scaled-down version of Microsoft's FAT file system. Olefs presents the contents of an MS Office document as a file system on mtpt, which defaults to /mnt/doc. Mswordstrings or msexceltables may then be used to parse the files inside, extracting a text stream. Msexceltables may be given options to control the formatting of its output.

-a Attempt conversion of non-tabular sheets in the workbook (charts).

-d delim Sets the inter-field delimiter to the string delim, by default a single space.

-D Enables debugging output.

-c range Range is a comma-separated list of column numbers and ranges; the two ends of a range are separated by a dash, as in 3-5. Processing is limited to the columns named; by default all columns are output.

-n Disables field padding to column width.

-q Disable quoting of textual fields (see quote(2)).

-t Truncate fields to the column width.

-w range Range is a comma-separated list of worksheet numbers and ranges, in the same syntax as the -c option above; output is limited to the sheets named. Suppressed chart pages are always included in the sheet count.
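The range syntax accepted by -c and -w can be illustrated with a short Python sketch. This is not the parser used by msexceltables; the function name parse_ranges is hypothetical, and the sketch assumes that ranges are inclusive at both ends and that malformed or descending ranges are errors, neither of which this page specifies.

```python
def parse_ranges(spec):
    """Expand a spec such as "1,7,9-14" into a sorted list of integers.

    Assumptions (not taken from msexceltables itself): both ends of a
    range are inclusive, and empty items or descending ranges are errors.
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError("empty item in range list: %r" % spec)
        if "-" in part:
            # A range: two numbers joined by a dash.
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
            if lo > hi:
                raise ValueError("descending range: %r" % part)
            selected.update(range(lo, hi + 1))
        else:
            # A single column or worksheet number.
            selected.add(int(part))
    return sorted(selected)


if __name__ == "__main__":
    print(parse_ranges("1,7,9-14"))
    print(parse_ranges("3-5"))
```

Under these assumptions, 1,7,9-14 names eight worksheets and 3-5 names three columns.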

EXAMPLE
Extract pieces of an MS Excel spreadsheet.

    aux/olefs report.xls
    aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
    unmount /mnt/doc
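With quoting disabled by -q and padding disabled by -n, the fields in rpt.txt can be separated again by splitting on the delimiter given to -d. The Python sketch below shows one way to do this. It assumes that each row of a sheet is written as one line of text, which this page does not state; the function name split_rows and the sample text are hypothetical and are not real msexceltables output.

```python
def split_rows(text, delim="@"):
    """Split delimiter-separated text into a list of rows of fields.

    Assumes one row per line and no quoting or padding, as would be
    expected from the -q and -n options; blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        if line == "":
            continue
        rows.append(line.split(delim))
    return rows


if __name__ == "__main__":
    # Made-up sample standing in for the contents of rpt.txt.
    sample = "north@120@95\nsouth@134@101\n"
    for fields in split_rows(sample):
        print(fields)
```

To process a real file, the text would be read first, for example with open("rpt.txt").read(), and passed to split_rows. A delimiter that cannot occur in the data, such as the '@' chosen above, keeps the split unambiguous.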

SOURCE
/rc/bin doc2txt, doc2ps, wdoc2txt, and xls2txt

/sys/src/cmd/aux the others

SEE ALSO
strings(1)

``Microsoft Word 97 Binary File Format'', at Microsoft's developer (MSDN) home page.

``LAOLA Binary Structures'', http://user.cs.tu-berlin.de/~schwartz/pmh

``OpenOffice.Org's Excel Documentation'', http://sc.openoffice.org/excelfileformat.pdf