
     WEBGRAB 1
 NAME
webgrab - fetch web page content as files
 SYNOPSIS
 webgrab [ -r ] [ -v ] [ -o stem ] [ -p body ] url
 DESCRIPTION
Webgrab connects to the web server named in the url. It fetches the content of the web page identified by the url, and stores it locally in a file.
If the page is written in HTML, webgrab reads it to build a list of sub-component pages (eg, frames) and images.
It fetches those, saving the content in separate files.
It adds a comment to the end of each HTML file giving the time and the file's origin.
It automatically follows redirections offered by the server.
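
The scan for sub-components described above can be sketched roughly as follows. This is only an illustration in Python, not the code in webgrab.b: the class and function names are hypothetical, and the set of tags examined (frame, iframe, img) is an assumption, since this page says only `frames' and `images'.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubcomponentLister(HTMLParser):
    """Collect the URLs of frames and images referenced by an HTML page."""

    # Assumed tag set; the manual page only mentions frames and images.
    TAGS = {"frame", "iframe", "img"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.TAGS:
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative references against the page's own URL.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.found:
                    self.found.append(absolute)


def list_subcomponents(base_url, html_text):
    """Return the sub-component URLs of html_text, in order of appearance."""
    parser = SubcomponentLister(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.found
```

Each URL in the returned list would then be fetched and saved in its own file; ordinary links (a href) are deliberately ignored, since only the page's own components are wanted.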

The stem of the names of the output files is normally derived from a component of the url. If the url contains a path name, the stem is the final component of that path, less any dot-separated suffix and prefix.
For example, given
 http://www.vitanuova.com/inferno/old.index.html
the stem would be index. If there is no path name, but the url contains a domain name, the stem is the penultimate component of the domain name (eg, excluding trailing .com, and initial www, etc).
For example, given
 www.innerhost.vitanuova.com
the stem would be vitanuova. If all else fails, webgrab uses the stem webgrab.
Given a stem, the initial page is stored in stem.suffix where suffix is the suffix (eg, .html) of the name of the original page.
Subordinate pages are saved in a similar way in files named stem_1.suffix1, stem_2.suffix2, ... .
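
The naming rules above can be sketched in Python as follows. The functions derive_stem and output_names are hypothetical names used only for illustration, and several details are assumptions rather than documented behaviour: a url with no scheme is treated as HTTP, a file name with several dots keeps only the piece just before the suffix, a host with fewer than two components falls through to the default stem, and each suffix is taken to include its leading dot.

```python
from urllib.parse import urlsplit


def derive_stem(url):
    """Guess the output file stem for url, following the rules described above."""
    if "://" not in url:
        url = "http://" + url  # assumption: a bare host name means HTTP
    parts = urlsplit(url)

    # Rule 1: the final non-empty path component, less suffix and prefix.
    components = [c for c in parts.path.split("/") if c]
    if components:
        pieces = [p for p in components[-1].split(".") if p]
        if len(pieces) >= 2:
            return pieces[-2]  # drop the suffix and anything before the stem
        if pieces:
            return pieces[0]

    # Rule 2: the penultimate component of the domain name.
    labels = [part for part in (parts.hostname or "").split(".") if part]
    if len(labels) >= 2:
        return labels[-2]

    # Rule 3: if all else fails, use the fixed stem.
    return "webgrab"


def output_names(stem, suffixes):
    """Name the initial page stem+suffix and later pages stem_N+suffixN."""
    names = []
    for i, suffix in enumerate(suffixes):
        names.append(stem + suffix if i == 0 else f"{stem}_{i}{suffix}")
    return names
```

With the two examples given above, derive_stem yields index and vitanuova respectively; output_names("index", [".html", ".gif"]) gives index.html and index_1.gif.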

The options are:

 -r do not fetch subcomponents (just the `raw' source of url itself)

 -v print a progress report

 -vv print a chatty progress report

 -o stem use the stem as given

 -p body use HTTP POST instead of GET, posting body as the data
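
The effect of -p on the request method can be illustrated with Python's standard urllib, which makes the same GET or POST choice depending on whether a body is supplied. This is a sketch of the idea only, not of webgrab's implementation; build_request is a hypothetical helper, and encoding the body as UTF-8 is an assumption.

```python
from urllib.request import Request


def build_request(url, body=None):
    """Build a GET request, or a POST carrying body when one is given."""
    data = body.encode("utf-8") if body is not None else None
    # urllib selects POST automatically when data is not None.
    return Request(url, data=data)
```

Calling get_method() on the result reports which method would be used, without opening a network connection.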

Webgrab reads the configuration file /services/webget/config (if it exists), to look for the address of an optional HTTP proxy (in the httpproxy entry), and a list of domains for which a proxy should not be used (in the noproxy or noproxydoms entry). If symbolic network and service names might be involved, the connection server lib/cs needs to be already running.
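
The proxy decision implied by these entries might look like the following Python sketch. The syntax of the configuration file is not described on this page, so the hypothetical function choose_proxy takes the httpproxy value and the list of excluded domains as already-extracted arguments; matching a host against a domain by suffix is likewise an assumption.

```python
def choose_proxy(host, httpproxy, noproxydoms):
    """Return the proxy address to use for host, or None to connect directly."""
    if not httpproxy:
        return None  # no proxy configured
    host = host.lower().rstrip(".")
    for dom in noproxydoms:
        dom = dom.lower().strip(".")
        # Assumed rule: the host is the domain itself or lies inside it.
        if dom and (host == dom or host.endswith("." + dom)):
            return None
    return httpproxy
```

Matching on a leading dot keeps a host such as notvitanuova.com from being mistaken for a member of vitanuova.com.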
 FILES
 /services/webget/config
 SOURCE
 /appl/cmd/webgrab.b
 BUGS
It should read the proxy name from the charon(1) configuration file and not the webget configuration file.

It cannot do `secure' transfers (https).
Its HTML parsing is naive, but on the other hand, it is less likely to trip over HTML novelties.
 SEE ALSO
 cs(8)