[Discuss] What's the best site-crawler utility?
Tom Metro
tmetro+blu at gmail.com
Tue Jan 7 22:23:52 EST 2014
Matthew Gillen wrote:
> wget -k -m -np http://mysite
I create an "emergency backup" static version of dynamic sites using:
wget -q -N -r -l inf -p -k --adjust-extension http://mysite
The option -m is equivalent to "-r -N -l inf --no-remove-listing", but
I didn't want --no-remove-listing (I don't recall why), so I specified
the individual options, and added:
  -p
  --page-requisites
      This option causes Wget to download all the files that are necessary
      to properly display a given HTML page. This includes such things as
      inlined images, sounds, and referenced stylesheets.
  --adjust-extension
      If a file of type application/xhtml+xml or text/html is downloaded
      and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this
      option will cause the suffix .html to be appended to the local
      filename. This is useful, for instance, when you're mirroring a
      remote site that uses .asp pages, but you want the mirrored pages to
      be viewable on your stock Apache server. Another good use for this
      is when you're downloading CGI-generated materials. A URL like
      http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
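If you want that as a repeatable "emergency backup" job, a minimal sketch
would look something like the following (SITE_URL, BACKUP_DIR, and the dated
directory are placeholders I'm inventing for illustration, not part of my
actual setup):

  #!/bin/sh
  # Rough sketch only: SITE_URL and BACKUP_DIR are placeholder names.
  SITE_URL="http://mysite"
  BACKUP_DIR="$HOME/site-backups/$(date +%Y-%m-%d)"

  mkdir -p "$BACKUP_DIR"
  cd "$BACKUP_DIR" || exit 1

  # -q quiet, -N timestamping, -r recursive, -l inf unlimited depth,
  # -p page requisites, -k convert links for local viewing,
  # --adjust-extension append .html where needed
  wget -q -N -r -l inf -p -k --adjust-extension "$SITE_URL"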
> '-k' ... may or may not produce what you want if you want to actually
> replace the old site, with the intention of accessing it through a web
> server.
Works for me. I've republished sites captured with the above through a
server and found them usable.
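For example, one quick way to spot-check a capture (the port and path below
are just for illustration) is to serve the mirrored tree with any static
file server and click around in a browser:

  # wget puts the mirror in a directory named after the host;
  # browse http://localhost:8080/ to verify the pages render.
  cd "$BACKUP_DIR/mysite"
  python3 -m http.server 8080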
But generally speaking, not all dynamic sites can successfully be
crawled without customizing the crawler. And as Rich points out, if your
objective is not just to end up with what appears to be a mirrored site,
but actual clean HTML suitable for hand-editing, then you've still got
lots of work ahead of you.
-Tom
--
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/