[Discuss] What's the best site-crawler utility?
Tom Metro
tmetro+blu at gmail.com
Tue Jan 7 22:23:52 EST 2014
Matthew Gillen wrote:
> wget -k -m -np http://mysite
I create an "emergency backup" static version of dynamic sites using:
wget -q -N -r -l inf -p -k --adjust-extension http://mysite
The option -m is equivalent to "-r -N -l inf --no-remove-listing", but
I didn't want --no-remove-listing (I don't recall why), so I specified
the individual options, and added:
  -p
  --page-requisites
      This option causes Wget to download all the files that are necessary
      to properly display a given HTML page. This includes such things as
      inlined images, sounds, and referenced stylesheets.
  --adjust-extension
      If a file of type application/xhtml+xml or text/html is downloaded
      and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this
      option will cause the suffix .html to be appended to the local
      filename. This is useful, for instance, when you're mirroring a
      remote site that uses .asp pages, but you want the mirrored pages to
      be viewable on your stock Apache server. Another good use for this
      is when you're downloading CGI-generated materials. A URL like
      http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
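If you want that as a repeatable "emergency backup" job, a minimal sketch
would look something like the following (SITE_URL, BACKUP_DIR, and the dated
directory are placeholders I'm inventing for illustration, not part of my
actual setup):

  #!/bin/sh
  # Rough sketch only: SITE_URL and BACKUP_DIR are placeholder names.
  SITE_URL="http://mysite"
  BACKUP_DIR="$HOME/site-backups/$(date +%Y-%m-%d)"

  mkdir -p "$BACKUP_DIR"
  cd "$BACKUP_DIR" || exit 1

  # -q quiet, -N timestamping, -r recursive, -l inf unlimited depth,
  # -p page requisites, -k convert links for local viewing,
  # --adjust-extension append .html where needed
  wget -q -N -r -l inf -p -k --adjust-extension "$SITE_URL"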
> '-k' ... may or may not produce what you want if you want to actually
> replace the old site, with the intention of accessing it through a web
> server.
Works for me. I've republished sites captured with the above through a
server and found them usable.
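For example, one quick way to spot-check a capture (the port and path below
are just for illustration) is to serve the mirrored tree with any static
file server and click around in a browser:

  # wget puts the mirror in a directory named after the host;
  # browse http://localhost:8080/ to verify the pages render.
  cd "$BACKUP_DIR/mysite"
  python3 -m http.server 8080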
But generally speaking, not all dynamic sites can successfully be
crawled without customizing the crawler. And as Rich points out, if your
objective is not just to end up with what appears to be a mirrored site,
but actual clean HTML suitable for hand-editing, then you've still got
lots of work ahead of you.
-Tom
--
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/