Jonathan Polansky

A pretty good webpage.

Using wget to create a static local copy of a website

I was recently working on a project where I needed to retire a >10 year old system with more homegrown webapps than you could shake a stick at. Some were able to be migrated. Some were not. As I was about to shut the machine down, it was brought to my attention that a section of one of the non-migratable webapps would need to be preserved in read-only format. After some thought about opensource website spiders, wget came to the rescue.

Besides the traditional use of wget for locally saving individual URLs, wget provides advanced “recursive” retrieval functionality. This causes wget to follow links and retrieve entire subsets of pages to a certain level. Interesting flags are (from the man page):

-r
Turn on recursive retrieving.

-l depth
Specify recursion maximum depth level depth. The default maximum depth is 5.

-k
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

-p
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

Using the information from above, I was able to construct the wget command below to retrieve a page and all pages it links to (“-r”) one level below (“-l 1”) the specified page. The command also retrieves all asset dependencies of the pages and saves them locally (“-p”). Locally saved pages have their links rewritten such that all links work when browsing the pages locally (“-k”).

wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt --keep-session-cookies -r -l 1 -k -p -E ""

You may notice that I also used the “-E” option in the command. This instructs wget to append “.html” to any locally saved page that does not have a recognized HTML extension. wget will also update all links in the saved pages to reflect the filename change. In my case, this was necessary so that the web server would not try to interpret the GET parameters embedded in the page filenames as actual parameters.
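To see these flags in action without a real site, here is a throwaway sketch: it builds a tiny two-page site, serves it with Python's built-in web server, and mirrors it with the same “-r -l 1 -k -p” flags. The filenames, page content, and port (8731) are all made up for illustration; only the flag set comes from the command above.

```shell
# All filenames, content, and the port (8731) are invented for illustration.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/site"
cat > "$tmp/site/index.html" <<'EOF'
<html><head><link rel="stylesheet" href="style.css"></head>
<body><a href="about.html">About</a></body></html>
EOF
echo '<html><body>About page</body></html>' > "$tmp/site/about.html"
echo 'body { color: black; }' > "$tmp/site/style.css"

# Serve the site in the background with Python's built-in web server.
cd "$tmp/site"
python3 -m http.server 8731 >/dev/null 2>&1 &
server=$!
cd "$tmp"
sleep 1

# Mirror it: recursive (-r), one level deep (-l 1), rewrite links for
# local browsing (-k), and grab page requisites such as the CSS (-p).
wget -q -r -l 1 -k -p http://localhost:8731/
kill "$server"

# wget saves everything under a directory named after the host:port.
ls localhost:8731/
```

Opening the saved index.html in a browser then works entirely offline, since “-k” rewrote its links to point at the local copies.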

Finally, I had to log in to the webapp in question to be able to access the page I wanted to cache, which required me to incorporate cookies into the request. I used hints from a page I found to figure out how to have wget use cookies.

I used the Firefox plug-in “Live HTTP Headers” to figure out which GET parameters I had to pass to authenticate and set the proper cookie. My command to set the cookie was:

wget --keep-session-cookies --save-cookies=my-cookies.txt ""
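The two commands form a login-then-fetch dance: the first request saves the session cookie to my-cookies.txt, and later requests replay it with --load-cookies. Here is a self-contained sketch of that dance against a stand-in server; the /login and /protected paths, port 8732, cookie name, and GET parameters are all made up, and only the wget cookie flags come from the commands above.

```shell
# Stand-in server and paths are invented; only the wget flags are real.
set -e
tmp=$(mktemp -d)
cd "$tmp"
cat > server.py <<'EOF'
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        if self.path.startswith('/login'):
            # Pretend the GET parameters authenticated us; set a session cookie.
            self.send_header('Set-Cookie', 'session=abc123; Path=/')
            body = b'<html><body>logged in</body></html>'
        else:
            # Echo back whatever cookie the client presented.
            cookie = self.headers.get('Cookie', 'none')
            body = ('<html><body>cookie: %s</body></html>' % cookie).encode()
        self.send_header('Content-Type', 'text/html')
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

HTTPServer(('127.0.0.1', 8732), Handler).serve_forever()
EOF
python3 server.py &
server=$!
sleep 1

# Step 1: "authenticate" via GET parameters, saving the session cookie.
# --keep-session-cookies matters: session cookies have no expiry and
# would otherwise be dropped from the saved cookie file.
wget -q -O /dev/null --keep-session-cookies --save-cookies=my-cookies.txt \
    "http://127.0.0.1:8732/login?user=me&pass=secret"

# Step 2: replay the saved cookie on the request for the protected page.
wget -q -O page.html --load-cookies=my-cookies.txt \
    "http://127.0.0.1:8732/protected"
kill "$server"
cat page.html
```

In the real command at the top of the post, step 2 is the recursive “-r -l 1 -k -p -E” retrieval itself, with --load-cookies pointing at the same cookie file.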


