Sunday, May 10, 2009

Create a mirror of a website with Wget

GNU's wget command line download program is very popular, and not without reason. While you can use it simply to retrieve a single file from a server, it is much more powerful than that, offering a whole host of extra features.

One of the more advanced features in wget is the mirror feature. This allows you to create a complete local copy of a website, including any stylesheets, images and other supporting files. All the internal links will be followed and their pages downloaded as well (along with their resources), until you have a complete copy of the site on your local machine.

In its most basic form, you use the mirror functionality like so:

$ wget -m http://www.example.com/

There are several issues you might have with this approach, however.

First of all, it's not very useful for local browsing, as the links in the pages themselves still point to the real URLs rather than your local downloads. If, say, you downloaded http://www.example.com/, the link on that page to http://www.example.com/page2.html would still point to example.com's server, which is a right pain if you're trying to browse your local copy of the site while offline.

To fix this, you can use the -k option in conjunction with the mirror option:

$ wget -mk http://www.example.com/

Now, that link I talked about earlier will point to the relative page2.html. The same happens with all images, stylesheets and resources, so you should now get an authentic offline browsing experience.

There's one other major issue I haven't covered here yet - bandwidth. Quite apart from what you'll use on your own connection to pull down a whole site, you're going to be putting some strain on the remote server. Be kind and reduce the load on both ends, especially if the site is small and bandwidth comes at a premium. Play nice.

One of the ways in which you can do this is to deliberately slow down the download by placing a delay between requests to the server.

$ wget -mk -w 20 http://www.example.com/

This places a delay of 20 seconds between requests. Change that number as needed, and optionally add a suffix of m for minutes, h for hours, or d for ... yes, days, if you want to slow down the mirror even further.
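For example, to wait two minutes between each request, you might do something like this:

$ wget -mk -w 2m http://www.example.com/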

Now if you want to make a backup of something, or download your favourite website for viewing when you're offline, you can do so with wget's mirror feature. To delve even further into this, check out wget's man page (man wget) where there are further options, such as random delays, setting a custom user agent, sending cookies to the site and lots more.
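As a taste of one of those, if the site needs you to be logged in, you can feed wget a browser-exported cookies file with --load-cookies (cookies.txt here is just an assumed filename, in the Netscape cookie file format):

$ wget -mk --load-cookies cookies.txt http://www.example.com/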

More advanced wget usage:

No parent option

If you are doing a mirror, but you only want to mirror a subdirectory of the main site (for example, just /news/), you might run into a problem. Because many of the pages at /news/ link back to /, you'll inadvertently end up downloading the whole site.

The solution to this, pointed out by Todd in the comments, is to use the no parent option, -np.

In our example, we'd do:

$ wget -mk -w 20 -np http://example.com/news/

Update only changed files

Continuing in our mirroring scenario, another extremely useful option for conserving bandwidth on both sides is to download only the files that are newer than your local copies, based on the timestamps the server reports.

This option is -N.

$ wget -mk -w 20 -N http://example.com/

Thanks to Paul William Tenny in the comments for that tip.

Random delay on mirror

And finally for our mirror-specific tips, you can also randomise the delay between downloads. There are several reasons you might want to do this; in particular, some sites don't take kindly to being mirrored, even considerately, and will block clients they suspect of doing it (some bots can be pretty nasty, and you might get categorised as one of 'them').

Randomising the wait time - combined with the custom user agent option below - can help you get around this kind of automatic blocking.

If you do find yourself using this feature for that reason, please continue to be considerate and follow any rules regarding the content you've been given. Mirror responsibly.

$ wget -w 20 --random-wait -mk http://example.com/

The wait value - 20 in this case - is used as a base value for calculating the random wait times, which will vary between 0 and 2 times that value (in this case, 0-40 seconds).

Custom user agent

Some sites have strange restrictions on which browsers can access them, or serve different versions of the site depending on the browser used. I can't say I agree with sites that do this, unless there's a really good reason, but it shouldn't stop you from using wget for access.

Using wget, you can set a fake user agent string so that the program reports itself as a different browser.

$ wget -U "user agent" http://example.com/

Combine the -U option with any others you want, obviously. Here are a few user agent strings to get you started, with a combined example after the list:


IE6 on Windows XP: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Firefox on Windows XP: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14
Firefox on Ubuntu Gutsy: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14
Safari on Mac OS X Leopard: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/523.12.2 (KHTML, like Gecko) Version/3.0.4 Safari/523.12.2
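
Putting it all together, a considerate mirror of our hypothetical news subdirectory, using the Firefox on Ubuntu string above, might look something like this:

$ wget -mk -np -N -w 20 --random-wait -U "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14" http://example.com/news/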

That's it for now, if you have any more useful wget tips and tricks, share them in comments.
