Archiving a Static Site With Wget
There’s this static site that I know will probably be going down soon. It would be nice if I could create a full backup of it, and then maybe rehost it after that inevitably happens.
Archiving tools
Nowadays, we mainly have two kinds of archiving tools: “smart” and “dumb”.
Dumb Archives
These are basically your good ol’ crawlers. They download an HTML page, extract links, then follow those to find new pages. Some even follow links found in other types of files, like @import rules in CSS. The one thing they don’t do is evaluate JavaScript, which would greatly complicate things. After the crawl, you usually get all the files discovered on the site, preserved in their original folder structure.
Smart Archives
These use a headless browser in the background and simulate a human visiting the target site. After the site finishes loading, they take snapshots in different formats, so that with the correct tools, the visit can be accurately replayed later. There are simpler formats like PDF, HTML, and JPEG/PNG, and complex ones like WARC, which also keeps harvested metadata such as HTTP headers and allows a more accurate recreation of the original viewing experience. I’ve used words like replay and recreation here because we’re not trying to clone the actual site, just the viewing experience.
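For the simpler snapshot formats, a plain headless browser already gets you most of the way there. A minimal sketch with headless Chromium (the binary may be called chrome, chromium, or chromium-browser depending on your system):
# Render the page and save it as a PDF
chromium --headless --disable-gpu --print-to-pdf=page.pdf https://example.com/
# Or capture a PNG screenshot instead
chromium --headless --disable-gpu --screenshot=page.png https://example.com/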
Using Wget
Let’s take a look at our target first. It’s a fully static site from the jQuery age. Since I would like to rehost the site if needed, I have to use a dumb archiving tool. In the end, I chose wget(1) to get this done, specifically with this command:
wget --mirror \
    --page-requisites \
    --adjust-extension \
    --convert-links \
    --backup-converted \
    --domains=resource.com,other-res.com \
    --span-hosts \
    http://example.com/index.html
The --domains=resource.com,other-res.com and --span-hosts flags are optional: include them only if you also want to fetch external resources (more details below). With them, any external files hosted on resource.com or other-res.com are fetched as well and stored in their own folders; other external links remain unvisited.
With this alone, Wget can already fetch all pages reachable from index.html and convert any absolute URLs into relative ones. It also automatically adjusts the file extension if the URL doesn’t contain the correct one. The folder structure is preserved and can be browsed under ./example.com.
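To give a rough idea of the result, a crawl like this leaves a tree along these lines (the file names are hypothetical):
example.com/
    index.html
    news.html
    css/
        style.css
    js/
        main.js
resource.com/
    fonts/
        font.ttf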
Serving Files
You can now run a local file server and take a look at the files.
###### SWITCH TO THE NEW FOLDER FIRST ######
cd example.com
# With Python 3
python -m http.server 8000
# Or with Python 2
python -m SimpleHTTPServer
# Maybe you have php installed? Fancy.
php -S 127.0.0.1:8080
# Dirty your hands with yet another npm package
npm install http-server -g
http-server .
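Then open http://localhost:8000 (or 127.0.0.1:8080 for the PHP one) in a browser and click around to check that pages and assets load from disk rather than from the live site.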
JavaScript Redirects
You may find some buttons or links that redirect you to the online version of the site, or throw a 404 at you. These are usually links implemented in JavaScript, which are only injected into the DOM when the script executes. Wget doesn’t execute any scripts, so the task of discovering those hidden pages is left to you. At this step you may have to hand-edit any absolute URLs in the .js files, go back to your original $pwd, and manually fetch the missing pages with the same command; just swapping out the URL suffices.
For example, my target site has a lot of frontend redirects that look like this:
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Example Site</title>
    <meta charset="utf-8">
    <script>location.href = 'news.html';</script>
  </head>
  <body></body>
</html>
Wget has no way of knowing that news.html exists, so I have to fetch it manually. It will try to download some already-downloaded content, but nothing is redownloaded because Wget honours caching headers (ETag, Expires, etc.). You can also skip the check completely by specifying the --no-clobber flag.
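For instance, to pull in the news.html page from the snippet above, go back to the original $pwd and run the same command with only the URL swapped (re-add --span-hosts and --domains if that page needs external resources):
wget --mirror --page-requisites --adjust-extension --convert-links --backup-converted http://example.com/news.html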
External Resources
There’s also the problem of external resources. Your file server of choice will probably disallow traversing to folders outside your $pwd, since that would be a security issue. Because I want to rehost the site on a single domain, I’ll just move all the external resources into the example.com folder and do some search and replace. The --convert-links behavior makes this a bit tricky: for the sake of local file access (with the file:// URI scheme), external URLs, when fetched, are also converted into relative URLs. The end result looks like ../../../external.com/font.ttf. I’ll just search for them with a regex.
find ./ -type f -exec sed --in-place --regexp-extended 's/(\.\.\/)+external\.com\//\/external_fonts\//g' {} \;
This changes every ../external.com/ prefix into /external_fonts/. WARNING: --in-place updates your original files, which may be irreversible.
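The move itself is ordinary shell work. A sketch, assuming the external assets ended up in ./external.com next to ./example.com, matching the sed rule above:
# Run from the parent directory that holds example.com and external.com
mkdir -p example.com/external_fonts
mv external.com/* example.com/external_fonts/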
Of course, you could avoid all this by skipping the --domains and --span-hosts flags when running Wget and pulling in the resources yourself, which can be easier if you don’t have many external files. Or just skip this step entirely if you don’t care about them.
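If you do pull them in by hand, fetching a stray external file is a single call into the folder you plan to serve (reusing the hypothetical font file from above):
wget --directory-prefix=example.com/external_fonts https://external.com/font.ttf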
Final Checklist
To make sure I didn’t miss anything, I basically did a text search for example.com and excluded results that weren’t interesting to me.
grep -rn . -e "//example\.com" | grep -v -e "facebook" | grep -v "twitter" | grep -v ".orig" | grep -v "og:url" | grep -v "fb-like"
I just kept adding excluded keywords until all the junk was out. This way, I could be sure I hadn’t missed anything. I also checked for additional JavaScript redirects:
grep -rn . -e "location.href"
And stumbled upon this gem by accident:
if(loc.search('typo.com') != -1) {
location.href ='http://example.com/';
return;
};
I’ll just leave this here.
Update: 2023
This technique has worked fairly well for the past year or so. I’ve used it on multiple sites and got everything I needed. The only problem I’ve encountered is that some sites load resources with JavaScript, which Wget can’t handle. Here are two ways I’ve been using to deal with this.
- If the URLs have a predictable pattern, I’ll simply enumerate them:
use strict; use warnings;
$ENV{PATH} = '/bin:/usr/bin';
# Pass each proxy setting as its own -e argument so wget parses it correctly.
for my $id (0 .. 10) {
    system("wget", "--mirror", "--page-requisites", "--convert-links", "--backup-converted",
           "-e", "use_proxy=on", "-e", "https_proxy=127.0.0.1:12345",
           sprintf("https://example.com/images/outline/thumb/%02d.png", $id));
}
You can of course do this with any scripting language you’re familiar with; a plain shell version is sketched after this list.
- If they don’t follow a specific pattern (e.g. hashes, UUIDs) and the amount is manageable, I’ll do it by hand with this:
while read -r line; do wget --mirror --page-requisites --adjust-extension --convert-links --backup-converted --domains=example.com --span-hosts "https://example.com/$line"; done
This basically reads stdin and uses each line as the path of a URL for Wget to download.
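For completeness, here is the enumeration from the first approach as a plain shell loop (a sketch that relies on GNU seq for the zero padding, with the same hypothetical proxy settings):
for id in $(seq -f '%02g' 0 10); do
    wget --mirror --page-requisites --convert-links --backup-converted \
         -e use_proxy=on -e https_proxy=127.0.0.1:12345 \
         "https://example.com/images/outline/thumb/$id.png"
done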
I’ve also considered a “pull-through cache” approach, where an HTTP proxy automatically saves the files you’ve viewed to disk. Could be an idea to explore.
Dynamic content (anything not in the HTML) is always a headache for archiving. Special solutions are definitely needed on a case-by-case basis.
And that’s it for now. This is just a quick update, and I’ll consider rewriting this whole mess if I have to touch it again.
Result & Final Thoughts
By now, you should have a fully functional, offline mirror of your target site. For this site in particular, it was extra easy: there is no anti-crawl functionality enabled at the origin or at the CDN layer, and the site doesn’t use any virtual-DOM JavaScript framework like React or Vue. This simple architecture lets a dumb tool like Wget easily discover the other files on the server and fetch them with ease. For more complex modern sites, this approach may not work at all; in that case, consider a smart archiving tool, though in my experience they definitely don’t perform as well. The end result sometimes has an incorrect layout, and they’re also pretty slow. Anyways, happy archiving.