While helpful, prior responses fail to concisely, reliably, and repeatably solve the underlying question. In this post, we briefly detail the difficulties with each and then offer a modest httrack-based solution.

Background

Before we get to that, however, consider perusing mpy's well-written response. In a sadly neglected post, mpy rigorously documents the Wayback Machine's obscure (and honestly obfuscatory) archival scheme. Rather than sanely archiving sites into a single directory, the Wayback Machine ephemerally spreads a single site across two or more numerically identified sibling directories. To say that this complicates mirroring would be a substantial understatement. Understanding the horrible pitfalls presented by this scheme is core to understanding the inadequacy of prior solutions. Let's get on with it, shall we?

Prior Solution 1: wget

The related StackOverflow question "Recover old website off waybackmachine" is probably the worst offender in this regard, recommending wget for Wayback mirroring. Naturally, that recommendation is fundamentally unsound. In the absence of complex external URL rewriting (e.g., Privoxy), wget cannot be used to reliably mirror Wayback-archived sites.

As mpy details under "Problem 2 + Solution," whatever mirroring tool you choose must allow you to non-transitively download only URLs belonging to the target site. By default, most mirroring tools transitively download all URLs belonging to both the target site and sites linked to from that site, which in the worst case means "the entire Internet."

A concrete example is in order. When mirroring the example domain, your mirroring tool must include the assets provided by the target site and exclude the assets provided by other sites merely linked to from the target site. Failing to exclude such URLs typically pulls in all or most of the Internet archived at the time the site was archived, especially for sites embedding externally hosted assets (e.g., YouTube videos).

While wget does provide a command-line --exclude-directories option accepting one or more patterns matching URLs to be excluded, these are not general-purpose regular expressions; they're simplistic globs whose * syntax matches zero or more characters excluding /. Since the URLs to be excluded contain arbitrarily many / characters, wget cannot be used to exclude these URLs and hence cannot be used to mirror Wayback-archived sites. This issue has been on public record since at least 2009.

Prior Solution 2: ScrapBook

Prinz recommends ScrapBook, a Firefox plugin. That was probably all you needed to know. While ScrapBook's functionality does address the aforementioned "Problem 2 + Solution," it does not address the subsequent "Problem 3 + Solution," namely the problem of extraneous duplicates. It's questionable whether ScrapBook even adequately addresses the former problem. As mpy admits: "Although Scrapbook failed so far to grab the site completely." Unreliable and overly simplistic solutions are non-solutions.

Prior Solution 3: wget + Privoxy

mpy then provides a robust solution leveraging both wget and Privoxy. While wget is reasonably simple to configure, Privoxy is anything but reasonable. Or simple.

Due to the imponderable technical hurdle of properly installing, configuring, and using Privoxy, we have yet to confirm mpy's solution. It should work in a scalable, robust manner. Given the barriers to entry, this solution is probably more appropriate to large-scale automation than to the average webmaster attempting to recover small- to medium-scale sites.

Is wget + Privoxy worth a look? Absolutely. But most superusers might be better served by simpler, more readily applicable solutions.

New Solution: httrack

Enter httrack, a command-line utility implementing a superset of wget's mirroring functionality. httrack supports both pattern-based URL exclusion and simplistic site restructuring. The former solves mpy's "Problem 2 + Solution"; the latter, "Problem 3 + Solution."

$)/|/20120713212803$2/|g

Effectively, every 14-digit number enclosed between two / characters gets replaced with 20120713212803 (adjust that to the most recent snapshot of your desired site). This might be an issue if there are such numbers in the site structure not originating from the Wayback Machine. Not perfect, but fine for the Strukturtypen site.
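The timestamp substitution described in this post (every 14-digit number enclosed between slashes is replaced with one fixed snapshot ID) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the original rule, whose beginning is not preserved here: the function name pin_snapshot and the pattern are hypothetical, and the optional two-letter suffix such as im_ after the timestamp is an assumption about what the rule's second capture group was meant to keep.

```python
import re

# Snapshot ID taken from the post; adjust it to the most recent
# snapshot of the site you are recovering.
TARGET_SNAPSHOT = "20120713212803"

# A 14-digit number between two slashes, optionally followed by a
# two-letter modifier such as "im_" (the modifier is an assumption).
_TIMESTAMP = re.compile(r"/\d{14}((?:[a-z]{2}_)?)/")


def pin_snapshot(url, snapshot=TARGET_SNAPSHOT):
    """Rewrite every Wayback-style timestamp in url to one snapshot ID."""
    # A function is used as the replacement so that the digits of the
    # snapshot ID cannot be misread as part of a group reference.
    return _TIMESTAMP.sub(lambda m: "/" + snapshot + m.group(1) + "/", url)


if __name__ == "__main__":
    print(pin_snapshot(
        "http://web.archive.org/web/20110801041529/http://example.com/a.html"
    ))
```

As the post warns, the rewrite is purely textual: a 14-digit path component that belongs to the site itself (say, /files/12345678901234/) would be rewritten as well.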