Metanet Forums

Most holidays I usually go up to my beach-house, and there's no Internet there. So, I was wondering if it's possible to cache/download the entire RealN Forums as an archive to look through while I'm there.

Try this.

wget -r forums.therealn.com
or somesuch. Suki probably knows the command. I don't.

scythe33 wrote:wget -r forums.therealn.com
or somesuch. Suki probably knows the command. I don't.

All I did was write a spider in Python.

wget would probably download only the first page of everything, or maybe just the stylesheet if all of the post data is private (since the page would load all of the posts based on which tags you put in the URL... or whatever those & doodads are called; I don't know PHP).

Trying it anyway, expecting hilarious results...

...
Holy fuggin' crap, it worked. I mean, I hope you like reading URL's and are prepared to see a royal lot of "you need to log in for this" pages, but it's actually downloading every page of every topic. Amazing.

Code: Select all

wget -r http://forum.therealn.com

Good call, scythe. o_O

If you add the -k switch, it'll change all of the links into local URLs.

The problem with asking Linux nerds for help is that they often get so involved in finding a solution that they exceed the abilities of the person asking for help.

To run that command, UZ, you'll need either an Linux distribution or Cygwin. There's an excellent guide to setting up Cygwin here.

wait, wait, shouldn't there only be a handful of actual physical pages? How the fuck is it getting the whole forum when 95% of the pages are just the same viewtopic.php page with random variables after it? Is it actually parsing each page, looking for links, and then downloading the page that the link results in?

smartalco wrote:wait, wait, shouldn't there only be a handful of actual physical pages? How the fuck is it getting the whole forum when 95% of the pages are just the same viewtopic.php page with random variables after it? Is it actually parsing each page, looking for links, and then downloading the page that the link results in?

I'm guessing it does something like this: (I don't actually know, but this would be the simplest- though not the most efficient- solution)

IF adress starts with http://forum.therealn.com AND page has not yet been seen THEN
save page
repeat this process with every link on the page
END IF

So basically it just clicks every link, ignoring it if it is not part of therealn.com (to prevent it from attempting to download the entire interblag) or if it's seen it before (to prevent infinite loop) Oh and it probably ignores everything after a hash symbol (#) so that you don't get teh same page at different positions, but you do get a different page for different variable before the hash (topic and thread)

Random Thought: If you were to download the entire internet, (a) how many years would it take at various constant download speeds and (b) how much space would you need?

smartalco wrote:wait, wait, shouldn't there only be a handful of actual physical pages? How the fuck is it getting the whole forum when 95% of the pages are just the same viewtopic.php page with random variables after it? Is it actually parsing each page, looking for links, and then downloading the page that the link results in?

Sounds like you thought exactly what I thought would happen, but yeah, it does actually appear to follow on-site links.

Exüberance wrote:Random Thought: If you were to download the entire internet, (a) how many years would it take at various constant download speeds and (b) how much space would you need?

At the rate the pipes can supply data, you would never finish downloading the internet, as content is being created faster then you could download it (and this will continue to be true as internet speeds increase, as the rate of content creation will also increase)

Exüberance wrote:Random Thought: If you were to download the entire internet, (a) how many years would it take at various constant download speeds and (b) how much space would you need?

Much of the content on the Internet is dynamic; it wouldn't really be possible to download it all (imagine downloading every search query page on every search engine).

Oh yeah.... way to kill a thought experiment.

I guess what I'm wondering is how much space is currently taken up by everything on the internet (as in the filesize of each webpage and it's components, so dynamic pages is the filesize of the code, not each possible webpage you could download)

That would be like the uber1337 version of a jelly-bean contest except it would be impossible to actually figure out the answer :( that's no fun. I'm not even going to attempt to guess because even on a logarithmic scale I'd probably be way off.

You aren't allowed to keep that avatar if you just give up.

Exüberance wrote:Oh yeah.... way to kill a thought experiment.

I guess what I'm wondering is how much space is currently taken up by everything on the internet (as in the filesize of each webpage and it's components, so dynamic pages is the filesize of the code, not each possible webpage you could download)

That would be like the uber1337 version of a jelly-bean contest except it would be impossible to actually figure out the answer :( that's no fun. I'm not even going to attempt to guess because even on a logarithmic scale I'd probably be way off.

keep in mind that Google and other search engines do, in many ways, keep a local copy of the internet. Of course, search engines only deal in HTTP, and even then only in some of it - pages can forbid search engines from indexing them via robots.txt or meta tags, and even beyond that there's the section of the internet often referred to as the 'dark web' which, for various reasons, is inaccessible to search engines. A much higher percentage of the content out there is 'dark' than you might think.

If we look at the scripts that generate webpages and ignore things outside of HTTP(S), I'd imagine it's really quite small. The bulk of the information on the web is stored in databases of various sorts, the scripts only provide an interface to those databases.

All hail GNU wget.

http://www.gnu.org/software/wget/

Metanet Forums

Cache

Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache

Re: Cache