strugglingon, the best way to store this content depends on what you want to do with it.
But... I can tell you for sure that storing all the data in one file is not efficient; better to use a separate file for each node/domain you are processing. This can save time when you read the content locally.
For a more advanced scheme, I would recommend using a database to index the URLs and map each database record to its file.
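To make that a bit more concrete, here is a rough sketch of what I mean, not a finished solution. It assumes Python with the built-in sqlite3 module; the table name `pages`, the function names, and the hashed file names are all just my own picks for illustration. Each page goes into its own file under a per-domain folder, and the database only keeps the URL and the path to that file.

```python
import hashlib
import sqlite3
from pathlib import Path
from urllib.parse import urlparse


def open_index(db_path):
    """Open (or create) the SQLite index that maps URLs to stored files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, domain TEXT NOT NULL, path TEXT NOT NULL)"
    )
    return conn


def store_page(conn, root, url, content):
    """Write one page to its own file under a per-domain folder and index it."""
    # Replace ':' so a host with a port is still a valid folder name everywhere.
    domain = (urlparse(url).netloc or "unknown").replace(":", "_")
    # Hash the URL so the file name is safe no matter what the URL contains.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    target = Path(root) / domain / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, domain, path) VALUES (?, ?, ?)",
        (url, domain, str(target)),
    )
    conn.commit()
    return target


def load_page(conn, url):
    """Return the stored content for a URL, or None if it was never stored."""
    row = conn.execute("SELECT path FROM pages WHERE url = ?", (url,)).fetchone()
    if row is None:
        return None
    return Path(row[0]).read_text(encoding="utf-8")
```

You would call `store_page(conn, "pages", url, html)` once per downloaded page and `load_page(conn, url)` when you need it back. Since the domain is stored as a column, pulling every page of one site is a single query on `domain`.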
Uhm, you would certainly have to work in some regular expressions. I'd think that all that content would add up like mad.
Btw, what would you be using the content for?
Please be careful with copyright issues.
The idea is to use it to serialise the pages of online help manuals, so that the user can print the whole thing as a book rather than going through each page and printing it individually.