RSS feeds, or maybe something else.

Are you trying to apply this to all websites, or will you be targeting specific websites?

Given that estimates put the number of servers Google runs in the region of 900,000 (http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers, and that was in 2011), it’s unlikely that you are going to be able to match what Google can do anytime soon. Even if you could write an algorithm to match it, you would be severely lacking on the hardware side needed to implement it.

If you are looking to add specific websites that you can then target for marketing/support purposes, that is more realistic.

But as already said, you will need to develop a crawler that follows the links on a site’s web pages, as it is unreliable to rely on sitemaps. A sitemap won’t necessarily tell you that a page has been updated, and if the site owner made the sitemap by hand they may have forgotten to add pages to it.
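For what it’s worth, here is a rough Python sketch (standard library only) of reading a sitemap. The /sitemap.xml path is an assumption (many sites put it elsewhere or don’t have one at all), and the optional `<lastmod>` field it looks for is exactly the thing a hand-made sitemap tends to leave out, which is part of why sitemaps alone can’t be trusted for change detection:

```python
# A rough sketch, assuming the sitemap lives at the conventional /sitemap.xml path.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def read_sitemap(site_root):
    """Yield (url, lastmod) pairs from a site's sitemap.xml."""
    sitemap_url = site_root.rstrip("/") + "/sitemap.xml"
    with urllib.request.urlopen(sitemap_url, timeout=10) as response:
        tree = ET.parse(response)
    for entry in tree.iter(SITEMAP_NS + "url"):
        loc = entry.findtext(SITEMAP_NS + "loc")
        lastmod = entry.findtext(SITEMAP_NS + "lastmod")  # optional, often missing
        yield loc, lastmod

# Example:
# for loc, lastmod in read_sitemap("https://example.com"):
#     print(loc, lastmod or "no lastmod given")
```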


Hi there @Noppy,

Thanks for the input.

Those servers are mostly used to hold the data; they are not all used for crawling. You can think of them as something like the sum total of all the web hosting companies’ servers that exist in the world today.

They store entire copies of a domain’s web pages and even serve them as cached pages when the actual live page is down (you must have seen this a couple of times).

My objective is to crawl (the whole list of domains) and retain only a small amount of information per site, to be used in the analysis later.

What do you mean by “the whole list of domains”? Every domain on the Internet, or all the pages of those sites who sign up for your analytics services? If the latter, then you could simply ask your clients to ensure they add a sitemap, if you’re worried about missing links. If the former, then even without storing data, you’re going to need some serious amount of hardware to allow you to crawl large numbers of sites at once.


As I don’t work for Google I don’t know what percentage of those servers do the crawling, but even if only 1/100th of them are used for crawling, that would still be 9,000 servers!

There are currently over a billion websites in the world, so even a very small bit of information from each adds up to a lot of data. Even the analysis of such a huge data set is going to be hardware-heavy and will require some complex software to rank each site.

Additionally, it will need doing on a regular basis, which will again take time and resources.

I don’t mean to be negative, but if I understand you correctly I can’t see this working. I would suggest you scale this back to a very small set of pre-determined websites and add new ones when you want to engage with them.

Wrong information. There are between 350 million and 400 million registered domains in total, and of those only 27.8% are seriously functional with a significant amount of content on them. Please do not assume anything; various figures out there are marketing inflation.

Musk was refused by Vladimir Putin, and later he ended up creating his own rocket at 1% of the manufacturing cost Russia had projected. Please talk about exact numbers, the real ones.

#Common sense:
The total population of the world is around 5 billion. Look around you: how many people own a domain? Developing or underdeveloped countries don’t even have more than 20% internet penetration, so “billions of websites” is an inflated number with no realistic basis.

Not all at once, but periodically, streamlined into organized batches.

Yes, but that’s surely still multiple sites at one time, which is what I was commenting on. (I wasn’t trying to suggest you need to crawl every site at the same time.)

You didn’t answer my questions. Am I right in thinking your intention is to crawl as much of the Web as possible, not just selected customers?


Yes, the entire list of domains.

OK - thanks for clarifying that.

Can I ask what kind of information you envisage collecting?


I said websites, you said domains. They are different; that is why there are different numbers out there. You can have subdomains that are different ‘websites’.

Wrong, it is about 7.5 billion. http://www.worldometers.info/world-population/

Whilst I support people who want to compete with the big boys, there will need to be either a level of skill or an area of the market to exploit that has not yet been plundered.

Seeing as you are having to ask how Google crawls websites at even a basic level, I am yet to be convinced that you will be able to build a crawler capable of crawling even 100,000 sites on a regular basis. But I am happy to be proved wrong.


My numbers may be slightly wrong, but the idea was to give you “an idea of the reality around you”.

Finally, thank you so much.

Everything starts someday; even Larry Page was born naked, with no skills. It is not technology alone that creates companies, but the dreams and indefatigable will of people.

Nelson Mandela was imprisoned at the age of 34 and released when he was 61, yet he still achieved what seemed an impossible feat.

But for now, I am not even thinking of creating a search engine; I have a different model, much smaller than what the actual search engines do.

You do not need to know everything. Even today the founder of alibaba.com doesn’t completely understand what Bitcoin is. If I have to fight a case in a court of law, that doesn’t mean I will go and pursue a law degree; I will look for someone who can do that for me.

This seems very interesting, and it means that if we work on averages, every registered domain has between 5 and 6 subdomains on average, which seems like a big number.

There can also be websites that don’t have a domain name, just an IP address.

It might be worth looking at https://zmap.io/, which is linked from the tekeye link I posted.


Hypotheticals and estimated unknowns aside, I think a good start would be to write a crawler for your own site. Once you have the code to recursively iterate over your own pages you will have some idea of what’s involved. Once you’re happy with how it works for your own site you could then try it for other sites.
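If it helps, here is a minimal sketch of that kind of same-site crawler using only Python’s standard library. The function names and the 100-page cap are made up for illustration, and a real crawler would need error handling, politeness delays and content-type checks on top of this:

```python
# A minimal sketch of a same-site crawler: start from one page, follow <a href>
# links, and stay on the starting host. Illustrative only, not production code.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl_site(start_url, max_pages=100):
    host = urllib.parse.urlsplit(start_url).netloc
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urllib.parse.urljoin(url, href).split("#")[0]  # drop fragments
            if urllib.parse.urlsplit(absolute).netloc == host and absolute not in seen:
                queue.append(absolute)
    return seen

# Example:
# print(crawl_site("https://example.com/"))
```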

Before you attempt to crawl other sites, be warned. I once blocked the Yandex crawler because it was hitting my server more often than I liked. Not everyone monitors their server loads, but for those who do, if they are unhappy about your HTTP requests to their site, you may also end up being unhappy.
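Building on that warning, here is a hedged sketch of basic crawl politeness: consult robots.txt before fetching and pause between requests. The user-agent name and the two-second delay are arbitrary placeholders, not recommendations:

```python
# A minimal politeness sketch: honour robots.txt and pause between requests.
import time
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleCrawler/0.1"   # placeholder name, not a real crawler
_robots = {}                        # one robots.txt parser per host, cached

def allowed(url):
    parts = urllib.parse.urlsplit(url)
    host = parts.scheme + "://" + parts.netloc
    if host not in _robots:
        parser = urllib.robotparser.RobotFileParser(host + "/robots.txt")
        parser.read()               # fetches and parses the site's robots.txt
        _robots[host] = parser
    return _robots[host].can_fetch(USER_AGENT, url)

def polite_fetch(urls, delay_seconds=2.0):
    for url in urls:
        if not allowed(url):
            continue                # skip anything the site owner has disallowed
        with urllib.request.urlopen(url, timeout=10) as response:
            yield url, response.read()
        time.sleep(delay_seconds)   # pause so we don't hammer the server
```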


Even hosting companies may block us if they see too much traffic to their servers, unless we speak to them in advance to explain that we are not harming anyone and ask them to whitelist us (this is possible; I have spoken to DigitalOcean, for example, but conditions applied).
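If you do approach hosts about whitelisting, the usual first step is simply to identify your crawler clearly on every request, so an admin can see who you are and how to contact you. A small illustrative sketch, with an invented bot name, URL and email:

```python
# Sketch of identifying the crawler on every request; all names here are placeholders.
import urllib.request

HEADERS = {
    "User-Agent": "ExampleAnalyticsBot/0.1 (+https://example.com/bot; crawler@example.com)"
}

def fetch(url):
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```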

A ready-made script here.

Exactly. Just like Wix or Weebly: there are literally thousands of different websites on subdomains. A great example of a single domain including thousands of websites.

More on that is written here: https://www.millforbusiness.com/how-many-websites-are-there/
