I think RSS feeds help one understand which links are active on a website. Is there any other method?
I am asking because I want to know how we can find all the links that exist on a website, and whether there is a method to know when new links are added and when existing links are updated, changed or perhaps deleted.
Can a sitemap be found on every website? Do all webmasters create sitemaps for their respective websites?
I have one more question if you can answer:
Is crawling different from scraping? Search engines such as Google and Yandex crawl every website, store the content/data on their servers and then deliver results. When they crawl, aren’t their IPs blocked? How is such huge-scale crawling accomplished by search engine bots?
P.S. → Please advise if my second question belongs in a different topic and I need to open a new thread.
There are two types of sitemap: XML for search bots and HTML for human visitors. It is not required to have either; it’s left to the judgement of webmasters whether or not they need one or both. All my sites are quite small, and I don’t have a sitemap on any of them.
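For reference, a minimal XML sitemap follows the sitemaps.org protocol and looks something like this (the URLs here are hypothetical, not from any site in this thread). The optional `<lastmod>` element is the part that hints to a crawler that a page has changed:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2017-01-15</lastmod>
  </url>
  <url>
    <!-- lastmod is optional; many sites omit it -->
    <loc>https://www.example.com/about.html</loc>
  </url>
</urlset>
```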
I’m not sure I understand your second question, but this video might help:
I have a dream of building an analytics company in the future, and for that I will need to crawl the entire domain list. I presume that hosting companies will block IPs too, or perhaps there will be some other way to tell them that we are not spammers but are doing a similar job to what search engines do.
So to accomplish this, the basic fundamental logic is this:
A list of domains (such lists are available in the market).
A crawler script (I know how difficult it can be to build a highly scalable crawler script).
Each domain’s sitemap, i.e. all the URLs of that particular domain.
But I think it is not possible to crawl the entire web very often (every day, every week, or perhaps even every month). If we have a sitemap (or some similar method) that tells us whether a page has been updated (or a new page added), then perhaps we can crawl only those pages and append them to our database. This looks like a realistic approach.
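The incremental approach described above can be sketched in a few lines. This is a minimal illustration, assuming the target site publishes a sitemap.xml with `<lastmod>` dates; the function name and the `last_seen` store are hypothetical, not from the thread:

```python
# Sketch: given a site's sitemap and a record of when we last crawled each
# URL, return only the URLs that are new or have changed since our visit.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_to_recrawl(sitemap_xml, last_seen):
    """last_seen maps url -> datetime of our previous crawl of that url."""
    stale = []
    root = ET.fromstring(sitemap_xml)
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if loc is None:
            continue
        if loc not in last_seen:
            stale.append(loc)          # brand-new page: crawl it
        elif lastmod:
            modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
            if modified > last_seen[loc]:
                stale.append(loc)      # updated since our last visit
        # no <lastmod> and already seen: skip until the next full crawl
    return stale
```

In practice you would persist `last_seen` in a database and fall back to a periodic full crawl for sites that publish no sitemap, since (as noted below) many don’t.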
Realistic from your point of view perhaps, but what of site owners? Are you expecting that everyone will create sitemaps because your crawler needs them, when Google, Bing and other large search engines do not? You will need to work with the Web the way it is, not the way you would like it to be.
Have you read the Google links I posted earlier? (And if not, why not?)

[quote=“TechnoBear, post:4, topic:287453”]
It is not required to have either; it’s left to the judgement of webmasters whether or not they need one or both. All my sites are quite small, and I don’t have a sitemap on any of them.
[/quote]
@TechnoBear, I already did. Matt Cutts was saying that they re-crawl a URL only when an update has been made to it. On such a vast Internet, how do they know which URLs have been updated, or when a new URL has been added to a website?
I don’t work for Google, so I don’t know. I do know that they vary the crawl rate from site to site, depending on how frequently it is generally updated. My sites are fairly static and not crawled that often. SitePoint is crawled very frequently, because the content is ever-changing. Also, many webmasters use features such as “Fetch as Googlebot” to submit URLs.
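One widely used mechanism for this (not specific to Google, and only a sketch) is HTTP conditional requests: a crawler stores the `ETag` and `Last-Modified` headers from its previous fetch and sends them back on the next visit; if the server replies `304 Not Modified`, the page can be skipped without downloading it again. The function name below is illustrative:

```python
# Sketch: build a GET request that asks the server "only send the body
# if the page changed since my last visit".
import urllib.request

def conditional_request(url, etag=None, last_modified=None):
    req = urllib.request.Request(url)
    if etag:
        # validator the server gave us on the previous fetch
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req
```

Opening the request with `urllib.request.urlopen` would then raise an `HTTPError` with code 304 when the page is unchanged, which the crawler treats as "nothing to do here".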
Yes, the question is how I can do this. My working analogy was that Google must be spidering/crawling web pages (old, updated, new, etc.) according to some plan of action, though there are a few privileges that only the mighty Google has, such as that “fetch” method. So no need to get confused: my intention was to work out how I can do the same, and I was diving into Google’s method to see whether we can replicate it.
So understanding Google’s crawling ability was just one part of my goal of implementing crawling ability of my own.