RSS feeds, or maybe something else?

I think RSS feeds help one understand what links are active on a website. Is there anything else that does this?

I am asking because I want to know how we can find all the links that exist on a website, and whether there is a method to know when new links are added and when existing links are updated, changed or perhaps deleted.

A sitemap sounds like exactly what you’re looking for.
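
As a rough illustration, a basic XML sitemap can be fetched and read with a few lines of Python. The sitemap URL below is only a placeholder; not every site keeps its sitemap at that path, or has one at all.

```python
# Minimal sketch: fetch a site's XML sitemap and list the URLs it declares.
# "https://example.com/sitemap.xml" is a placeholder; a real site may keep its
# sitemap elsewhere (often referenced from /robots.txt) or may not have one at all.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # hypothetical location
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}  # standard sitemap namespace

with urllib.request.urlopen(SITEMAP_URL) as response:
    tree = ET.parse(response)

# Each <url> entry carries at least a <loc>; <lastmod> is optional.
for url in tree.getroot().findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", default="(no lastmod)", namespaces=NS)
    print(loc, lastmod)
```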


Can a sitemap be found on every website? Do all webmasters create sitemaps for their respective websites?

I have one more question, if you can answer:
Is crawling different from scraping? Search engines such as Google and Yandex crawl every website, store the content/data on their servers and then deliver results. When they crawl, aren’t their IPs blocked? How is crawling on such a huge scale accomplished by search engine bots?

P.S. Please advise if my second question is different enough that I need to open a new thread.

No and no.

There are two types of sitemap - XML for search bots and HTML for human visitors. It is not required to have either; it’s left to the judgement of webmasters whether or not they need one or both. All my sites are quite small, and I don’t have a sitemap on any of them.

I’m not sure I understand your second question, but this video might help:

https://www.youtube.com/watch?v=KyCYyoGusqs


No, no.

To crawl a website you need two things:

  1. the domain name, and
  2. all the links on that domain.

If you do not have the sitemap of a website, how will you know all the links on that domain?

If the site is properly constructed, you should only need the URL of one page. It should be possible to discover all the other pages by following links. Did you read the Google links I provided above?
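
As a very rough sketch (not how any particular search engine does it), here is the “follow the links from one page” idea in plain Python. The start URL is a placeholder, and a real crawler would also need politeness rules, rate limiting and proper error handling.

```python
# Minimal sketch of "start from one page and discover the rest by following links",
# using only the standard library. START_URL is a placeholder.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

START_URL = "https://example.com/"  # hypothetical starting page
MAX_PAGES = 50                      # keep the sketch bounded

class LinkCollector(HTMLParser):
    """Collect every href found in <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

seen, queue = set(), [START_URL]
domain = urlparse(START_URL).netloc

while queue and len(seen) < MAX_PAGES:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    try:
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except OSError:
        continue  # skip pages that fail to load
    parser = LinkCollector()
    parser.feed(html)
    for href in parser.links:
        absolute = urljoin(url, href)
        if urlparse(absolute).netloc == domain:  # stay on the same site
            queue.append(absolute)

print(f"Discovered {len(seen)} pages starting from one URL")
```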

What, exactly, is behind your questions here? Are you wondering about how search bots crawl your sites, or are you asking how you would spider another site?


I have a dream of building an analytics company in the future, and for that I will need to crawl the entire domain list. I presume that hosting companies will block our IPs too, or perhaps there is some other way around it, to tell them that we are not spammers and are doing a similar job to what search engines do.

So to accomplish this, the basic logic is this:

  1. A list of domains, which is available on the market.
  2. A crawler script (I know how difficult it can be to build a highly scalable crawler script; a small politeness sketch follows this list).
  3. Each domain’s sitemap, i.e. all the URLs of that domain.
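
On the worry about being blocked: well-behaved crawlers identify themselves with a User-Agent string and respect each site’s robots.txt, which is broadly what the big search engine bots do too. A minimal sketch, with a hypothetical bot name and placeholder URLs:

```python
# Minimal sketch of the "politeness" side of a crawler: identify yourself with a
# User-Agent string and honour robots.txt before fetching anything. The bot name
# and URLs below are placeholders for whatever your crawler would actually use.
import urllib.request
import urllib.robotparser

USER_AGENT = "MyAnalyticsBot/0.1 (+https://example.com/bot-info)"  # hypothetical
TARGET = "https://example.com/some-page"                           # hypothetical

robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

if robots.can_fetch(USER_AGENT, TARGET):
    request = urllib.request.Request(TARGET, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as resp:
        print(resp.status, len(resp.read()), "bytes fetched")
else:
    print("robots.txt disallows this URL for our bot; skip it")
```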

The links are on the pages of the site, usually in the nav menu, but some are found elsewhere on the page.
Crawlers don’t need a sitemap; it just helps them if they have difficulty finding some pages.


Thanks.

But I think it is not possible to crawl the entire web very often (every day, every week or perhaps every month). If we have a sitemap (or some other similar method) that tells us when a page has been updated (or a new page added), then perhaps we can crawl only those pages and update our database. This looks like a realistic approach.
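
As a sketch of that idea, assuming the site does publish a sitemap with `<lastmod>` values (many don’t), the crawler could compare those values against what it stored last time and re-visit only the new or changed URLs. The URLs and stored state below are hypothetical:

```python
# Minimal sketch of incremental crawling: compare each sitemap entry's <lastmod>
# with the timestamp we stored last time, and re-crawl only what changed.
# `previously_seen` stands in for whatever database the crawler keeps.
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
previously_seen = {  # hypothetical stored state: url -> lastmod string
    "https://example.com/about": "2017-01-01",
}

with urllib.request.urlopen("https://example.com/sitemap.xml") as resp:
    root = ET.parse(resp).getroot()

to_crawl = []
for entry in root.findall("sm:url", NS):
    loc = entry.findtext("sm:loc", namespaces=NS)
    lastmod = entry.findtext("sm:lastmod", namespaces=NS)
    # New URL, or lastmod newer than what we recorded -> schedule a re-crawl.
    # (Plain string comparison is enough for ISO-format dates in a sketch.)
    if loc not in previously_seen or (lastmod and lastmod > previously_seen[loc]):
        to_crawl.append(loc)

print("URLs needing a (re)visit:", to_crawl)
```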

Yes, but if you are talking about other people’s sites, which you have no control over, those sites may not have a sitemap; not every site does.


Realistic from your point of view perhaps, but what of site owners? Are you expecting that everyone will create sitemaps because your crawler needs them, when Google, Bing and other large search engines do not? You will need to work with the Web the way it is, not the way you would like it to be.


So Google, Yandex etc. can find all pages without a sitemap, all the time?

I am not assuming anything. If I had known the whole dynamics of crawling, I wouldn’t have posted the question.

A question is posted about something we do not know. It is a hunt for knowledge, to fill the missing gap.

Have you read the Google links I posted earlier? (And if not, why not?)

[quote=“TechnoBear, post:4, topic:287453”]
It is not required to have either; it’s left to the judgement of webmasters whether or not they need one or both. All my sites are quite small, and I don’t have a sitemap on any of them.
[/quote]

Yes, that is exactly the message.


Good.

And what did you learn from them about the need for a sitemap?


@TechnoBear, already did. Matt Cutts was saying that they re-crawl a URL only when that URL has been updated. On such a vast Internet, how do they know which URLs have been updated, or when a new URL is added to a website?

I don’t work for Google, so I don’t know. I do know that they vary the crawl rate from site to site, depending on how frequently it is generally updated. My sites are fairly static and not crawled that often. SitePoint is crawled very frequently, because the content is ever-changing. Also, many webmasters use features such as “Fetch as Googlebot” to submit URLs.
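
For what it’s worth, one general mechanism any crawler can use (no Google privileges required, though not every server supports it) is an HTTP conditional request: send the ETag or Last-Modified value saved from the previous visit, and an unchanged page comes back as 304 Not Modified without the body. A minimal sketch with placeholder values:

```python
# Minimal sketch of checking whether a page has changed via an HTTP conditional
# request. If the server supports ETag/Last-Modified, an unchanged page returns
# "304 Not Modified" with no body, which urllib surfaces as an HTTPError.
import urllib.request
import urllib.error

URL = "https://example.com/"  # hypothetical page to re-check
stored_etag = '"abc123"'      # hypothetical value saved from a previous fetch

request = urllib.request.Request(URL, headers={"If-None-Match": stored_etag})
try:
    with urllib.request.urlopen(request) as resp:
        print("Changed; re-crawl it. New ETag:", resp.headers.get("ETag"))
except urllib.error.HTTPError as err:
    if err.code == 304:
        print("Not modified since last crawl; skip it")
    else:
        raise
```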


Right, but that is a privilege that the mighty Google has. I am not sure whether they share that domain information with us through some API.

I’m getting confused here. You asked how Google does it, and I replied. If you’re wondering how you would do it, that’s a different question.


Yes, the question is how I can do this. My reasoning was that Google must be spidering/crawling web pages (old, updated, new, etc.) with some plan of action, but there are a few privileges that only the mighty Google has, such as that “fetch” method. So there is no need to get confused: my intention was never to confuse, and I was diving into Google’s methods to see whether we can do the same.

So understanding Google’s crawling ability was a subset of my larger aim of implementing crawling for myself.