Searching a whole website when pages are not indexed?

Hi there,

I am wondering if it is possible to search for a keyword across an entire website when certain pages are not indexed or the site has no search of its own. Would this be in any way possible?

Any thoughts would be great, thanks!

In code, or just from a browser? I could see how it might be possible (but not necessarily easy) to write some code to crawl the site, extract the visible data, and then perform the search. I could also think of a few ways that the site owner could prevent that, or at least make it difficult.

It would be from the browser, or code, or the command line, I guess?

For example, say I wanted to find the word “business” across the entire BBC website (I know it is huge and that keyword would return a lot of results, but it’s just an example), without using Google or the BBC site’s own search. How could I do this, and then return a list of all the URLs it appears in? So I guess, in a way, Google doesn’t really come into this.

Not sure if that makes sense.

You have two scenarios, one hard, the other harder:

  1. If the site is static, you crawl the site collecting all links and download each page, then search those pages for your keyword. See curl to get started. (There’s a minimal sketch of this after the list.)
  2. If the site isn’t static, the search string you’re looking for may live in a database and only appear on the page given the right secret handshake, or every odd Tuesday of the month. If you just process it as in method #1, you may or may not find the keyword. To catch every occurrence, you would need to provide the secret handshakes and run your crawler every other Tuesday.
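Here’s a minimal sketch of method #1, assuming the requests and beautifulsoup4 libraries (both assumptions; the function name and starting URL are illustrative). A real crawler would also need a politeness delay, robots.txt handling, and better error recovery:

```python
# Minimal same-site crawler: collect links, download pages,
# search each page's visible text for a keyword.
# Assumes: pip install requests beautifulsoup4
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl_and_search(start_url: str, keyword: str, max_pages: int = 100) -> list[str]:
    domain = urlparse(start_url).netloc
    to_visit, seen, hits = [start_url], set(), []
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that fail to load
        soup = BeautifulSoup(resp.text, "html.parser")
        # Search the visible text only, not the markup.
        if keyword.lower() in soup.get_text().lower():
            hits.append(url)
        # Queue same-domain links for crawling, dropping #fragments.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain:
                to_visit.append(link)
    return hits


print(crawl_and_search("https://example.com", "business"))
```

This only finds pages reachable by following links in the served HTML, which is exactly the limitation discussed further down the thread.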

Not to be pedantic, but if a page is not indexed, there’s a reason. The site owners have decided either a) it’s not for public consumption, or b) it’s not relevant/current anymore.

You can do a site-specific search on Google for a term by adding the site: operator, e.g. business site:bbc.co.uk.

Or, if you really want to go down a rabbit hole, you can look into domain-specific search engines.

But again, if it’s not indexed, it shouldn’t be considered valid…


Search engines only let you look for content that is in their index; content missing from the index may still exist on the site.
As for your question, it would be better to also try search engines like DuckDuckGo, Bing, and Yandex, which may have indexed pages that Google has not.

No, it is not possible to search for a keyword across an entire website if certain pages are not indexed by search engines and there is no search functionality provided on the site.

It depends on the specifics of that.

In a previous reply the suggestion is to gather all the links and then search those pages. That might work, except it would miss any pages that nothing links to. Usually, if a website has no reference to a page, the page is unused and/or old. One problem with gathering links is that links are often generated by JavaScript: a script can fill in the href and then trigger the navigation itself. I do not know how Google finds pages that are navigated to that way; maybe it does not. (One workaround is sketched below.)
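One workaround for script-generated links is to render the page in a headless browser before extracting the hrefs. A sketch using Playwright (my choice, not something from this thread; any headless-browser tool would do):

```python
# Extract links after JavaScript has run, using a headless browser.
# Assumes: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright


def rendered_links(url: str) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let scripts fill in hrefs
        # Read href off every anchor in the rendered DOM.
        links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
        browser.close()
    return links


print(rendered_links("https://example.com"))
```

Even this only catches links that end up in the DOM; a click handler that computes a URL and navigates on the fly would still be missed, which is the problem described above.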

Some web servers will list the folders of a website (at least the folders under the document root) and the files in them, but that feature is turned off for most sites. And for many technologies the physical files do not match the logical URLs sent to the browser.

The problem is finding the pages to search.

Searching each page (once a browser can receive it) is relatively easy. The easiest way is to grab the plain text (innerText) of the entire page. An alternative (I am not sure of the details here) is to walk the DOM tree visiting all the text nodes. I think there is a DOM API that makes that traversal easier, but it was fairly new when I tried it a few years ago and I never got it working. (Perhaps the main problem was that I was using C# and did not know how to get C# and JavaScript to work together.)
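Outside a browser, both approaches have rough Python analogues in BeautifulSoup (my substitution for the browser-side innerText/DOM-walk idea, not something from this thread): get all the text at once, or walk the parse tree string by string.

```python
# Two ways to get the searchable text of a downloaded page:
# all at once (like innerText), or by visiting each text node.
# Assumes: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = "<html><body><h1>BBC</h1><p>Business <b>news</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Drop script/style so their contents don't pollute the search.
for tag in soup(["script", "style"]):
    tag.decompose()

# Approach 1: the whole document as one string, like innerText.
print(soup.get_text(" ", strip=True))  # "BBC Business news"

# Approach 2: walk the tree, visiting each text node individually,
# the equivalent of walking the DOM for text elements.
for text in soup.stripped_strings:
    print(text)  # "BBC", "Business", "news"
```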

Would this be done in a browser, such as in a browser extension, or from a desktop program?

Each page of the website could be indexed into OpenSearch, which is a very powerful, flexible search system. For smaller sites, a small Node program with Pagefind might work well.
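A sketch of the OpenSearch route using its Python client, opensearch-py; the host, index name, and sample documents are all assumptions for a local default install (and a secured install would additionally need auth and TLS settings):

```python
# Index crawled pages into a local OpenSearch instance, then query it.
# Assumes: pip install opensearch-py, OpenSearch running on localhost:9200.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# One document per crawled page; "site-pages" is an illustrative index name.
pages = {
    "https://example.com/about": "About our business ...",
    "https://example.com/news": "Latest news ...",
}
for url, text in pages.items():
    client.index(index="site-pages", id=url, body={"url": url, "text": text})

client.indices.refresh(index="site-pages")  # make the docs searchable now

# Full-text search across every indexed page, listing matching URLs.
result = client.search(
    index="site-pages",
    body={"query": {"match": {"text": "business"}}},
)
for hit in result["hits"]["hits"]:
    print(hit["_source"]["url"])
```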

If you’ve got to register each page anyway, you might as well hit Ctrl-F on each page and paste the keyword into your browser’s find box…

If a page is not indexed, it will not be shown in search results. To check, press Ctrl+K to focus the browser’s search bar, then Ctrl+V to paste your query. :saluting_face: