  1. #1
    SitePoint Member
    Join Date: Oct 2011 · Posts: 11

    How do I create an index from an existing website?

    I'd like to make an online index of existing web pages.
    The website is not mine, and it doesn't have a search tool, nor will it have one anytime soon.

    I can download them all to my local computer and turn them all into WordPress pages (I'm good at that, but not at SQL), but I think my missing link is how to correlate the content with the real online page. If there were an existing tool/system to index pages, that would probably fill the gap, because I don't need the content for anything other than building the index. After that, the content is useless.

    Any idea?

  2. #2
    SitePoint Member
    Join Date: Jan 2012 · Location: Chennai · Posts: 10
    You can check for a sitemap of the website, if one exists. Generally, WordPress blogs (and most other blogs) have a sitemap, which you can access at url/sitemap.xml. If it's not a blog, then you have a real problem.
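    For illustration, here is a minimal PHP sketch of that check (example.com is a placeholder for the real site, and it assumes a standard sitemaps.org-format sitemap):

    <?php
    // Try the conventional sitemap location.
    $xml = @simplexml_load_file('http://example.com/sitemap.xml');
    if ($xml === false) {
        die("No sitemap at /sitemap.xml - you may have a real problem.\n");
    }
    // SimpleXML matches children in the parent's namespace, so this walks
    // the standard <urlset><url><loc> structure directly.
    foreach ($xml->url as $entry) {
        echo $entry->loc, "\n"; // one indexable page URL per <loc>
    }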

  3. #3
    SitePoint Member
    Join Date: Oct 2011 · Posts: 11
    OK, I've tried a PHP script (Sphider):
    The problem is that I can't get it to crawl with a browser's user agent, and it keeps obeying the robots.txt file, failing to index the pages marked as Disallow; or at least that's what the error message says. I tried changing the if conditionals in a few places to make it not find the robots file, or to ignore it, but that didn't work. Sphider-plus behaved the same way.
    If anyone knows how to do it, I'd appreciate the tip.
    Thanks.
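    For reference, a plain HTTP request never consults robots.txt at all; only the crawler's own code does. A minimal PHP/cURL sketch of fetching a page with a browser-style user agent (the URL is a placeholder; wiring the same idea into Sphider is a separate exercise):

    <?php
    // A raw HTTP request never reads robots.txt - only well-behaved
    // crawler code does. The URL below is a placeholder.
    $ch = curl_init('http://example.com/org_board.p_main');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return body as string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    // Send a browser-style user agent in case the server filters on it.
    curl_setopt($ch, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7) AppleWebKit/534.30 (KHTML, like Gecko)');
    $html = curl_exec($ch);
    curl_close($ch);
    echo ($html === false) ? "Request failed\n" : $html;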

  4. #4
    SitePoint Member
    Join Date: Mar 2012 · Posts: 3
    Quote Originally Posted by rammurtee:
    You can check for a sitemap of the website, if one exists. Generally, WordPress blogs (and most other blogs) have a sitemap, which you can access at url/sitemap.xml. If it's not a blog, then you have a real problem.

  5. #5
    SitePoint Enthusiast SitemapGenerator
    Join Date: Nov 2007 · Posts: 90
    If you switch off easy mode in A1 Sitemap Generator, you can configure the "webmaster filters" tab to ignore nofollow, noindex, robots.txt, etc. (My guess is that other crawler/sitemapper solutions have similar options if you look for them. Try asking the developer of the script/program you use!)
    A1 Website Analyzer - Fix broken links, duplicate titles, custom text search, sculpt links
    A1 Sitemap Generator - Build xml, video, image, mobile, visual HTML/CSS sitemaps
    :: WebHelpForums.Net :: Support forum for the A1 tools suite.

  6. #6
    SitePoint Member
    Join Date: Oct 2011 · Posts: 11
    Just to stop the stream of comments not related to my case:

    As I said before, the site is not mine.
    It somehow doesn't allow robots: I can browse the pages, but Sphider can't fetch them.
    The site is OLD and custom made. It's not WordPress, nor does it use plugins.
    Even if it did, I don't have access to edit a thing.

    Let's start from the fact that the site is as I described and can't be changed, and that I'll be accessing it from an external server.

    It might have a "search" feature, but I don't know all the query strings/variables commonly used by old forum platforms, so I can't test for it. (Again, I think it was completely custom made, but you know, there have always been trends in programming.)

    Can you tell?
    The URLs end with (e.g.) /org_board.show_msg?an_msg_id=1787112
    and the main boards page is at /org_board.p_main
    Can you guess the search query?
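    For anyone who wants to test guesses systematically, here is a small PHP sketch that probes candidate endpoints and reports the HTTP status. Every path and parameter name below is purely hypothetical, patterned on the two /org_board.* URLs above; none is a confirmed endpoint of the site:

    <?php
    // All paths/parameters here are hypothetical guesses modelled on the
    // /org_board.* URLs quoted above - none is a known endpoint.
    $base = 'http://example.com'; // placeholder for the real site
    $candidates = array(
        '/org_board.p_search?p_text=test',
        '/org_board.search?keyword=test',
        '/org_board.p_main?search=test',
    );
    foreach ($candidates as $path) {
        $headers = @get_headers($base . $path); // e.g. "HTTP/1.1 200 OK"
        echo $path, ' => ', ($headers ? $headers[0] : 'no response'), "\n";
    }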

  7. #7
    SitePoint Mentor John_Betong
    Join Date: Aug 2005 · Location: City of Angels · Posts: 1,833
    Quote Originally Posted by sergiozambrano:
    Just to stop the stream of comments not related to my case:
    Take a look at Xenu; its reporting on external sites is comprehensive - it may help.



    http://home.snafu.de/tilman/xenulink.html

  8. #8
    SitePoint Enthusiast SitemapGenerator
    Join Date: Nov 2007 · Posts: 90
    It sounds straightforward (but might not be). No matter what crawler software you end up using, make sure it ignores robots.txt and nofollow/noindex instructions. Also, from your description, the website's URLs have non-standard file extensions, which means you should probably remove the file-extension list(s) in whatever crawler software you use and depend on MIME types instead. (That would, e.g., be relevant for my suggestion at least.) If you have trouble indexing the website with some crawler tool, I recommend you contact the developer of the tool and ask.
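    For instance, the MIME-type idea boils down to something like this sketch (the URL is the example one quoted earlier in the thread; a crawler would run this check for every link it finds):

    <?php
    // Decide whether to index a URL by its Content-Type header rather
    // than by file extension.
    $url = 'http://example.com/org_board.show_msg?an_msg_id=1787112';
    $headers = @get_headers($url, 1); // 1 => associative array of headers
    $type = isset($headers['Content-Type']) ? $headers['Content-Type'] : '';
    if (is_array($type)) {            // redirects can yield several values
        $type = end($type);
    }
    if (stripos($type, 'text/html') === 0) {
        echo "Served as HTML - index it\n";
    } else {
        echo "Skip: served as '$type'\n";
    }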

  9. #9
    SitePoint Member
    Join Date: Oct 2011 · Posts: 11

    UPDATE

    Stupidly, I didn't check HOW the links appear, only where they pointed.
    It turns out the links open the pages I want with JavaScript, which Sphider can't process.

    At least now I know how the pages are addressed, and I can increment the query string while downloading.

    Is there any PHP script or Mac software (or a Firefox/Chrome extension?) that can download web pages from a URL range?
    That won't index the original pages, but it would give me a DB I can work with.

    Any idea?

  10. #10
    Theoretical Physics Student Jake Arkinstall
    Join Date: May 2006 · Location: Lancaster University, UK · Posts: 7,062
    I haven't come across one, but it'll do you some good to learn how to do it in PHP yourself. Depending on off-the-shelf software for a basic task like that is never a good thing.
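    For example, a minimal sketch of that do-it-yourself approach, using the URL pattern quoted earlier in the thread (the domain, the ID range, and the pages/ output directory are placeholders):

    <?php
    // Walk a numeric ID range, fetch each page, and save the raw HTML
    // for later processing. Domain and range bounds are placeholders.
    $base = 'http://example.com/org_board.show_msg?an_msg_id=';
    @mkdir('pages');                   // output directory for raw HTML
    for ($id = 1; $id <= 1000; $id++) {
        $html = @file_get_contents($base . $id);
        if ($html === false) {
            continue;                  // gaps in the ID sequence: skip them
        }
        file_put_contents("pages/msg_$id.html", $html);
        sleep(1);                      // be polite to a server you don't own
    }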
    Jake Arkinstall
    "Sometimes you don't need to reinvent the wheel;
    Sometimes its enough to make that wheel more rounded"-Molona

