  1. #1
    SitePoint Member
    Join Date
    Sep 2004
    Location
    canada
    Posts
    2

    Extracting information from html

    Hi ppl!

    I need some help with extracting information from html.

    I'll give a brief description of my problem. I've got a webpage with several links on it, let's say thousands. Is there a way to extract the information from the pages those links point to, without clicking on each link and copying the information from the page displayed?

    Hope I've made myself clear.

    Thanks for taking time to read this.

    Take care

    Grek

  2. #2
    $this->toCD-R(LP); vinyl-junkie's Avatar
    Join Date
    Dec 2003
    Location
    Federal Way, Washington (USA)
    Posts
    1,524
    If I understand what you're asking, you can use a title attribute with each link. For example:
    Code:
    <a href="http://www.mysite.com/" title="Page content description here">My Link</a>
    When you mouseover the link, it will show you a description of the page's content.

    Hope this is what you're looking for.

  3. #3
    gingham dress, army boots... silver trophy redux's Avatar
    Join Date
    Apr 2002
    Location
    Salford / Manchester / UK
    Posts
    4,838
    I think you're after an offline browser / spider:
    http://www.google.com/search?q=offline+browser+spider

  4. #4
    SitePoint Member
    Join Date
    Sep 2004
    Location
    canada
    Posts
    2

    Extracting......

    Quote Originally Posted by vinyl-junkie
    If I understand what you're asking, you can use a title attribute with each link. For example:
    Code:
    <a href="http://www.mysite.com/" title="Page content description here">My Link</a>
    When you mouseover the link, it will show you a description of the page's content.

    Hope this is what you're looking for.
    ********

    Thanks for the reply, but I guess that's not exactly what I wanted. You mentioned the contents being displayed when mousing over the link.

    Let me put it in simpler terms.

    In the webpage, each link leads to some information that I need to download. As there are an enormous number of links involved, I'm looking for a better way to extract (download) that information without having to open up each link.

    Hope I've made myself clear this time around!

    Thanks

    Grek

  5. #5
    Application Developer shabbirbhimani's Avatar
    Join Date
    Apr 2004
    Location
    India
    Posts
    2,272
    In the webpage, each link contains some information that I need to download. As there are an enormous number of links involved, I'm looking for a better way to extract(download) the info contained without having to open up each link.
    Try this:
    http://www.download.com/3000-2377-10277606.html

    Quote Originally Posted by www.cnet.com
    HTTrack is a free and easy-to-use offline browser utility. You can download the contents of entire Web sites from the Internet to a local directory for offline viewing. Simply open a page of the mirrored Web site in your browser and browse the site link by link as if you were viewing it online. HTTrack also can update existing mirrored sites and resume interrupted site downloads. The program is fully configurable and includes an integrated help system. It crawls M3U and AAM files and can cache to a ZIP file. It also handles CSS.

  6. #6
    Non-Member Big Fat Bob's Avatar
    Join Date
    Sep 2004
    Location
    United Kingdom (Come)
    Posts
    79
    Yo

    I would use the DOM extension to get each A tag and its attributes, such as the HREF.
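    To make the link-collecting step concrete, here is a rough sketch of the same idea. It is written in Python with the standard-library html.parser rather than PHP's DOM extension, so treat it as an illustration of the approach and not as the PHP code itself. The names LinkCollector and extract_links, and the sample markup, are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names,
        # so <A HREF="..."> is matched as well as <a href="...">.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "page2.html" into full URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link target found in the given HTML text, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = (
        '<A HREF="http://www.mysite.com/" title="Page content description here">'
        'My Link</A> <a href="page2.html">Two</a>'
    )
    print(extract_links(sample, "http://www.mysite.com/index.html"))
```

    The resulting list of URLs is what you would then hand to whatever does the downloading.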

    Then, once you have this, use PHP's own socket functions to open each linked file, and then you can do whatever you want with it.

    Word of warning, though: if there are a lot of links, you may want to run this as a batch process instead of from a browser, if that was your intention.
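    A batch run along those lines might look roughly like the sketch below. Again this is Python standing in for PHP, using urllib.request from the standard library instead of raw sockets. The function names, the page_0001.html file naming, the one-second pause and the links.txt input file are all assumptions for the example, not anything from the original site.

```python
import sys
import time
import urllib.request
from pathlib import Path


def fetch(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_all(urls, out_dir, fetch_page=fetch, delay_seconds=1.0):
    """Fetch each URL in turn and save it as page_0001.html, page_0002.html, ...

    Returns a list of (url, error message) pairs for the pages that failed,
    so that one dead link does not stop the whole run.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures = []
    for index, url in enumerate(urls, start=1):
        try:
            (out / f"page_{index:04d}.html").write_bytes(fetch_page(url))
        except Exception as exc:  # broad on purpose: record it and keep going
            failures.append((url, str(exc)))
        time.sleep(delay_seconds)  # pause so the remote server is not hammered
    return failures


if __name__ == "__main__":
    # Hypothetical usage: python batch.py links.txt downloaded_pages
    # where links.txt holds one URL per line.
    if len(sys.argv) == 3:
        url_list = Path(sys.argv[1]).read_text().split()
        for failed_url, error in save_all(url_list, sys.argv[2]):
            print("failed:", failed_url, error)
```

    Run from the command line (or a scheduled job), this avoids the browser timing out halfway through a long list of links.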

