  1. #1
    SitePoint Member
    Join Date
    Jan 2008
    Posts
    14
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Getting all links from page

    I'm not talking about a simple preg_match_all script. What if I want to get ALL links from a webpage, regardless of how they are generated? For example, some links are generated by a JavaScript feed, some are in an iframe, etc. These are links that don't show in the source code.

    Is there any way to just scan the page?

  2. #2
    SitePoint Author silver trophybronze trophy
    wwb_99's Avatar
    Join Date
    May 2003
    Location
    Washington, DC
    Posts
    10,576
    Mentioned
    4 Post(s)
    Tagged
    0 Thread(s)
    Methinks document.getElementsByTagName('a') would be a good place to start.
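
A minimal sketch of that suggestion, assuming the script runs in the page itself. The helper name collectHrefs is made up for illustration and is not from the thread. It reads the href property of each element and skips anchors that have none:

```javascript
// Hypothetical helper: gather the href of every anchor that has one.
// `anchors` can be any array-like list, such as the collection
// returned by document.getElementsByTagName('a').
function collectHrefs(anchors) {
    var hrefs = [];
    for (var i = 0; i < anchors.length; i++) {
        // An <a> without an href attribute reports an empty string here.
        if (anchors[i].href) {
            hrefs.push(anchors[i].href);
        }
    }
    return hrefs;
}

// Only touch the DOM when one is available (i.e. in a browser).
if (typeof document !== 'undefined') {
    console.log(collectHrefs(document.getElementsByTagName('a')));
}
```

Because the elements are read at call time, links that a script has already added to the page are included.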

  3. #3
    SitePoint Guru
    Join Date
    Apr 2006
    Posts
    802
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    document.links returns a collection of all the a and area elements that have an href attribute,
    which may be what you want.

  4. #4
    Unobtrusively zen silver trophybronze trophy
    paul_wilkins's Avatar
    Join Date
    Jan 2007
    Location
    Christchurch, New Zealand
    Posts
    14,526
    Mentioned
    83 Post(s)
    Tagged
    4 Thread(s)
    Quote Originally Posted by mrhoo View Post
    document.links returns a node list of all the a elements with href attributes, which may be what you want.
    Is document.links updated over the lifetime of the page, or is it updated only when the page is loaded?

  5. #5
    SitePoint Guru
    Join Date
    Apr 2006
    Posts
    802
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    If you assign document.links to a global variable, its value is live:
    the next time it is read, it will contain the links currently in the document,
    including any new elements, and no longer including any that have had their href attribute removed
    or that have been removed from the page entirely.
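
One practical consequence of a live collection: if you want a fixed list that does not change while you work through it, copy it first. A rough sketch, where the helper name snapshot is an assumption of mine rather than anything built in:

```javascript
// Hypothetical helper: copy an array-like collection into a plain array.
// The copy is static, so later changes to the page do not affect it.
function snapshot(collection) {
    return Array.prototype.slice.call(collection);
}

// Only touch the DOM when one is available (i.e. in a browser).
if (typeof document !== 'undefined') {
    var fixedList = snapshot(document.links);
    // fixedList.length stays the same even if links are added or removed later,
    // whereas document.links.length would change.
    console.log(fixedList.length);
}
```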

  6. #6
    SitePoint Member
    Join Date
    Jan 2008
    Posts
    14
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Mrhoo, will document.links get links regardless of how they are generated, for example if they are generated dynamically by JavaScript?

  7. #7
    SitePoint Wizard gRoberts's Avatar
    Join Date
    Oct 2004
    Location
    Birtley, UK
    Posts
    2,439
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    wwb_99 has given the best example, I think. The collection returned by getElementsByTagName is live, so it always reflects the current state of the page.


  8. #8
    SitePoint Member
    Join Date
    Jan 2008
    Posts
    14
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    But only if the element "a" is in the source code, right? When javascript displays links dynamically, the links don't show in the source code.

  9. #9
    SitePoint Wizard gRoberts's Avatar
    Join Date
    Oct 2004
    Location
    Birtley, UK
    Posts
    2,439
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    No. JavaScript reads the document as it currently is, so even if there are no links in the originally loaded source (what gets SAVED) and you then add links using JavaScript, it will show the links currently on the page.

    example:

    Code:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <title>Untitled Document</title>
            <link rel="stylesheet" type="text/css" media="screen" />
            <style type="text/css"></style>
            <script type="text/javascript">
                
                window.onload = function() {
                    
                    var urls = document.getElementsByTagName('A');
                    alert('This page has ' + urls.length + ' links hard coded. \n\n I am now going to add another 5 using Javascript. Click the button to test it!');
                    var p = urls[urls.length-1].parentNode;
                    // Adds links numbered 4 to 8, five in total, to match the message above.
                    for(var i = 4; i < 9; i++) {
                        var a = document.createElement('A');
                            a.href = '#';
                            a.appendChild(document.createTextNode(i + ' '));
                        p.appendChild(a);
                        p.appendChild(document.createElement('BR'));
                    }
                }
    
                function getLinks() {
                    var urls = document.getElementsByTagName('A');
                    alert('There are now ' + urls.length + ' links');
                }
    
                function addLink() {
                    var urls = document.getElementsByTagName('A');
                    var p = urls[urls.length-1].parentNode;
                    var a = document.createElement('A');
                        a.href = '#';
                        a.appendChild(document.createTextNode(urls.length+1 + ' '));
                    p.appendChild(a);
                    p.appendChild(document.createElement('BR'));
    
                }
    
            </script>
        </head>
        <body>
            
            <input type="button" onclick="getLinks()" value="Show how many links" /> <input type="button" value="Add Link" onclick="addLink()" />
            <br />
            <a href="#">1 </a><br />
            <a href="#">2 </a><br />
            <a href="#">3 </a><br />
    
        </body>
    </html>


  10. #10
    SitePoint Guru
    Join Date
    Apr 2006
    Posts
    802
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    the links don't show in the source code.
    Not with View Source, because you haven't changed the original file.

    If you read document.body.innerHTML all the current links will be included, as they will if you call document.getElementsByTagName('a') or document.links.

    The main difference between the tagName and the links methods is that document.links only returns elements (a and area) that have an href attribute, while the tagName method returns every a element, named anchors as well as links.
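
To see that difference in practice, here is a small sketch that sorts the result of the tagName method into real links and plain anchors. The function name splitAnchors and the shape of its return value are my own invention for illustration:

```javascript
// Hypothetical helper: separate anchors that are links (have an href
// attribute) from anchors that are not (for example <a name="top">).
function splitAnchors(anchors) {
    var result = { links: [], others: [] };
    for (var i = 0; i < anchors.length; i++) {
        if (anchors[i].hasAttribute('href')) {
            result.links.push(anchors[i]);
        } else {
            result.others.push(anchors[i]);
        }
    }
    return result;
}

// Only touch the DOM when one is available (i.e. in a browser).
if (typeof document !== 'undefined') {
    var split = splitAnchors(document.getElementsByTagName('a'));
    console.log(split.links.length + ' links, ' + split.others.length + ' other anchors');
}
```

On a page with no area elements, split.links should hold the same elements as document.links.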

