  1. #1
    silentcollision

    [Regex] Searching for .pdf documents in source code

    Hey,

    For a project I'm building, I want to look through the source code of a document, and pull out any .pdf links that it refers to.

    I'd like to link to the documents that I find, and so I would like to get the full path to the file including the domain.

    I've been reading up on regex and came up with the following:

    Code PHP:
    # Search for full URL PDF documents
    $regex_1 = "/^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)\/[a-zA-Z0-9_-]*\/*[a-zA-Z0-9_-]+\.pdf/";
    preg_match_all($regex_1, $page_contents, $matches1);
    foreach($matches1[0] as $value1) {
    	echo $value1 . "<BR>";
    }
     
    # Search for relative path documents
    $regex_2 = "/[a-zA-Z0-9_-]*\/*[a-zA-Z0-9_-]+\.pdf/";
    preg_match_all($regex_2, $page_contents, $matches2);
    foreach($matches2[0] as $value2) {
    	echo $value2 . "<BR>";
    }

    The first regex, which should find all full links to PDF documents (www.website.com/link/to/pdf.pdf), returns nothing. The second regex, which should find the relative links (path/to/pdf.pdf), works, except when the file is more than one folder deep:

    path/to/pdf.pdf
    Becomes:

    to/pdf.pdf
    I realise the second regex will also pick up the links matched by the first one. I figured I could just check whether the document already exists in the database before dealing with it.

    Hope this makes sense,

    Thanks,
    Chris
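
A likely explanation for both symptoms, as a rough sketch rather than a definitive fix: without the `m` modifier, the `^` in the first pattern only lets a match begin at the very start of `$page_contents`, and `[a-zA-Z0-9_-]*\/*` in both patterns allows at most one folder name. The combined pattern below drops the anchor and repeats the folder segment. The sample input is hypothetical, and the pattern does not capture an `http://` prefix.

```php
<?php
// Hypothetical sample input; not taken from the original page.
$page_contents = 'See <a href="docs/2006/report.pdf">report</a> and '
               . '<a href="http://www.website.com/link/to/pdf.pdf">pdf</a>';

// No ^ anchor, so a match may start anywhere in the text.
// (?:[a-zA-Z0-9_-]+\/)* allows any number of folder segments.
// The optional leading group picks up a domain when one is present.
$regex = '/(?:[a-zA-Z0-9.-]+\.(?:com|org|net|mil|edu)\/)?'
       . '(?:[a-zA-Z0-9_-]+\/)*[a-zA-Z0-9_-]+\.pdf/i';

preg_match_all($regex, $page_contents, $matches);

foreach ($matches[0] as $value) {
    echo $value . "<br>\n";
}
```

With the `i` modifier the upper-case TLD alternatives are no longer needed.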

  2. #2
    Wildhoney
    PHP Code:
    $szDocument = 'Bleh<a href="http://www.website.com/link/to/pdf.pdf" class="test">Content here</a><input>';

    preg_match_all('/<a .*href="(.*\.pdf)" .*>/iUs', $szDocument, $aMatches);
    print_r($aMatches);

  3. #3
    silentcollision
    Originally posted by Wildhoney:
    PHP Code:
    $szDocument = 'Bleh<a href="http://www.website.com/link/to/pdf.pdf" class="test">Content here</a><input>';

    preg_match_all('/<a .*href="(.*\.pdf)" .*>/iUs', $szDocument, $aMatches);
    print_r($aMatches);

    Thank you, but when I run it against a page's source code, it seems to echo out the entire source of the page.

    Any ideas why?
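
One plausible cause, shown as a small sketch on a made-up page: with the `s` modifier the dot also matches newlines, and `.*` (even made lazy by `U`) is free to run across tag boundaries. The match therefore starts at the first `<a ` on the page, and the capture runs from the first `href="` all the way to the first `.pdf" ` it can reach. The sample markup below is an assumption, not the poster's page.

```php
<?php
// Hypothetical page: the first <a> is not a PDF link, the second one is.
$szDocument = '<a href="index.html" class="nav">Home</a>' . "\n"
            . '<p>Lots of other markup...</p>' . "\n"
            . '<a href="http://www.website.com/link/to/pdf.pdf" class="test">Content here</a>';

preg_match_all('/<a .*href="(.*\.pdf)" .*>/iUs', $szDocument, $aMatches);

// The capture begins right after the first href=" on the page, so it
// contains the unrelated markup that sits between the two links.
var_dump(strpos($aMatches[1][0], '<p>') !== false); // bool(true)
```

A negated character class such as `[^"]+` in place of `.*` cannot cross the closing quote of the attribute, which keeps each match inside a single `href`.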

  4. #4
    logic_earth
    PHP Code:
    $s = 'Bleh<a href="http://www.ffhfh.com/link/rtreytry/pdf.pdf" class="test">Content here</a><input>
    Bleh<a href="http://www.website.fh/link/to/pdf.pdf" class="test">Content here</a><input>
    Bleh<a href="http://www.gfhgfrtss.com/link/to/pdf.pdf" class="test">Content here</a><input>
    Bleh<a href="http://www.fgfh.com/link/to/pdf.pdf" class="test">Content here</a><input>'
    ;

    // [^"]+ cannot run past the closing quote, so each match stays inside one href
    preg_match_all('/href="([^"]+\.pdf)"/i', $s, $pdfs);
    $pdfs = $pdfs[1]; // keep only the captured URLs
    var_dump($pdfs);
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.
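
Since the original question also asked for the full path including the domain, here is a minimal sketch of turning the captured `href` values into absolute URLs. The helper name `absolutize_href`, the page URL, and the sample markup are all hypothetical. It assumes the page URL has a scheme and host, ignores ports, and does not normalise `../` or `./` segments.

```php
<?php
// Hypothetical helper, not from the thread: resolve an href found on
// $pageUrl into an absolute URL. Ports and "../" segments are not handled.
function absolutize_href(string $href, string $pageUrl): string
{
    // Already absolute: starts with a scheme such as http: or https:
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href;
    }

    $parts  = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // Protocol-relative link: //host/path
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    // Root-relative link: /path
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Relative link: resolve against the directory of the current page
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Assumed page URL and markup, for illustration only.
$pageUrl = 'http://www.website.com/docs/index.html';
$html    = '<a href="files/a.pdf">A</a> <a href="/b.pdf">B</a> '
         . '<a href="http://www.fgfh.com/link/to/pdf.pdf">C</a>';

preg_match_all('/href="([^"]+\.pdf)"/i', $html, $found);

$absolute = [];
foreach ($found[1] as $href) {
    $absolute[] = absolutize_href($href, $pageUrl);
}
print_r($absolute);
```

Resolving every link to one canonical form before the database check also means the same document is not stored twice under a relative and an absolute path.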


  5. #5
    silentcollision
    Awesome. Thank you both.

