Extracting Items From Webpage

Folks,

This is how I would extract links from a webpage:

<?php 

//Assuming your contents are in a variable called $contents
//New DOM Document
$document = new DOMDocument;
//Load HTML in $contents variable
$document->loadHTML($contents);
//Get all links
if($links = $document->getElementsByTagName('a')) {
    //Loop through all links
    foreach($links as $node) {
        //Get link location (href)
        $link_href = $node->getAttribute('href');
        //Get link text
        $link_text = $node->nodeValue;
    }
}

?> 

This is how I would extract images from a webpage:

<?php 

//Assuming your contents are in a variable called $contents
//New DOM Document
$document = new DOMDocument;
//Load HTML in $contents variable
$document->loadHTML($contents);
//Get all images
if($images = $document->getElementsByTagName('img')) {
    //Loop through all images
    foreach($images as $node) {
        //Get source of the image (src attribute)
        $img_src = $node->getAttribute('src');
        //Get alt text of the image (alt attribute)
        $img_alt = $node->getAttribute('alt');
    }
}

?> 

This is how I would extract JSON from a webpage:

<?php 

//Assuming your contents are in a variable called $contents
//Check if the JSON is valid
//Attempt to decode; return true for valid if no errors were found.
//Otherwise return false for an error
function checkIfJSONValid($t) {
    json_decode($t);
    if(json_last_error() == JSON_ERROR_NONE) {
        return true;
    }
    return false;
}
//Match all JSON-like {...} blocks and filter for valid JSON contents
$json_matches = array();
//(?R) recurses into the whole pattern so that nested braces are matched
$pattern = '/\{(?:[^{}]|(?R))*\}/x';
preg_match_all($pattern, $contents, $json_matches);
//$json_matches[0] holds the full-pattern matches
$json_valid = array_filter($json_matches[0], 'checkIfJSONValid');
//Loop through all valid JSON strings
foreach( $json_valid as $json ) {
    //Decode JSON
    //Second parameter specifies to use an associative array for the decoded JSON data
    $data = json_decode($json, true);
    //JSON is now in an array in the $data variable
}


?> 

But …

Q1. How do I extract an email address from the webpage?
Could you show me a code sample?

Q2. How do I extract the page title from the webpage?
Could you show me a code sample?

Q3. How do I extract the meta keywords from the webpage?
Could you show me a code sample?

Q4. How do I extract the meta description from the webpage?
Could you show me a code sample?

You may take my code above, modify it, and paste it here for us newbies to learn from. Do you see how similar my 3 code samples look? How about showing me 4 more similar samples to extract the 4 things I just mentioned?

  1. There’s no specific element for an email address, so you first have to determine where it is: in an anchor as a mailto: link, or in some text element.

  2. It’s in the title tag; just use getElementsByTagName accordingly.

  3. Like 2., but in the meta element. You may have to check the attributes too, using getAttribute().

  4. See 3.

No. Or at least, I will not. It’s not “learning” if you just copy some code instead of modifying it to your needs. It’s your task to show effort in solving your issues.
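
For anyone landing here later, the four hints in this reply could be put together roughly as below. This is only a sketch, not a definitive answer: the sample HTML in $contents is made up for illustration, the variable names ($page_title, $meta_keywords, $meta_description, $emails) are my own, and the email pattern is a deliberately simplified assumption that will miss unusual addresses.

```php
<?php

//Hypothetical sample page; in practice $contents holds the HTML you fetched
$contents = '<html><head><title>Example Page</title>'
    . '<meta name="keywords" content="php, dom, scraping">'
    . '<meta name="description" content="A sample page.">'
    . '</head><body>'
    . '<a href="mailto:alice@example.com?subject=Hi">Mail Alice</a>'
    . '<p>Or write to bob@example.org for details.</p>'
    . '</body></html>';

$document = new DOMDocument;
//Real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$document->loadHTML($contents);
libxml_clear_errors();

//Q2: the page title is the text of the first <title> element, if there is one
$page_title = null;
$titles = $document->getElementsByTagName('title');
if ($titles->length > 0) {
    $page_title = trim($titles->item(0)->textContent);
}

//Q3 and Q4: both live in <meta> elements, told apart by the name attribute
$meta_keywords = null;
$meta_description = null;
foreach ($document->getElementsByTagName('meta') as $node) {
    $name = strtolower($node->getAttribute('name'));
    if ($name === 'keywords') {
        $meta_keywords = $node->getAttribute('content');
    } elseif ($name === 'description') {
        $meta_description = $node->getAttribute('content');
    }
}

//Q1: emails have no element of their own, so check mailto: links and plain text
$emails = array();
foreach ($document->getElementsByTagName('a') as $node) {
    $href = $node->getAttribute('href');
    if (stripos($href, 'mailto:') === 0) {
        //Drop the "mailto:" scheme and any ?subject=... part
        $parts = explode('?', substr($href, 7), 2);
        $address = rawurldecode($parts[0]);
        if (filter_var($address, FILTER_VALIDATE_EMAIL) !== false) {
            $emails[] = $address;
        }
    }
}
//Simplified pattern (an assumption): catches common addresses only
$email_pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
$body = $document->getElementsByTagName('body')->item(0);
if ($body !== null && preg_match_all($email_pattern, $body->textContent, $found)) {
    $emails = array_merge($emails, $found[0]);
}
//Remove duplicates and reindex
$emails = array_values(array_unique($emails));
```

Each of $page_title, $meta_keywords and $meta_description stays null when the page lacks the corresponding element, so check for null before using them. The text search only looks at the visible body text, so addresses hidden in scripts or obfuscated (for example "bob at example dot org") will not be found.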
