Extracting Items From a Webpage


This is how I would extract links from a webpage:


//Assuming your contents are in a variable called $contents
//New DOM Document
$document = new DOMDocument;
//Load HTML in $contents variable (@ suppresses warnings from malformed markup)
@$document->loadHTML($contents);
//Get all links
if($links = $document->getElementsByTagName('a')) {
    //Loop through all links
    foreach($links as $node) {
        //Get link location (href)
        $link_href = $node->getAttribute('href');
        //Get link text
        $link_text = $node->nodeValue;
    }
}


This is how I would extract images from a webpage:


//Assuming your contents are in a variable called $contents
//New DOM Document
$document = new DOMDocument;
//Load HTML in $contents variable (@ suppresses warnings from malformed markup)
@$document->loadHTML($contents);
//Get all images
if($images = $document->getElementsByTagName('img')) {
    //Loop through all images
    foreach($images as $node) {
        //Get source of the image (src attribute)
        $img_src = $node->getAttribute('src');
        //Get alt text of the image (alt attribute)
        $img_alt = $node->getAttribute('alt');
    }
}


This is how I would extract JSON from a webpage:


//Assuming your contents are in a variable called $contents
//Check if the JSON is valid
//Attempt to decode; return true for valid if no errors were found.
//Otherwise return false for an error
function checkIfJSONValid($t) {
    json_decode($t);
    if(json_last_error() == JSON_ERROR_NONE) {
        return true;
    }
    return false;
}
//Match all JSON and filter for valid JSON contents
$json_matches = Array();
$pattern = '/\{(?:[^{}]|(?R))*\}/x';
preg_match_all($pattern, $contents, $json_matches);
//$json_matches[0] holds the full-pattern matches
$json_valid = array_filter($json_matches[0], 'checkIfJSONValid');
//Loop through all valid JSON strings
foreach( $json_valid as $json ) {
    //Decode JSON
    //Second parameter specifies to use an associative array for the decoded JSON data
    $data = json_decode($json, true);
    //JSON is now in an array in the $data variable
}


But …

Q1. How do I extract an email address from the webpage?
Could you show me a code sample?

Q2. How do I extract the page title from the webpage?
Could you show me a code sample?

Q3. How do I extract the meta keywords from the webpage?
Could you show me a code sample?

Q4. How do I extract the meta description from the webpage?
Could you show me a code sample?

Feel free to take my code above, modify it, and paste it here for us newbies to learn from. Do you see how similar my three snippets look? How about showing me four more similar snippets to extract the four things I just mentioned?

  1. There’s no specific element for an email address, so you first have to determine where it is: in an anchor as a mailto: link, or in some text element.

  2. It’s in the title tag; just use getElementsByTagName accordingly.

  3. Like 2., but in the meta element. You may have to check the attributes too: getAttribute().

  4. See 3.

No. Or at least, I will not. It’s not “learning” if you just copy some code instead of modifying it to your needs. It’s your task to show effort in solving your issues.
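
For readers who land here later, points 1 to 4 above can be sketched roughly as follows, assuming the page HTML is already in $contents as in the earlier snippets. The sample HTML, the variable names ($page_title, $meta_keywords, $meta_description, $emails) and the email regex are my own assumptions for illustration. The regex is deliberately simplified and will not match every valid address, so treat this as a starting point rather than a finished solution.

```php
<?php
//Hypothetical sample page; in practice $contents holds the fetched HTML
$contents = <<<'HTML'
<html>
<head>
<title>Example Page</title>
<meta name="keywords" content="php, dom, scraping">
<meta name="description" content="A small test page.">
</head>
<body>
<p>Write to <a href="mailto:info@example.com?subject=Hi">us</a> or to sales@example.com.</p>
</body>
</html>
HTML;

//New DOM Document; collect libxml warnings internally instead of printing them
$document = new DOMDocument;
libxml_use_internal_errors(true);
$document->loadHTML($contents);
libxml_clear_errors();

//Q2: page title (text of the first <title> element, if there is one)
$titles = $document->getElementsByTagName('title');
$page_title = $titles->length > 0 ? trim($titles->item(0)->textContent) : null;

//Q3 and Q4: loop through all <meta> elements and check the name attribute
$meta = [];
foreach ($document->getElementsByTagName('meta') as $node) {
    $name = strtolower($node->getAttribute('name'));
    if ($name === 'keywords' || $name === 'description') {
        $meta[$name] = $node->getAttribute('content');
    }
}
$meta_keywords = $meta['keywords'] ?? null;
$meta_description = $meta['description'] ?? null;

//Q1, case a: email addresses in mailto: links
$emails = [];
foreach ($document->getElementsByTagName('a') as $node) {
    $href = $node->getAttribute('href');
    if (stripos($href, 'mailto:') === 0) {
        //Drop the "mailto:" scheme and any ?subject=... part
        $emails[] = explode('?', substr($href, 7), 2)[0];
    }
}

//Q1, case b: email addresses in the page text (simplified pattern, an assumption)
$pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
$text = $document->documentElement ? $document->documentElement->textContent : '';
preg_match_all($pattern, $text, $text_matches);

//Merge both sources and remove duplicates
$emails = array_values(array_unique(array_merge($emails, $text_matches[0])));
```

A missing title or meta tag leaves the corresponding variable as null, so check for that before using the values.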

