How To Extract Page Title And Meta Keywords & Descriptions?

Folks,

I am trying to extract the page title and meta tags (meta keywords, meta description) from the downloaded page.
How do I do it?
OK, let us start off with this code, which extracts links from the downloaded page.

<?php 

//Code from: https://potentpages.com/web-crawler-development/tutorials/php/techniques 

//Assuming your contents are in a variable called $contents
//New DOM Document
$document = new DOMDocument;
//Load HTML in $contents variable
$document->loadHTML($contents);
//Get all links
if($links = $document->getElementsByTagName('a')) {
    //Loop through all links
    foreach($links as $node) {
        //Get link location (href)
        $link_href = $node->getAttribute('href');
        //Get link text
        $link_text = $node->nodeValue;
    }
}

?> 

To extract the title, I changed this line from:

if($title = $document->getElementsByTagName('a')) {

to this:

if($title = $document->getElementsByTagName('title')) {

And this:

$link_href = $node->getAttribute('href');

to this:

$title = $node->getAttribute('title');

Q1. Did I do this correctly?

Q2. Now help me extract the meta keywords and the meta description.
Are these OK?

<?php 

//Assuming your contents are in a variable called $contents
//New DOM Document
$document = new DOMDocument;
//Load HTML in $contents variable
$document->loadHTML($contents);
//Get all meta keywords
if($meta_keywords = $document->getElementsByTagName('meta keywords')) {
    //Loop through all links
    foreach($meta_keywords as $node) {
        //Get link location (href)
        $meta_keywords = $node->getAttribute('meta keywords');
        $meta_keywords = $node->nodeValue;
    }
}

?> 

What about this?

<?php 

//Code from: https://potentpages.com/web-crawler-development/tutorials/php/techniques 

//Assuming your contents are in a variable called $contents
//New DOM Document
$document = new DOMDocument;
//Load HTML in $contents variable
$document->loadHTML($contents);

if($meta_description = $document->getElementsByTagName('meta description')) {
    //Loop through all links
    foreach($meta_description as $node) {
        //Get link location (href)
        $meta_description = $node->getAttribute('meta description');
        $meta_description = $node->nodeValue;
    }
}

?> 

What do you say about these attempts?

I cannot test my code right now, as I would need cURL to fetch the page first before extracting the title and meta tags. I am struggling with the extraction part, which I mentioned here:
https://www.sitepoint.com/community/t/warning-domdocument-loadhtml-misplaced-doctype-declaration-in-entity/340499/3

Couldn’t you load from a file, just for testing purposes?

Frankly, I do not know the code to load from file. Or, I have forgotten it.

As you want to get the contents of a file, you could use file_get_contents().

https://www.php.net/manual/en/function.file-get-contents.php
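
As a rough sketch of what testing from a file could look like: the snippet below writes a small sample page to a temporary file, reads it back with file_get_contents(), and pulls out the title. The file name and the sample HTML are made up for the demo; in practice you would save a real page's source to the file instead.

```php
<?php
// Hypothetical test file in the system temp directory (name is an assumption).
$file = sys_get_temp_dir() . '/sample_page.html';

// Demo only: write a small sample page so the script is self-contained.
file_put_contents($file, '<html><head><title>Sample page</title>'
    . '<meta name="description" content="A sample description">'
    . '<meta name="keywords" content="php, dom, meta">'
    . '</head><body><p>Hello</p></body></html>');

// Read the whole file into a string, just as you would with a saved page.
$contents = file_get_contents($file);
if ($contents === false) {
    exit("Could not read $file");
}

$document = new DOMDocument;
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$document->loadHTML($contents);
libxml_clear_errors();

// Take the first <title> element, if there is one.
$titles = $document->getElementsByTagName('title');
$title = $titles->length > 0 ? trim($titles->item(0)->nodeValue) : '';

echo "Title: $title\n";
```

Once this works on a local file, the same parsing code should work unchanged on a string fetched with cURL.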

You mean like this:

<?php 

//CODE FROM: https://stackoverflow.com/questions/3711357/getting-title-and-meta-tags-from-external-website

function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

$html = file_get_contents_curl("http://google.com/");

//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

//get and display what you need:
$title = $nodes->item(0)->nodeValue;

$metas = $doc->getElementsByTagName('meta');

for ($i = 0; $i < $metas->length; $i++)
{
    $meta = $metas->item($i);
    if($meta->getAttribute('name') == 'description')
        $description = $meta->getAttribute('content');
    if($meta->getAttribute('name') == 'keywords')
        $keywords = $meta->getAttribute('content');
}

echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";

?>

I am getting errors, even though the variables are defined:

Title: Google

Notice : Undefined variable: description in C:\xampp\htdocs\work\extract_metas.php on line 40
Description:

Notice : Undefined variable: keywords in C:\xampp\htdocs\work\extract_metas.php on line 41
Keywords:

They’re only defined if a matching meta tag is found and $meta->getAttribute() runs; otherwise they are never assigned. From this you can gather that the page has no meta tags with those names.
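
One way to avoid the notices, sketched under the assumption that an empty string is an acceptable default, is to give both variables a value before the loop. The sample HTML below is made up and deliberately has no description or keywords tags.

```php
<?php
// Sample HTML with no description/keywords meta tags (made up for the demo).
$html = '<html><head><title>Example</title><meta charset="utf-8"></head><body></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Defaults, so the variables exist even when no matching tag is found.
$description = '';
$keywords = '';

foreach ($doc->getElementsByTagName('meta') as $meta) {
    // Compare case-insensitively, since some pages write name="Description".
    $name = strtolower($meta->getAttribute('name'));
    if ($name === 'description') {
        $description = $meta->getAttribute('content');
    } elseif ($name === 'keywords') {
        $keywords = $meta->getAttribute('content');
    }
}

echo "Description: $description\n";
echo "Keywords: $keywords\n";
```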

Folks,

I am spoilt for choice, as the folks at Stack Overflow have differing opinions. And so I’d rather stick to your opinions. Which code from Stack Overflow do you reckon is solid, so there will be no misses when grabbing the page title and meta tags?

Whichever one works most reliably when you test it?

I am not testing it, because it is obvious that code sometimes works on one website to grab its title etc. but does not work on another. I can’t test each piece of code on each and every website in the world, so I prefer PHP experts’ advice on which bit of code they deem best. I’d rather trust your opinion once you’ve looked over each sample. The guys at Stack Overflow cannot come to a single decision. I can’t rely on them here, I’m afraid, so it is best that I rely on folks here.

The thing is, though, SO isn’t that different from here (or any other discussion forum), in that anyone can join up and post, and in that there are plenty of subjects where you’ll see disagreement from posters on the “best” way to do things. Not everything, obviously there are things that are just plain wrong, but many things have several equally good solutions that some will prefer over others.

I can’t comment further on your subject because I have no direct experience of it. Whatever method you pick, you’ll probably run into a site that breaks it. Especially if you don’t test it and just rely on some stranger to tell you what is best.

If you can’t be bothered to test it, don’t expect others to help then.

Mmm. So you never built even a teeny-weeny meta extractor while you were learning to fetch pages with cURL?

Actually, I’ve now been testing the 3 code samples that deal with DOM, and none of them work.
And so I ask: just by looking at the code samples over there, which ones would you rank highly? I’ll try to get those to work.

No, just isn’t the kind of thing I’ve ever needed to do. I did have a use for something similar, but it proved to be easier to write a browser extension to parse the DOM, for various reasons.

Ok. I understand.
But see if you can work out why this code is not echoing anything on screen and why it is showing a blank page ...

<?php 

$url = "http://google.com"; 

function fetch_meta_tags($html) { 

    $html = curl_get_contents($url); 
    $mdata = array(); 

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $titlenode = $doc->getElementsByTagName('title'); 
    $title = $titlenode->item(0)->nodeValue;

    $metanodes = $doc->getElementsByTagName('meta'); 
    foreach($metanodes as $node) { 
    $key = $node->getAttribute('name'); 
    $val = $node->getAttribute('content'); 
    if (!empty($key)) { $mdata[$key] = $val; } 
    }

    $res = array($url, $title, $mdata); 

    return $res;
	//I ADDED THESE LINES OF CODE BECAUSE NOTHING WAS GETTING ECHOED. WAS SEEING BLANK PAGE ....
	echo "Title: $title". '<br/><br/>';
	echo "$mdata"; 
	echo "$res";
	echo "$val";
	echo ""; 
}

?>

Code from Stack Overflow.

Open up google.com in your browser, right-click and “view source”, in all that js code can you make out any of the attributes you’re searching for? If I search for “meta name” in the source, there are no results. So that would be a reason it doesn’t work on that site. Presumably they’re using JS to set the page title. They probably don’t feel the need to set meta keywords - does anyone still use those for anything?

In any case, try a different site, one that you know has got those attributes set.

ETA - the reason your page is blank is because all the work is inside a function, and you never call the function. You just echo some variables that you never created.

In these lines:

$url = "http://google.com"; 

function fetch_meta_tags($html) { 

    $html = curl_get_contents($url); 

you assign a value to $url and define a function that takes a parameter called $html. Inside the function you use $url again, but even if you did call the function, it would be empty, because the one you defined outside the function is not in scope. And where do you define the curl_get_contents function?

The variables you echo only exist inside the function, so they have no value at the point that you echo them. Have a read up on functions, passing data into them, and getting the return values from them.
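
To illustrate the points about scope and return values, here is a minimal sketch, not a drop-in fix. The function only uses what is passed in, returns its results, and is actually called. The sample HTML stands in for a fetched page and is made up for the demo.

```php
<?php
// Takes the HTML as a parameter and returns the results.
// Nothing defined outside the function is used inside it.
function fetch_meta_tags(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $titles = $doc->getElementsByTagName('title');
    $title = $titles->length > 0 ? $titles->item(0)->nodeValue : '';

    $mdata = array();
    foreach ($doc->getElementsByTagName('meta') as $node) {
        $key = $node->getAttribute('name');
        if ($key !== '') {
            $mdata[$key] = $node->getAttribute('content');
        }
    }

    return array($title, $mdata);
}

// Sample HTML standing in for a fetched page (an assumption for this demo).
$html = '<html><head><title>Demo</title>'
    . '<meta name="keywords" content="a, b"></head><body></body></html>';

// The function has to be called, and its return value captured,
// before there is anything to echo.
list($title, $mdata) = fetch_meta_tags($html);

echo "Title: $title\n";
print_r($mdata);
```

Anything placed after a return statement inside a function never runs, which is why the echo lines in the posted code could not have printed anything even if the function had been called.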

Hmm, exactly like that? That’s a problem with relying on code from random strangers. Yes, including me.

I would hazard a guess that the URLs tested return a 301 HTTP response redirecting to the more secure HTTPS URL of the same site.

Try this:

https://supiet2.tk/

Uh, Mr John,

You do realize your app is vulnerable to an XSS attack and that I can see your php code in many files right?

* By the way, your vd function is rather funny.

Unfortunately I do not and would be grateful to learn how to make the site more secure.

By the way, your vd function is rather funny.

My typing skills are not good, which accounts for renaming and enhancing the frequently called var_dump function. I find typing vd($var); so much easier, and especially easier to remember :)

  1. NEVER trust user input
  2. NEVER trust user input
  3. NEVER trust user input
  4. Turn off error reporting on a public server.
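
As a rough illustration of those points, and only a sketch: the helper names clean_url() and e() below are made up, and the hard-coded $input stands in for something like $_GET['url']. The idea is to validate input before using it, escape it before echoing it, and log errors rather than display them.

```php
<?php
// On a public server, log errors instead of displaying them to visitors.
ini_set('display_errors', '0');
ini_set('log_errors', '1');

// Hypothetical helper: accept only well-formed http/https URLs.
function clean_url(string $input): ?string
{
    $url = filter_var(trim($input), FILTER_VALIDATE_URL);
    if ($url === false) {
        return null;
    }
    // Reject schemes such as javascript: or file:.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, array('http', 'https'), true) ? $url : null;
}

// Hypothetical helper: escape user-supplied text before echoing it into HTML.
function e(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Stands in for user input such as $_GET['url'].
$input = '<script>alert(1)</script>';

$url = clean_url($input);
if ($url === null) {
    echo 'Invalid URL: ' . e($input);
} else {
    echo 'Fetching ' . e($url);
}
```

This does not cover everything a URL-fetching script needs (for example, blocking requests to internal addresses), but it closes the most obvious XSS hole.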

It would be a good idea to take the script offline until you fix it. There are quite a number of malicious things a bad actor could do with it.

* The funny part about the vd function is what you put for the $val parameter default.
