Need to retrieve an element of an HTML page

schwim · February 14, 2019, 2:44am

Hi there everyone!

I need to retrieve the latest version of an application on a download page:

But I’m not quite sure how to go about doing this. I imagine it would start with a curl retrieval but once I have the code, how do I go about getting that first version number?

Any thoughts or suggestions on the matter would be most welcome.

Thanks for your time!

John_Betong · February 14, 2019, 3:44am

Right click the web-page and select “View source”:

<?php 
// Page source:
$url = 'https://runtime.fivem.net/.../';
$src = file_get_contents($url);
print_r($src); // prints the raw page source; $src holds the string to parse

The returned string can be parsed using PHP string functions

I would start by using:

1. $str = strstr($src, 'Parent directory/' );
a. echo $str ."\n";
2. $str = substr($str, 0, 420);
a. echo $str ,"\n";
...
...
...
$path= $str .'server.zip';

Once you have the path, you can use PHP's curl functions to retrieve $path/server.zip

Edit:

I just had a thought…
if the path was generated by PHP then you could use the following:

<?php 
//
$tmp  = 'https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/';
$path = PHP_functionToGenerateLatestPath(...);
$file = 'server.zip';

$url = $tmp . $path . $file;

rpkamp · February 14, 2019, 7:27am

I’d actually use the DOM to query for the correct cell and then get the value of that cell:

<?php

// loadHTML() is an instance method; calling it statically fails on PHP 8.
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents(__DIR__ . '/index.html'));

$xpath = new DOMXPath($dom);

$nodes = $xpath->query('//tr[2]/td[1]');

echo $nodes[0]->nodeValue . 'server.zip';

In the example I’ve downloaded the file and am reading it from disk, but downloading it and parsing that is not that hard. I’ll leave that as an exercise to the reader.

Also, this code represents only the happy path; you’d have to add guards for unexpected stuff (node cannot be found, etc).
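
A minimal sketch of what those guards could look like, assuming the same `//tr[2]/td[1]` layout. The helper name `firstCellText` and the inline `$html` sample are made up for illustration; they are not taken from the real page.

```php
<?php

// Hypothetical helper: returns the text of the first matching cell, or null.
function firstCellText(string $html, string $query): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse (download failed or empty page)
    }

    $dom = new DOMDocument();
    // Suppress warnings about imperfect real-world markup.
    if (!@$dom->loadHTML($html)) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);

    // query() returns false for an invalid expression, or an empty list.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->nodeValue);
}

// Made-up sample standing in for the downloaded page.
$html = '<table><tr><td>Parent directory/</td></tr>'
      . '<tr><td>1045-abc123/</td></tr></table>';

$cell = firstCellText($html, '//tr[2]/td[1]');
echo $cell === null ? 'Cell not found' : $cell . 'server.zip', "\n";
```

Returning null instead of touching `$nodes[0]` directly means a changed page layout produces a "not found" message rather than a PHP error.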

m_hutley · February 14, 2019, 9:42am

Standard Disclaimer: Only use these tools on sites you have permission to do so after checking their terms and conditions.

Assuming the structure of the page remains the same, and that the desired link is always the top one, then it would be the 8th <a> tag on the page. Or maybe the 5th. Can’t tell if those arrows are separate links or not.

schwim · February 14, 2019, 12:58pm

Hi there guys and thanks very much for the help!

I just read back on my OP and realized I didn’t explain myself clearly.

I do not need the zip file itself. I only need to know which version is the latest, which is the highlighted number in the image. I need to get that 1045 into a var.

Would these suggestions still be usable for what I’m trying to do?

m_hutley · February 14, 2019, 1:03pm

Take the text content of the link (@rpkamp’s post #3), and instead of substr’ing it (@John_Betong’s post #2), explode it on the hyphen, and take the first element of the resulting array.

schwim · February 14, 2019, 1:34pm

I got the version!

<?php

// loadHTML() is an instance method; calling it statically fails on PHP 8.
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents(__DIR__ . '/fivem.html'));
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//tr[2]/td[1]');
$version = explode("-", $nodes[0]->nodeValue);
echo $version[0];

?>

Is this an ok way to have handled it or am I overlooking something(s)?

m_hutley · February 14, 2019, 1:46pm

Other than the caution that rpkamp mentioned about ‘unexpected stuff’, looks fine to me.

rpkamp · February 14, 2019, 2:12pm

Looks fine. Just take care that it won’t work when allow_url_fopen is disabled on a host, so it would be more portable if you used something like curl.

schwim · February 14, 2019, 3:11pm

I don’t see where allow_url_fopen is being used here, do you mean when I write the portion to retrieve the HTML?

rpkamp · February 14, 2019, 3:12pm

allow_url_fopen is a php.ini setting.

When it’s set to 0 then file_get_contents refuses to fetch URLs.

schwim · February 14, 2019, 3:17pm

Ahh, I see, file_get_contents needs to be changed. Will do that and post the updated code.

schwim · February 14, 2019, 3:22pm

Is it less likely for a server to have curl disabled than it is allow_url_fopen? I ask because I wonder if it would be worth the effort to have a fallback of trying allow_url_fopen if curl fails or if curl is common / expected.

$ch = curl_init("https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
$content = curl_exec($ch);
curl_close($ch);

// loadHTML() is an instance method; calling it statically fails on PHP 8.
$dom = new DOMDocument();
@$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//tr[2]/td[1]');
$version = explode("-", $nodes[0]->nodeValue);
echo $version[0];
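
On the fallback question, here is a rough sketch of checking at runtime which option the host offers, rather than guessing which is more common. The helper names `pickTransport` and `fetchPage` are hypothetical. Note that it chooses by availability only; it does not retry with file_get_contents when a curl request itself fails.

```php
<?php

// Hypothetical helper: decide which transport is usable on this host.
function pickTransport(bool $hasCurl, bool $allowUrlFopen): ?string
{
    if ($hasCurl) {
        return 'curl';
    }
    return $allowUrlFopen ? 'fopen' : null;
}

// Hypothetical helper: fetch a URL with whichever transport is available.
function fetchPage(string $url): ?string
{
    $transport = pickTransport(
        extension_loaded('curl'),
        filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)
    );

    if ($transport === 'curl') {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch); // false on failure
        // The handle is released when $ch goes out of scope.
    } elseif ($transport === 'fopen') {
        $content = @file_get_contents($url); // false on failure
    } else {
        $content = false; // neither option is enabled
    }

    return is_string($content) ? $content : null;
}
```

Calling `fetchPage($url)` then gives either the HTML or null, so the parsing code only has one failure case to check.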

droopsnoot · February 14, 2019, 6:58pm

The only issue I see with this is that you’re relying on the latest one being at the top of the list. Is that something you have control over, or could the server owner change that without warning? I think I’d be tempted to get all the entries and their corresponding dates, then find the latest date, and get the version associated with it.

But of course if you’re sure it will always be at the top, that’s fine.
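
A sketch of that "find the latest date" idea. It assumes each row has the name in the first cell and a date that strtotime() can parse in the second; the sample `$html` and the helper name `latestVersionByDate` are invented for illustration, so the cell positions would need checking against the real listing.

```php
<?php

// Hypothetical helper: returns the version whose row has the newest date.
function latestVersionByDate(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $dom = new DOMDocument();
    if (!@$dom->loadHTML($html)) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $latestTime = null;
    $latestVersion = null;

    foreach ($xpath->query('//tr') as $row) {
        $cells = $xpath->query('td', $row);
        if ($cells->length < 2) {
            continue;
        }
        $name = trim($cells->item(0)->nodeValue);
        $time = strtotime(trim($cells->item(1)->nodeValue));

        // Skip rows without a parsable date or a "version-hash" style name.
        if ($time === false || strpos($name, '-') === false) {
            continue;
        }
        if ($latestTime === null || $time > $latestTime) {
            $latestTime = $time;
            $latestVersion = explode('-', $name)[0];
        }
    }

    return $latestVersion;
}

// Made-up listing, deliberately not sorted by date.
$html = '<table>'
      . '<tr><td>Parent directory/</td><td>-</td></tr>'
      . '<tr><td>1040-aaa/</td><td>2019-02-01 10:00</td></tr>'
      . '<tr><td>1045-bbb/</td><td>2019-02-13 09:30</td></tr>'
      . '<tr><td>1043-ccc/</td><td>2019-02-07 18:15</td></tr>'
      . '</table>';

echo latestVersionByDate($html), "\n";
```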

Mittineague · February 14, 2019, 7:27pm

That looks to be the current default order. Maybe just to make sure the query string bit could be added? i.e.
https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/?C=M&O=D

C, M, O, and D don’t give much idea as to what they are, and I guess those too could be changed at any time. The numbering and dates look to be incremental, maybe putting a “new is greater than last known” check in there somewhere would be a good idea.
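
The “new is greater than last known” check could be as small as the sketch below. The helper name `isNewerVersion` and the stored value 1040 are assumptions for the example; where the last known version is kept is up to you.

```php
<?php

// Hypothetical helper: accept a scraped version only if it is numeric and newer.
function isNewerVersion(string $scraped, int $lastKnown): bool
{
    // ctype_digit() rejects empty strings and anything that is not all digits.
    if (!ctype_digit($scraped)) {
        return false;
    }
    return (int) $scraped > $lastKnown;
}

$lastKnown = 1040; // made-up value; would come from wherever you store it

var_dump(isNewerVersion('1045', $lastKnown));              // bool(true)
var_dump(isNewerVersion('Parent directory/', $lastKnown)); // bool(false)
```

This also catches the case where the page layout changes and the scrape returns something that is not a version number at all.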

John_Betong · February 15, 2019, 4:57pm

I was curious and after further investigation discovered:

the link supplied shows a directory listing
a. Apache creates a web-page with a table of the directory contents
b. directory contents can be file names, dates, links to sub-directories, etc
c. Apache uses a style sheet to format the content
d. Apache lists the contents in a specified order.

I used a couple of browsers to view the directory listing and they were all very similar so…

Using PHP to extract the latest version number after Parent directory/ string followed by href=

<?php 
declare(strict_types=1);

echo '<br><br> line: ',__line__, ' ==> ',
$url = 'https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/';

$result = file_get_contents($url);

$tmp 	= strstr($result, 'Parent directory/');
$tmp 	= htmlspecialchars($tmp);
$tmp 	= strstr( $tmp, 'href=');

echo '<br><br> line: ',__line__, ' ==> ',
$tmp 	= strstr( $tmp, '-', true );

echo '<br><br> line: ',__line__, ' ==> ',
$tmp 	= substr( $tmp, 11  ); // skip href=&quot; (11 characters, because htmlspecialchars turned " into &quot;)


schwim · February 19, 2019, 10:27pm

Hi there John and thanks so much for the help!

I took your fantastic code and made a function of it. The function works if I comment the strict_types declaration but will not work if it’s included.

I looked up the strict_types declaration and don’t fully understand it. The function seems to work just fine without it, but I wanted to ask its purpose, whether it’s truly necessary in my case and, if so, what I can do to use it in the function?

function versionGrab($os){

	//declare(strict_types=1);

	if($os == 'windows'){
		$url = 'https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/';
	}else{
		$url = 'https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/';
	}

	$ch = curl_init($url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
	$content = curl_exec($ch);
	curl_close($ch);

	$tmp 	= strstr($content, 'Parent directory/');
	$tmp 	= htmlspecialchars($tmp);
	$tmp 	= strstr( $tmp, 'href=');
	$link 	= strstr( $tmp, '/', true );
	$link 	= substr( $link, 11  ); // skip href=&quot; (11 characters, because htmlspecialchars turned " into &quot;)
	$return['link'] = $url.$link;
	$return['version'] = strstr( $link, '-', true );
	
	return $return;
	
}

$latest = versionGrab('windows');

echo '<a href="'.$latest['link'].'/server.zip">'.$latest['version'].'</a>';

John_Betong · February 20, 2019, 5:54am

Try adding these lines to the file, because it looks as though your php.ini has error display set to Off.

With these settings, errors and warnings are displayed in the browser rather than only being logged in the /log/errors.log file.

<?php 
  declare(strict_types=1); // must be the first declaration
  error_reporting(-1); // maximum error reporting
  ini_set('display_errors', 'true'); // display results to screen instead of errors.log

// remaining script

One reason the script works when declare(…) is removed or commented out is that the declaration must be the very first statement in the file, so it cannot go inside a function. Your php.ini defaults are preventing the errors from showing in the browser. That is the usual setting online, because displayed errors could show users sensitive information, passwords, etc.
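
As for what the declaration is for: with strict_types=1, PHP stops silently converting scalar arguments (such as the string '21' to an int) for function calls made from that file and throws a TypeError instead. It is optional, so the function above works without it. A small sketch, with a made-up function `doubleIt`:

```php
<?php
declare(strict_types=1); // must be the very first statement in the file

// Hypothetical function with a typed parameter.
function doubleIt(int $n): int
{
    return $n * 2;
}

try {
    // A string is passed where an int is declared.
    // Without strict_types this would be converted and return 42.
    $result = doubleIt('21');
} catch (TypeError $e) {
    $result = 'TypeError';
}

echo $result, "\n"; // prints "TypeError" because strict_types is on
```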

system · May 22, 2019, 12:54pm

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.
