Need to retrieve an element of an HTML page

#1

Hi there everyone!

I need to retrieve the latest version of an application on a download page:

But I’m not quite sure how to go about doing this. I imagine it would start with a curl retrieval but once I have the code, how do I go about getting that first version number?

Any thoughts or suggestions on the matter would be most welcome.

Thanks for your time!

#2

Right click the web-page and select “View source”:

<?php 
// Page source:
$url = 'https://runtime.fivem.net/.../';
$src = file_get_contents($url);
$str = print_r($src, true); // the second argument makes print_r() return the string instead of printing it

The returned string can be parsed using PHP string functions

I would start by using:

1. $str = strstr($src, 'Parent directory/' );
a. echo $str ."\n";
2. $str = substr($str, 0, 420);
a. echo $str ,"\n";
...
...
...
$path= $str .'server.zip';



Once you have retrieved the path, you can use PHP’s cURL functions to retrieve $path/server.zip

Edit:

I just had a thought…
if the path was generated by PHP then you could use the following:

<?php 
// PHP_functionToGenerateLatestPath() is a placeholder name, not a real function
$tmp  = 'https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/';
$path = PHP_functionToGenerateLatestPath(...);
$file = 'server.zip';

$url = $tmp . $path . $file;

#3

I’d actually use the DOM to query for the correct cell and then get the value of that cell:

<?php

$dom = new DOMDocument();
// loadHTML() is an instance method; @ hides warnings about imperfect markup
@$dom->loadHTML(file_get_contents(__DIR__ . '/index.html'));

$xpath = new DOMXPath($dom);

$nodes = $xpath->query('//tr[2]/td[1]');

echo $nodes[0]->nodeValue . 'server.zip';

In the example I’ve downloaded the file and am reading it from disk, but downloading it and parsing that is not that hard. I’ll leave that as an exercise to the reader :wink:

Also, this code represents only happy path, you’d have to add any guards for unexpected stuff (node cannot be found, etc).
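
A minimal sketch of what such guards might look like. The function name firstCellText and the sample markup are made up for illustration, and the real page's structure may differ:

```php
<?php
// Hypothetical helper: returns the text of the first cell in the second
// table row, or null if there is nothing to parse or the cell is missing.
function firstCellText(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse (e.g. the download failed)
    }

    $dom = new DOMDocument();
    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($loaded === false) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//tr[2]/td[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null; // the expected cell is not there
    }

    return trim($nodes->item(0)->nodeValue);
}

// Assumed sample markup, not the real listing.
$sample = '<table><tr><td>Parent directory/</td></tr>'
        . '<tr><td>1045-abcdef/</td></tr></table>';

var_dump(firstCellText($sample));            // text of the second row's first cell
var_dump(firstCellText('<p>no table</p>'));  // NULL
```

Returning null rather than throwing keeps the caller simple: it only has to check for null before using the value.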

#4

Standard Disclaimer: Only use these tools on sites you have permission to do so after checking their terms and conditions.

Assuming the structure of the page remains the same, and that the desired link is always the top one, then it would be the 8th <a> tag on the page. Or maybe the 5th. Can’t tell if those arrows are separate links or not.

#5

Hi there guys and thanks very much for the help!

I just read back on my OP and realized I didn’t explain myself clearly.

I do not need the zip file itself. I only need to know which version is the latest, which is the highlighted number in the image. I need to get that 1045 into a var.

Would these suggestions still be usable for what I’m trying to do?

#6

Take the text content of the link (@rpkamp’s post #3) and, instead of substr’ing it (@John_Betong’s post #2), explode it on the hyphen and take the first element of the resulting array.

#7

I got the version!

<?php

$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents(__DIR__ . '/fivem.html')); // loadHTML() is an instance method
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//tr[2]/td[1]');
$version = explode("-", $nodes[0]->nodeValue);
echo $version[0];

?>

Is this an ok way to have handled it or am I overlooking something(s)?

#8

Other than the caution that rpkamp mentioned about ‘unexpected stuff’, looks fine to me.

#9

Looks fine. Just take care that it won’t work when allow_url_fopen is disabled on a host, so it would be more portable if you used something like curl.

#10

I don’t see where allow_url_fopen is being used here, do you mean when I write the portion to retrieve the HTML?

#11

allow_url_fopen is a php.ini setting.

When it’s set to 0 then file_get_contents refuses to fetch URLs.
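
Both conditions can be checked at runtime. A rough sketch, where chooseTransport and fetchPage are made-up names and preferring cURL over file_get_contents is just one possible choice:

```php
<?php
// Hypothetical helper: decide how to fetch a URL on this host.
function chooseTransport(bool $hasCurl, bool $urlFopenAllowed): ?string
{
    if ($hasCurl) {
        return 'curl';
    }
    if ($urlFopenAllowed) {
        return 'fopen';
    }
    return null; // neither is available
}

// Hypothetical helper: returns the page body, or null on failure.
function fetchPage(string $url): ?string
{
    $transport = chooseTransport(
        function_exists('curl_init'),
        // ini_get() returns a string such as "1", "On" or ""
        filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)
    );

    if ($transport === 'curl') {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch); // false on failure
        return is_string($body) ? $body : null;
    }

    if ($transport === 'fopen') {
        $body = @file_get_contents($url); // false on failure
        return $body === false ? null : $body;
    }

    return null;
}

var_dump(chooseTransport(false, true)); // string(5) "fopen"
```

Keeping the decision in its own small function makes it easy to check without touching the network.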

#12

Ahh, I see, file_get_contents needs to be changed. Will do that and post the updated code.

#13

Is a server less likely to have curl disabled than allow_url_fopen? I ask because I wonder whether it would be worth the effort to fall back to file_get_contents if curl is unavailable, or whether curl is common enough to be expected.

$ch = curl_init("https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
// CURLOPT_BINARYTRANSFER has had no effect since PHP 5.1.3, so it is omitted
$content = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($content); // loadHTML() is an instance method, not a static one
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//tr[2]/td[1]');
$version = explode("-", $nodes[0]->nodeValue);
echo $version[0];
#14

The only issue I see with this is that you’re relying on the latest one being at the top of the list. Is that something you have control over, or could the server owner change that without warning? I think I’d be tempted to get all the entries and their corresponding dates, then find the latest date, and get the version associated with it.

But of course if you’re sure it will always be at the top, that’s fine.
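
A sketch of the "find the latest date" idea, for illustration only. The function name latestVersionByDate is made up, and the sample rows assume that the first column holds a name like 1045-<hash>/ and the second a date that strtotime() understands; the real listing may be laid out differently:

```php
<?php
// Hypothetical helper: scan every table row and return the version number
// of the entry with the newest date, or null if no usable row is found.
function latestVersionByDate(string $html): ?int
{
    if (trim($html) === '') {
        return null;
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $latestTime = null;
    $latestVersion = null;

    foreach ($xpath->query('//tr') as $row) {
        $cells = $xpath->query('td', $row);
        if ($cells->length < 2) {
            continue; // not a name/date row
        }
        $name = trim($cells->item(0)->nodeValue);
        $time = strtotime(trim($cells->item(1)->nodeValue));
        // Skip rows without a parsable date or a "digits-" name prefix.
        if ($time === false || !preg_match('/^(\d+)-/', $name, $m)) {
            continue;
        }
        if ($latestTime === null || $time > $latestTime) {
            $latestTime = $time;
            $latestVersion = (int) $m[1];
        }
    }

    return $latestVersion;
}

// Made-up sample rows, deliberately out of order.
$sample = '<table>'
    . '<tr><td>Parent directory/</td><td>-</td></tr>'
    . '<tr><td>1032-aaa/</td><td>2019-01-02 10:00</td></tr>'
    . '<tr><td>1045-ccc/</td><td>2019-01-20 09:30</td></tr>'
    . '<tr><td>1040-bbb/</td><td>2019-01-11 18:45</td></tr>'
    . '</table>';

echo latestVersionByDate($sample); // 1045
```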

#15

That looks to be the current default order. Maybe just to make sure the query string bit could be added? i.e.
https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/?C=M&O=D

C, M, O, and D don’t give much idea as to what they are, and I guess those too could be changed at any time. The numbering and dates look to be incremental, maybe putting a “new is greater than last known” check in there somewhere would be a good idea.
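
The “new is greater than last known” check could be as small as the sketch below. The function name isNewerVersion is made up, and where the last known version is stored (a file, a database) is left open:

```php
<?php
// Hypothetical helper: accept a scraped version only if it is a plain
// number and greater than the last version already known.
function isNewerVersion(string $scraped, int $lastKnown): bool
{
    if (!ctype_digit($scraped)) {
        return false; // not a number: the page layout may have changed
    }
    return (int) $scraped > $lastKnown;
}

$lastKnown = 1045; // e.g. loaded from wherever the previous result was saved

var_dump(isNewerVersion('1050', $lastKnown));             // bool(true)
var_dump(isNewerVersion('1032', $lastKnown));             // bool(false)
var_dump(isNewerVersion('Parent directory', $lastKnown)); // bool(false)
```

Rejecting anything that is not purely digits also catches the case where the sort order or markup changes and the scraper picks up the wrong cell.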

#16

I was curious and after further investigation discovered:

  1. the link supplied shows a directory listing
    a. Apache creates a web-page with a table of the directory contents
    b. directory contents can be file names, dates, links to sub-directories, etc
    c. Apache uses a style sheet to format the content
    d. Apache lists the contents in a specified order.

I used a couple of browsers to view the directory listing and they were all very similar so…

The PHP below extracts the latest version number: it finds the 'Parent directory/' string, then the next href= after it, and keeps the text up to the first hyphen.

<?php 
declare(strict_types=1);

echo '<br><br> line: ',__line__, ' ==> ',
$url = 'https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/';

$result = file_get_contents($url);

$tmp 	= strstr($result, 'Parent directory/');
$tmp 	= htmlspecialchars($tmp);
$tmp 	= strstr( $tmp, 'href=');

echo '<br><br> line: ',__line__, ' ==> ',
$tmp 	= strstr( $tmp, '-', true );

echo '<br><br> line: ',__line__, ' ==> ',
$tmp 	= substr( $tmp, 11  ); // skip 'href=' (5 chars) and '&quot;' (6 chars), since htmlspecialchars() turned " into &quot;



#17

Hi there John and thanks so much for the help!

I took your fantastic code and made a function of it. The function works if I comment out the strict_types declaration but will not work if it’s included.

I looked up the strict_types declaration and don’t fully understand it. The function seems to work just fine without it, but I wanted to ask about its purpose, whether it’s truly necessary in my case and, if so, what I can do to use it in the function.

function versionGrab($os){

	//declare(strict_types=1); // only valid as the first statement of the file, not inside a function

	if($os == 'windows'){ // quoted: a bare windows is treated as an undefined constant
		$url = 'https://runtime.fivem.net/artifacts/fivem/build_server_windows/master/';
	}else{
		$url = 'https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/';
	}

	$ch = curl_init($url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	// CURLOPT_BINARYTRANSFER has had no effect since PHP 5.1.3, so it is omitted
	$content = curl_exec($ch);
	curl_close($ch);

	$tmp 	= strstr($content, 'Parent directory/');
	$tmp 	= htmlspecialchars($tmp);
	$tmp 	= strstr( $tmp, 'href=');
	$link 	= strstr( $tmp, '/', true );
	$link 	= substr( $link, 11  ); // skip 'href=' (5 chars) and '&quot;' (6 chars)
	$return = [];
	$return['link'] = $url.$link;
	$return['version'] = strstr( $link, '-', true );
	
	return $return;
	
}

$latest = versionGrab('windows');

echo '<a href="'.$latest['link'].'/server.zip">'.$latest['version'].'</a>';
#18

Try adding these lines to the top of the file, because it looks as though your php.ini file has error display set to Off.

Please note that with these lines, errors and warnings should be displayed in the browser rather than being logged in the /log/errors.log file.

<?php 
  declare(strict_types=1); // must be the first declaration
  error_reporting(-1); // maximum error reporting
  ini_set('display_errors', 'true'); // display results to screen instead of errors.log

// remaining script

The script only works when declare(…) is removed or commented out because the declaration must be the first statement in the file; it cannot be placed inside a function. Your php.ini defaults are preventing the errors from showing in the browser. That is the usual default for online servers, because displayed errors could show users sensitive information, passwords, etc.
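
As for what strict_types is for, here is a small self-contained sketch. The function timesTwo is made up purely to show the effect: with strict typing on, PHP stops silently converting a numeric string to an int when calling a function with a typed parameter.

```php
<?php
declare(strict_types=1); // must be the very first statement in the file

// Hypothetical function used only to demonstrate strict typing.
function timesTwo(int $n): int
{
    return $n * 2;
}

echo timesTwo(21), "\n"; // 42

try {
    // Without strict_types this call would quietly convert '21' to 21.
    echo timesTwo('21'), "\n";
} catch (TypeError $e) {
    echo "TypeError: a string was passed where an int is required\n";
}
```

Nothing in the version-grabbing function depends on this, so leaving the declaration out is fine; it mainly helps catch accidental type mix-ups early.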
