PHP or JS - Get dynamically generated data from other domain

ennaido · September 14, 2020, 7:47am

Hi,

There’s a site which updates a number on a daily basis, and I want to grab that number via a cron script that will run once a day. That site’s content is produced dynamically via JS, so the element holding the number does not appear in the page source (“Ctrl + U”).

Is it possible to somehow grab that number using PHP and/or JS?

I know file_get_contents(), but it does not work in this case.

Thanks.

James_Hibbard · September 14, 2020, 8:12am

You can scrape dynamically generated content with puppeteer.

https://manuelhans.com/blog/2020/01/17/scraping-a-dynamic-web-page-using-puppeteer/

It should also be possible to run that via a cron job, but be aware that it requires a Node.js runtime.
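A minimal sketch of what that could look like, assuming puppeteer has been installed with npm. The URL and the CSS selector are placeholders (the real selector has to be found by inspecting the rendered page), and parseNumber is just a hypothetical helper for turning the element's text into a number.

```javascript
// Sketch only: the URL and selector used in the example call are placeholders.
// Requires `npm install puppeteer`.

// Pull the first number (with optional thousands separators or decimals) out of a text snippet.
function parseNumber(text) {
  const match = String(text).match(/-?\d[\d,]*(?:\.\d+)?/);
  return match ? Number(match[0].replace(/,/g, "")) : null;
}

// Load the page in headless Chrome, wait for the JS-rendered element, and read its text.
async function scrapeNumber(url, selector) {
  const puppeteer = require("puppeteer"); // loaded lazily so parseNumber works on its own
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" });
    await page.waitForSelector(selector);
    const text = await page.$eval(selector, (el) => el.textContent);
    return parseNumber(text);
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Example call (placeholder values):
// scrapeNumber("https://example.com/dashboard", "#daily-number").then(console.log);
```

Saved as a file, this could be run once a day from cron with an entry along the lines of `0 8 * * * node /path/to/scrape.js`, where the path is whatever location the script lives in.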

droopsnoot · September 14, 2020, 8:19am

Doesn’t that site provide an API so that you can access the data without having to scrape the site? I just wonder if part of the reason that they generate the site in that way is to make it difficult for people to scrape it.

James_Hibbard · September 14, 2020, 8:20am

It’s probably just using some kind of JS framework.

ennaido · September 14, 2020, 8:24am

They are relatively new, and have no API yet. What I need to get is a 4-digit number, once a day. My script will be no different than me manually accessing the site and writing down the number into my script.

I guess I can manually do it for the time being.

ennaido · September 14, 2020, 8:25am

Thank you for the suggestion. I will take a look into that, but I mostly prefer not to use any libraries or extra scripts, unless it is totally impossible otherwise. E.g. I’m wondering if it is possible via Ajax…

James_Hibbard · September 14, 2020, 8:43am

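If you try that from a browser, you’ll most likely run into CORS issues: the browser won’t let your script read a response from another domain unless that site explicitly allows it.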

John_Betong · September 14, 2020, 11:40am

Try PHP’s cURL extension or the wget command-line tool. Both should get the complete web page, though I’m not sure about the “dynamically produced JavaScript” part.

m_hutley · September 14, 2020, 11:47am

A site that loads its content via JavaScript will not magically produce the content via curl, or wget, or anything else that only downloads the page - the JavaScript has to be run by a browser (headless or otherwise) in order to result in the output.
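To illustrate the point, here is a small sketch of what a plain HTTP client typically gets back from a client-rendered page. The markup is made up for illustration (it is not the real page), and visibleText is a deliberately crude, hypothetical helper rather than a proper HTML parser.

```javascript
// What curl or file_get_contents() typically receives from a client-rendered page:
// an empty mount point plus a script tag. This markup is invented for illustration.
const rawHtml = `<!DOCTYPE html>
<html>
  <head><title>Dashboard</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>`;

// Strip script blocks and tags, leaving only the text a plain HTTP client could see.
function visibleText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

const text = visibleText(rawHtml);
console.log(text);            // only the page title survives
console.log(/\d/.test(text)); // false: there is no number in the static markup
```

The number only shows up after the script has run and filled in the empty div, which is why nothing is found in the downloaded source.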

That said, the site seems to be going through a lot of steps to prevent its content from being read. What site is this, and have you read their Terms of Service?

ennaido · September 14, 2020, 12:44pm

Thank you for your insights. The site has no TOS or the like. I even contacted them asking for a possible API. Here is the link actually:

https://www.ampleforth.org/dashboard/

They use JS to display their data. I need to get, once a day at the same time, Oracle Rate and Price Target values.

I’m doing it manually now, and if I manage to do it with cron or something, it will be no different from doing it manually - I mean no extra load on that site and no scraping of extra pages. Just a single request per day.

James_Hibbard · September 14, 2020, 12:56pm

It’s a React app which is just pulling in some content dynamically. I don’t think they’re trying to prevent anything.

ahundiak · September 14, 2020, 1:03pm

What you could do is bring up the site in a browser, press F12 and look at the network tab. Refresh the browser and see what sort of request is being triggered by the JavaScript to get your number. Duplicating the request itself with curl should not be difficult, though if they have any sort of security on it you might still hit a roadblock.
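As a rough sketch of that idea, here it is with Node's built-in fetch instead of curl, assuming the request seen in the network tab returns JSON. The endpoint URL and the field path in the example call are hypothetical and would need to be replaced with whatever the network tab actually shows; pick and fetchValue are illustrative helper names.

```javascript
// Sketch only: the endpoint URL and JSON field names in the example call are hypothetical.
// Copy the real ones from the request shown in the browser's network tab.

// Walk a dotted path such as "data.rate" through a parsed JSON object.
function pick(obj, path) {
  return path.split(".").reduce(
    (node, key) => (node != null && typeof node === "object" ? node[key] : undefined),
    obj
  );
}

// Repeat the page's own data request. Uses the global fetch in Node.js 18 and later.
async function fetchValue(url, path) {
  const res = await fetch(url, { headers: { Accept: "application/json" } });
  if (!res.ok) throw new Error(`Request failed with HTTP ${res.status}`);
  return pick(await res.json(), path);
}

// Example call (placeholder values):
// fetchValue("https://example.com/api/stats", "data.rate").then(console.log);
```

Because this runs on the server rather than in a browser, CORS does not come into it, but any token or header checks on the endpoint would still apply.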

ennaido · September 15, 2020, 5:21am

The owner of the site replied with a link to their simple API, which is actually what I needed. Strangely, I was not able to find it on their GitHub earlier. Here’s the link, in case anyone needs it sometime:

https://github.com/ampleforth/Ampleforth-Wiki/wiki/Ampleforth-API
