  1. #1
    SitePoint Member
    Join Date
    Aug 2010
    Posts
    17

    Scheduling a screen scrape. . .

    Greetings all, and thanks in advance to anyone taking the time to read this. . .double thanks if you post a reply.

    I need to screen scrape a website that is linked to my uncle's business's inventory database. His normal website checks in w/ the server at his business every night and updates product prices. The people who set up his business software also set up his website, so I don't want to mess w/ anything there. Problem is, his website sucks.

    So I've made an alternate site he hopes to use that looks great, and I'm using PHP w/ cURL to scrape his website. My site basically works like this:
    1) Visitor clicks a link to a catalog section
    2) If they're the first visitor that day to visit that page, the screen scrape is performed to get that page's info from the other site, which is then cached and served to all other visitors that day. This happens every day and is how I keep my site updated and in sync w/ his other website and business.

    The problem is, it takes forever if you're the first person to click that link that day. 20-30 seconds sometimes. After it's cached, it's fast.

    So, I'm wondering how I would go about automating the first click on every catalog link to occur at 7 AM. That way, the site is current, updated, and cached before the first visitor even comes that day, drastically reducing my load time. The main reason I need to do this is b/c I think Google is penalizing the site b/c it takes so long to load.

    I've thought a cron job might do it, but I don't know anything about them really, so I have no idea how to set it up.

    Any help would be greatly appreciated!
    Thanks in advance!

  2. #2
    Keeper of the SFL StarLion's Avatar
    Join Date
    Feb 2006
    Location
    Atlanta, GA, USA
    Posts
    3,747
    Cron is simple enough, but why?

    Do you not have access to the database powering the existing site?
    Never grow up. The instant you do, you lose all ability to imagine great things, for fear of reality crashing in.

  3. #3
    SitePoint Member
    Join Date
    Aug 2010
    Posts
    17

    No Access. . .

    Unfortunately, I don't have access to the main DB. It's linked up to the business software and running ASP, so there are several things going on there that I don't want to disturb. Plus the owner of the main site will not let me have access to the DB. So alas, after thinking long and hard on how to do it, my meager skills came to the conclusion that I'd have to go east to get west in this case.
    I figured cron would be the way to go, but I've never used it and I'm unfamiliar w/ its syntax. Everything I have set up so far is working beautifully. I just need to automate that first click so everything gets cached for the rest of the day and is up to date.
    What would I need to do there? Again, I'm barely above newbie when it comes to PHP, but I've got this far and feel good about the results. In my mind, I'm thinking I'll need a PHP file, or even a webbot maybe, to pose as a browser and every morning at 7 access all the links on my site, thereby making all the other dominoes I've got set up fall.
    Could you give a cron example?

  4. #4
    Keeper of the SFL StarLion's Avatar
    Join Date
    Feb 2006
    Location
    Atlanta, GA, USA
    Posts
    3,747
    SSH in.
    Step 1: Make sure you can run php files via CLI.
    Input: php /absolute/path/to/your/php/script.php <enter> (note: this is NOT a URL.)
    If the script runs (i.e. your OS doesn't tell you "I don't know what 'php' is" or "you don't have permission to run that file"), move on to Step 2.
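
    Before running the script itself, a quick shell check along these lines can confirm that a php command-line binary is reachable. This is only a sketch: have_cmd is a helper name made up for the example, and hosts differ in where (or whether) they install the PHP CLI.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command can be found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd php; then
    # Show where the binary lives. Cron usually runs with a minimal PATH,
    # so this absolute path is the safer thing to use in a crontab line.
    echo "php found at: $(command -v php)"
else
    echo "php not found on PATH; ask your host where the CLI binary lives"
fi
```

    If a path is printed, using that full path in place of the bare word php in Step 2 avoids surprises when cron's PATH differs from your login shell's.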

    Step 2:
    Decide how often you want the script to run. In your case, you said once a day, at 7 AM Server Time. (Note: You may have to do some math to figure out when '7 AM' is, relative to your server. Typing date at the prompt shows the server's current time.)
    Input: crontab -e <enter> (Enters crontab in Edit mode. Creates a crontab if one does not exist.)
    (At this point, you're most likely in vi, with an empty file. crontab -e opens whatever editor your EDITOR or VISUAL variable points to; the keystrokes below assume vi. There may be a header row, if your distro is nice to you. If it does put a header row in, make sure you start the input on a blank line.)
    Input: i (this puts you in Insert mode)
    Input: 0 7 * * * php /absolute/path/to/your/script/again.php (This says: On the 0th minute of the 7th hour [which is your 7:00], on every day of the month, in every month of the year, and on every day of the week [don't ask], execute what comes next.)
    Input: <escape>:wq (yes, type the colon. The escape gets you OUT of Insert mode. The colon means "the following is a command" and wq = "Write and Quit")
    At this point the system should return you to the command prompt, and tell you that it's installed the new crontab. You're done.
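
    The crontab line above runs a PHP script. If the goal is simply to "click" every catalog link once so the cache gets built, the same job could also be sketched as a small shell script that calls curl. Everything specific here is an assumption for illustration: the file name warm_cache.sh, BASE_URL, the page paths in PAGES, and the --run switch are placeholders, not details of the real site.

```shell
#!/bin/sh
# warm_cache.sh (hypothetical name): request each catalog page once so the
# day's cache is built before real visitors arrive.
# BASE_URL and PAGES are placeholders; substitute the real catalog links.
BASE_URL="${BASE_URL:-http://www.example.com}"
PAGES="${PAGES:-/catalog/section-one /catalog/section-two}"

# Fetch one URL, throw away the body, and print only the HTTP status code.
# The generous timeout allows for a slow first-hit scrape.
fetch_status() {
    curl --silent --output /dev/null --max-time 120 \
         --write-out '%{http_code}' "$1"
}

# Visit every page in PAGES, logging a timestamp, the status, and the URL.
# Returns non-zero if any page did not answer with 200.
warm_all() {
    failed=0
    for page in $PAGES; do
        status=$(fetch_status "$BASE_URL$page")
        echo "$(date '+%Y-%m-%d %H:%M:%S') $status $BASE_URL$page"
        [ "$status" = "200" ] || failed=1
    done
    return "$failed"
}

# Only hit the network when asked to, so the file can be checked safely.
if [ "${1:-}" = "--run" ]; then
    warm_all
fi
```

    Saved at a placeholder location such as /absolute/path/to/warm_cache.sh, the crontab line would then read: 0 7 * * * /bin/sh /absolute/path/to/warm_cache.sh --run >> /absolute/path/to/warm_cache.log 2>&1 (the redirect keeps a log you can read later). Typing crontab -l afterwards lists the installed entries so you can confirm the line was saved.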

