  1. #1
    googlicious graymatter bvarvel's Avatar
    Join Date
    Sep 2002
    Location
    Katy, TX
    Posts
    952

    How to Scrape a File

    Let's say I have a text file that contains the following:

    Addr: 1014 SIERRA SHADOWS DRCity: KATY Zip: 77450-3815
    Sub: SILVERSTONE 1Year Built: 1979
    There could be literally hundreds of these entries per text file and I need to run through the file and extract the information I need. The file would be pretty much in that format in a .txt file (or maybe htm).

    Can someone give me an example of a PHP script that could 'scrape' this file and return the information I need in the following (CSV) format, or point me in the right direction?

    1014 SIERRA SHADOWS DR, KATY,77450,SILVERSTONE

  2. #2
    SitePoint Wizard swdev's Avatar
    Join Date
    Oct 2004
    Location
    UK
    Posts
    1,053
    try this regular expression

    PHP Code:
     $lines[] = 'Addr: 1014 SIERRA SHADOWS DRCity: KATY Zip: 77450-3815
     Sub: SILVERSTONE 1Year Built: 1979';

     // Split on the field labels (case-insensitive) so only the values remain.
     $regexp = '~addr:|city:|zip:|sub:|1year built:~mi';

     foreach ($lines as $line)
     {
       $arr = preg_split($regexp, $line, -1, PREG_SPLIT_NO_EMPTY);
       echo $arr[0] . ' , ' . $arr[1] . ' , ' . $arr[2] . ' , ' . $arr[3] . '<br>';
     }

  3. #3
    googlicious graymatter bvarvel's Avatar
    Join Date
    Sep 2002
    Location
    Katy, TX
    Posts
    952
    I rewrote that expression like so:

    PHP Code:
      foreach ($lines as $line)
      {
         $arr = preg_split($regexp, $line, -1, PREG_SPLIT_NO_EMPTY);
         echo trim($arr[0]) . ',' . trim($arr[1]) . ',' . trim($arr[2]) . ',' . trim($arr[3]) . '<br>';
      }
    That way if any white space is in there (which it was) it's removed from the file. So thanks!!!

    The only remaining part of my problem is this: I need to load the data from a text file, not have it included in a string. Any thoughts on that?

  4. #4
    SitePoint Wizard swdev's Avatar
    Join Date
    Oct 2004
    Location
    UK
    Posts
    1,053
    My pleasure.
    What I suggested was very much an example, just to show where the various matches ended up in the array.
    Using trim is a good idea to remove leading / trailing spaces.
    It is a shame that you didn't want all of the matched elements, because then you could have used implode.
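
    For what it's worth, implode can still be used when only some of the pieces are wanted, by slicing the array first. A minimal sketch, assuming the same $regexp and sample record as in the posts above and that the first four pieces are the ones to keep:

```php
<?php
// Sketch only: the pattern and sample record are copied from the earlier posts.
$regexp = '~addr:|city:|zip:|sub:|1year built:~mi';
$line = "Addr: 1014 SIERRA SHADOWS DRCity: KATY Zip: 77450-3815\nSub: SILVERSTONE 1Year Built: 1979";

$arr = preg_split($regexp, $line, -1, PREG_SPLIT_NO_EMPTY);

// Keep the first four pieces (address, city, zip, subdivision),
// trim the whitespace from each, then join them with commas.
$wanted = array_map('trim', array_slice($arr, 0, 4));
echo implode(',', $wanted), "\n";
```

    Note that this keeps the full ZIP+4 value, since the split only removes the labels.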

  5. #5
    googlicious graymatter bvarvel's Avatar
    Join Date
    Sep 2002
    Location
    Katy, TX
    Posts
    952
    I've been searching for the last couple of hours and I still can't seem to find a way to read the data in from a file. Does anyone have any other ideas on how to accomplish this?

  6. #6
    SitePoint Guru
    Join Date
    Nov 2004
    Location
    Plano
    Posts
    643
    fopen() if the file is not on ur server

    otherwise (if im not mistaken) u can just use include();

  7. #7
    SitePoint Wizard swdev's Avatar
    Join Date
    Oct 2004
    Location
    UK
    Posts
    1,053

  8. #8
    googlicious graymatter bvarvel's Avatar
    Join Date
    Sep 2002
    Location
    Katy, TX
    Posts
    952
    I think the problem may be with the expression. I guess I should have mentioned that the quote above is not the only information in the file.

    What I quoted above was this:

    Addr: 1014 SIERRA SHADOWS DRCity: KATY Zip: 77450-3815
    Sub: SILVERSTONE 1Year Built: 1979
    When in fact, the file might look more like this (including the extraneous data):

    No Photo Short ReportSingle-FamilyML #: 2276317Status: X LP: $78,000
    SP: $
    County: HARRISKM: 485L Area: 36 - South Katy Area LP/SF: $ 67.94
    Addr: 1014 SIERRA SHADOWS DRCity: KATY Zip: 77450-3815
    Sub: SILVERSTONE 1Year Built: 1979/Appraisal District
    SqFt: 1148/Appraisal DistrictBedrooms: 3 FB/HB: 2/0
    Style: Traditional Lot Size: 5617/Appraisal District DOM: 185
    Garage: 0/ Stories: 1
    List Firm: COLD20/Coldwell Banker United, Office #: (281) 579-2300Appt #:
    (281)579-2300/Office
    Dir: I-10 to south on Mason Rd., right on Rock Canyon, right on Sierra
    Shadows.
    Property Description - Public: Lowest price in KISD - south of I-10 (July
    15,2004). Cute two or three bedroom home has vinyl siding, large den with
    fireplace, and nice formal dining room. The third bedroom is part of a
    garage conversion. Home could be converted back to two bedroom with a
    garage. Above ground pool and storage shed in back. The a/c outside as
    well as furnace inside were replaced in 2002, per seller, as was much of
    the fence. Home also has a great covered deck!
    What I need to do is filter out only the data I need, then put it into a comma separated file.

    I looked through the functions you mentioned and can't seem to get a firm grasp on them. Any help is GREATLY appreciated.
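
    One possible way to pull just those fields out of a noisy record is a single pattern run with preg_match_all. This is only a sketch: the pattern is an assumption based on the one sample record above, it keeps only the 5-digit part of the zip, and it treats any digits stuck to the front of "Year Built:" as not being part of the subdivision name (to match the desired output in the first post).

```php
<?php
// Part of the sample record from the post above, in a nowdoc so "$" is left alone.
$text = <<<'TXT'
County: HARRISKM: 485L Area: 36 - South Katy Area LP/SF: $ 67.94
Addr: 1014 SIERRA SHADOWS DRCity: KATY Zip: 77450-3815
Sub: SILVERSTONE 1Year Built: 1979/Appraisal District
SqFt: 1148/Appraisal DistrictBedrooms: 3 FB/HB: 2/0
TXT;

// Assumed layout: Addr, City, Zip and Sub always appear in this order.
// "i" = case-insensitive, "s" = let "." and the pattern span line breaks.
$pattern = '~Addr:\s*(.+?)\s*City:\s*(.+?)\s*Zip:\s*(\d{5})(?:-\d{4})?\s*'
         . 'Sub:\s*(.+?)\s*\d*Year Built:~is';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1]..$m[4] hold address, city, 5-digit zip and subdivision.
    echo implode(',', array_slice($m, 1, 4)), "\n";
}
```

    Because preg_match_all returns every match, the same loop should handle a file with hundreds of records, as long as each one follows the layout assumed here.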

  9. #9
    SitePoint Wizard swdev's Avatar
    Join Date
    Oct 2004
    Location
    UK
    Posts
    1,053
    Is all this data on a single line, or on multiple lines?
    Are there multiple occurrences of this data in the file?

  10. #10
    googlicious graymatter bvarvel's Avatar
    Join Date
    Sep 2002
    Location
    Katy, TX
    Posts
    952
    This data is on multiple lines, and yes, there could potentially be hundreds of recurrences. What I'm doing is dumping data from an MLS system and I want to scrub just the addresses from the rest of the data. I then want it output to a CSV file so I can import the addresses into a contact management system.

    THANKS SO MUCH FOR ALL THE HELP YOU'VE PROVIDED SO FAR! Any more thoughts?
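
    For the CSV output step, PHP's fputcsv may be a safer choice than joining strings by hand, since it quotes fields where needed. A rough sketch, where $rows is a hypothetical array standing in for whatever the extraction step produces and the output goes to a temp file rather than a real path:

```php
<?php
// Hypothetical rows, as they might come out of the extraction step.
$rows = array(
    array('1014 SIERRA SHADOWS DR', 'KATY', '77450', 'SILVERSTONE'),
);

// Hypothetical output location; a temp file keeps the sketch runnable anywhere.
$out = tempnam(sys_get_temp_dir(), 'csv');
$fh = fopen($out, 'w');
foreach ($rows as $row) {
    // Delimiter, enclosure and escape character are passed explicitly.
    fputcsv($fh, $row, ',', '"', '\\');
}
fclose($fh);

// Read the result back so it can be inspected, then clean up.
$csv = file_get_contents($out);
echo $csv;
unlink($out);
```

    fputcsv may wrap some fields in double quotes (the street address, for example); most import tools read that as ordinary CSV.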

  11. #11
    SitePoint Zealot
    Join Date
    Aug 2004
    Location
    Madison, WI
    Posts
    191
    Quote Originally Posted by XtrEM3
    fopen() if the file is not on ur server

    otherwise (if im not mistaken) u can just use include();
    You are mistaken. include() will only dump the contents of the included file into the current file. It doesn't associate that file with a variable or anything else that could be worked with. The functions that swdev mentions will allow you to actually do something with the content of the other file.
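
    To illustrate the difference, here is a small sketch of two standard functions that do put the file's contents into a variable: file() and file_get_contents(). The temp file and its contents are stand-ins for the real text file.

```php
<?php
// Stand-in for the real data file: write the sample record to a temp file.
$path = tempnam(sys_get_temp_dir(), 'mls');
file_put_contents(
    $path,
    "Addr: 1014 SIERRA SHADOWS DRCity: KATY Zip: 77450-3815\n"
    . "Sub: SILVERSTONE 1Year Built: 1979\n"
);

// file() returns an array with one element per line.
$lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// file_get_contents() returns the whole file as one string.
$text = file_get_contents($path);

echo count($lines), " lines, ", strlen($text), " bytes\n";
unlink($path);
```

    Since each record here spans several lines, the single string from file_get_contents() is probably the easier one to run a multi-line regular expression against.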

    by the way, i have nothing to actually contribute to the current discussion

  12. #12
    SitePoint Wizard swdev's Avatar
    Join Date
    Oct 2004
    Location
    UK
    Posts
    1,053
    Me thinks you'll need the services of a RegEx expert and I'm definitely not one of those.

    An alternative: is it possible that the output from the MLS system could be changed to output just the address stuff, or to separate each column with some character?

    Sorry I can't be more help

