  1. #1
    SitePoint Zealot
    Join Date
    Dec 1999
    Location
    Highlands Ranch, CO
    Posts
    193

    Need major help with Reg. Expressions...

    I am trying to build a database of:
    Business Name
    Address
    Phone Number

    I would like to extract this information from another site's HTML:
    http://www.firstyellow.com/results.a...&txtCat=DIVING

    I am trying to figure out how to extract the info that I want (comma-delimited, or whatever) so that I can easily insert the data into my MySQL database.

    Since this site is database-driven, each listing is in its own table:
    Code:
     
    <table width="100%" border="0" cellspacing="0" cellpadding="0">
    <tr> 
    <td width="78%" height="76" valign="top"> 
    
    <table width="100%" border="0" cellspacing="0" cellpadding="0">
    <tr> 
    <td width="3%" bgcolor="FCD804">&nbsp;</td>
    <td width="67%" bgcolor="#FFFFFF"><span class="Bldarrial"><a href="javascript:;" onClick="MM_openBrWindow('bizcard.asp?ID=12769','Bizdetails','width=550,height=450')">
    Abanks Watersports &amp; Tours
    </a></span><br> <span class="txtBlack">Box 31206SMB Sth Church St George Town, <br>
    CAYMAN ISLANDS<br>
    Tel:(345) 945-1444</span> <br> </td>
    <td width="30%" bgcolor="#FFFFFF"><table width="100%" border="0" cellspacing="3" cellpadding="0">
    <tr> 
    <td width="30%"><div align="center"><a href="javascript:;" onClick="MM_openBrWindow('bizcard.asp?ID=12769','Bizdetails','width=510,height=430')"><img src="images/info.gif" border="0"></a></div></td>
    <td width="30%"><div align="center">
    
    <td width="30%"><div align="center">
    
    <tr class="txtBlack"> 
    <td nowrap><div align="center"><a href="javascript:;" onClick="MM_openBrWindow('bizcard.asp?ID=12769','Bizdetails','width=510,height=430')">More 
    Info</a></div></td>
    <td><div align="center">
    
    <td><div align="center">
    
    </table></td>
    </tr>
    <tr valign="top"> 
    <td colspan="3"><hr size="1" noshade></td>
    </tr>
    </table>
    
    
    <table width="100%" border="0" cellspacing="0" cellpadding="0">
    <tr> 
    <td width="3%" bgcolor="FCD804">&nbsp;</td>
    <td width="67%" bgcolor="#FFFFFF"><span class="Bldarrial"><a href="javascript:;" onClick="MM_openBrWindow('bizcard.asp?ID=12838','Bizdetails','width=550,height=450')">
    Aqua'aire
    </a></span><br> <span class="txtBlack">Box 30147SMB Morgan's Harbour West Bay, <br>
    CAYMAN ISLANDS<br>
    Tel:(345) 945-1953</span> <br> </td>
    <td width="30%" bgcolor="#FFFFFF"><table width="100%" border="0" cellspacing="3" cellpadding="0">
    <tr> 
    <td width="30%"><div align="center"><a href="javascript:;" onClick="MM_openBrWindow('bizcard.asp?ID=12838','Bizdetails','width=510,height=430')"><img src="images/info.gif" border="0"></a></div></td>
    <td width="30%"><div align="center">
    
    <td width="30%"><div align="center">
    
    <tr class="txtBlack"> 
    <td nowrap><div align="center"><a href="javascript:;" onClick="MM_openBrWindow('bizcard.asp?ID=12838','Bizdetails','width=510,height=430')">More 
    Info</a></div></td>
    <td><div align="center">
    
    <td><div align="center">
    
    </table></td>
    </tr>
    <tr valign="top"> 
    <td colspan="3"><hr size="1" noshade></td>
    </tr>
    </table>
    Does anyone have ideas on how I can extract the 'RED' info above (the business name, address, and phone number in each listing) and get it into a usable format? I have several hundred pages of data to extract, all following this template.

    TIA - BigTime :)

  2. #2
    beetle (Web-coding NINJA!)
    Join Date
    Jul 2002
    Location
    Dallas, TX
    Posts
    2,900
    This seems to do the trick:

    http://www.peterbailey.net/test/regex.php
    beetle a.k.a. Peter Bailey
  3. #3
    SitePoint Zealot
    Join Date
    Dec 1999
    Location
    Highlands Ranch, CO
    Posts
    193
    beetle - that is awesome! Thanks!

    I am only having one problem...

    When I try to get it to parse the whole page at the URL above, it runs into some 'hiccup' after the second entry, and I don't know why.

    This URL runs your code, but with the HTML pulled from the URL above:

    http://scubaaddict.com/parse.php

    Any ideas? I can't see any differences between the 3rd entry and any of the other entries (http://www.firstyellow.com/results.a...&txtCat=DIVING)
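
A hiccup like this is easier to pin down by testing each listing on its own rather than the whole page at once. The sketch below is in Python rather than the PHP used in the thread, and everything in it is an assumption for illustration: the `STRICT` pattern is only a stand-in for the linked code, the `failing_chunks` helper is hypothetical, and the three-entry `page` string is made-up data shaped like the listings above.

```python
import re

# Stand-in for the pattern under test (an assumption, not the linked code):
# it insists on a space before the first <br>.
STRICT = re.compile(r"txtBlack\">(.*?) <br>(.*?)<br>(.*?)</span>", re.DOTALL)


def failing_chunks(page, marker='<span class="Bldarrial">'):
    """Return (index, snippet) for each listing chunk the pattern fails on."""
    # Everything before the first marker is page header, not a listing.
    chunks = page.split(marker)[1:]
    failures = []
    for index, chunk in enumerate(chunks, start=1):
        if STRICT.search(chunk) is None:
            failures.append((index, chunk[:80]))
    return failures


# Made-up data: the third entry has no space before its first <br>.
page = (
    '<span class="Bldarrial">A</a></span><br> '
    '<span class="txtBlack">Addr one, <br>\nX<br>\nTel 1</span>'
    '<span class="Bldarrial">B</a></span><br> '
    '<span class="txtBlack">Addr two, <br>\nX<br>\nTel 2</span>'
    '<span class="Bldarrial">C</a></span><br> '
    '<span class="txtBlack">Addr three,<br>\nX<br>\nTel 3</span>'
)

for index, snippet in failing_chunks(page):
    print(f"listing {index} did not match: {snippet!r}")
```

Printing the raw snippet with `!r` makes stray or missing whitespace visible, which is the kind of difference that is hard to spot by eye in rendered HTML.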

  4. #4
    SitePoint Zealot
    Join Date
    Dec 1999
    Location
    Highlands Ranch, CO
    Posts
    193
    Actually - just found it:
    /450'\)\"\>(.*)\<\/a\>.*txtBlack\"\>(.*) \<br\>(.*)\<br\>(.*)\<\/span\>/U

    I removed the space just before the first \<br\>,
    since that space was not present in the next result set, and now it works.

    Thanks beetle!
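
For anyone reading later, here is a minimal sketch of a slightly more forgiving variant, written in Python rather than PHP (an assumption; the pattern above relies on PCRE's `/U` modifier, and the closest Python equivalent is a non-greedy quantifier such as `.*?`). Instead of deleting the space, it uses `\s*` so the match succeeds whether or not whitespace precedes the first `<br>`. The `SAMPLE` string is trimmed from the listing HTML in the first post, with the second entry edited to drop that space and the phone line assumed to read `Tel:(...)`; the helper names `extract_listings` and `to_csv` are hypothetical.

```python
import csv
import html
import io
import re

# Trimmed from the markup in the first post. The second entry is edited
# (an assumption) so that no space precedes its first <br>.
SAMPLE = """
<td width="67%" bgcolor="#FFFFFF"><span class="Bldarrial"><a href="javascript:;" onClick="MM_openBrWindow('bizcard.asp?ID=12769','Bizdetails','width=550,height=450')">
Abanks Watersports &amp; Tours
</a></span><br> <span class="txtBlack">Box 31206SMB Sth Church St George Town, <br>
CAYMAN ISLANDS<br>
Tel:(345) 945-1444</span> <br> </td>
<td width="67%" bgcolor="#FFFFFF"><span class="Bldarrial"><a href="javascript:;" onClick="MM_openBrWindow('bizcard.asp?ID=12838','Bizdetails','width=550,height=450')">
Aqua'aire
</a></span><br> <span class="txtBlack">Box 30147SMB Morgan's Harbour West Bay,<br>
CAYMAN ISLANDS<br>
Tel:(345) 945-1953</span> <br> </td>
"""

LISTING = re.compile(
    r"450'\)\">(.*?)</a>"          # business name inside the link
    r".*?txtBlack\">(.*?)\s*<br>"  # street line, optional whitespace before <br>
    r"(.*?)<br>"                   # country line
    r"(.*?)</span>",               # phone line
    re.DOTALL,                     # let "." span the line breaks in the markup
)


def extract_listings(page):
    """Return (name, address, phone) tuples for each listing in the page."""
    rows = []
    for name, street, country, phone in LISTING.findall(page):
        rows.append((
            html.unescape(name.strip()),             # turn &amp; back into &
            f"{street.strip()} {country.strip()}",
            re.sub(r"^Tel:\s*", "", phone.strip()),  # drop the "Tel:" label
        ))
    return rows


def to_csv(rows):
    """Serialize rows as CSV text; fields containing commas get quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "address", "phone"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_listings(SAMPLE)
print(to_csv(rows))
```

Going through a CSV writer rather than joining fields with commas matters here because the addresses themselves contain commas, so they need quoting before a bulk import into MySQL.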

