Thread: Parse HTML

  1. #1
    SitePoint Evangelist silversurfer5150's Avatar
    Join Date
    Aug 2010
    Posts
    534
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Parse HTML

    Hi guys,

    I am developing a mobile phone e-commerce store, and I want to be able to cut and paste a phone's specification from a well-known review site.

    I have already done the styling for this and it works well on my product pages. On other pages, however, I just want to extract certain information from the HTML below, which I will already have in the DB from the product page.

    I only need certain things, such as the CPU and memory, but as you can see they are nested in tables with no distinguishing markers: every label cell has class "ttl" and every value cell has class "nfo", so I can't pick out one cell from another by class or id.

    Here is the code I will have stored. Can someone tell me the best way to parse this with PHP?

    Thanks in advance

    Code:
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="4" scope="row">
    				General</th>
    			<td class="ttl">
    				<a href="network-bands.php3">2G Network</a></td>
    			<td class="nfo">
    				GSM 850 / 900 / 1800 / 1900</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="network-bands.php3">3G Network</a></td>
    			<td class="nfo">
    				HSDPA 850 / 1900 / 2100 /800</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="#" onclick="helpW('h_year.htm');">Announced</a></td>
    			<td class="nfo">
    				2010, August</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="#" onclick="helpW('h_status.htm');">Status</a></td>
    			<td class="nfo">
    				Available. Released 2010, August</td>
    		</tr>
    	</tbody>
    </table>
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="3" scope="row">
    				Body</th>
    			<td class="ttl">
    				<a href="#" onclick="helpW('h_dimens.htm');">Dimensions</a></td>
    			<td class="nfo">
    				111 x 62 x 14.6 mm</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="#" onclick="helpW('h_weight.htm');">Weight</a></td>
    			<td class="nfo">
    				161 g</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=keyboard">Keyboard</a></td>
    			<td class="nfo">
    				QWERTY</td>
    		</tr>
    	</tbody>
    </table>
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="4" scope="row">
    				Display</th>
    			<td class="ttl">
    				<a href="glossary.php3?term=display-type">Type</a></td>
    			<td class="nfo">
    				TFT capacitive touchscreen, 16M colors</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="#" onclick="helpW('h_dsize.htm');">Size</a></td>
    			<td class="nfo">
    				360 x 480 pixels, 3.2 inches (~188 ppi pixel density)</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=multitouch">Multitouch</a></td>
    			<td class="nfo">
    				Yes</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				&nbsp;</td>
    			<td class="nfo">
    				- Optical trackpad</td>
    		</tr>
    	</tbody>
    </table>
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="3" scope="row">
    				Sound</th>
    			<td class="ttl">
    				<a href="glossary.php3?term=call-alerts">Alert types</a></td>
    			<td class="nfo">
    				Vibration, MP3 ringtones</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=loudspeaker">Loudspeaker</a></td>
    			<td class="nfo">
    				Yes</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=audio-jack">3.5mm jack</a></td>
    			<td class="nfo">
    				Yes, <a href="blackberry_torch_9800-review-516p6.php#aq">check quality</a></td>
    		</tr>
    	</tbody>
    </table>
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="2" scope="row">
    				Memory</th>
    			<td class="ttl">
    				<a href="glossary.php3?term=memory-card-slot">Card slot</a></td>
    			<td class="nfo">
    				microSD, up to 32GB, 4GB card included</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=dynamic-memory">Internal</a></td>
    			<td class="nfo">
    				4 GB storage, 512 MB RAM, 512 MB ROM</td>
    		</tr>
    	</tbody>
    </table>
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="8" scope="row">
    				Data</th>
    			<td class="ttl">
    				<a href="glossary.php3?term=gprs">GPRS</a></td>
    			<td class="nfo">
    				Class 10 (4+1/3+2 slots), 32 - 48 kbps</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=edge">EDGE</a></td>
    			<td class="nfo">
    				Class 10, 236.8 kbps</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=3g">Speed</a></td>
    			<td class="nfo">
    				HSDPA; HSUPA</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=wi-fi">WLAN</a></td>
    			<td class="nfo">
    				Wi-Fi 802.11 b/g/n, UMA (carrier-dependent)</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=bluetooth">Bluetooth</a></td>
    			<td class="nfo">
    				Yes, v2.1 with A2DP</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=usb">USB</a></td>
    			<td class="nfo">
    				Yes, microUSB v2.0</td>
    		</tr>
    	</tbody>
    </table>
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="4" scope="row">
    				Camera</th>
    			<td class="ttl">
    				<a href="glossary.php3?term=camera">Primary</a></td>
    			<td class="nfo">
    				5 MP, 2592х1944 pixels, autofocus, LED flash, <a href="piccmp.php3?idType=1&amp;idPhone1=3203&amp;nSuggest=1">check quality</a></td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=camera">Features</a></td>
    			<td class="nfo">
    				Geo-tagging, continuous auto-focus, image stabilization</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=camera">Video</a></td>
    			<td class="nfo">
    				Yes, VGA@24fps</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=video-call">Secondary</a></td>
    			<td class="nfo">
    				No</td>
    		</tr>
    	</tbody>
    </table>
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="12" scope="row">
    				Features</th>
    			<td class="ttl">
    				<a href="glossary.php3?term=os">OS</a></td>
    			<td class="nfo">
    				BlackBerry OS 6.0</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=cpu">CPU</a></td>
    			<td class="nfo">
    				624 MHz</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=sensors">Sensors</a></td>
    			<td class="nfo">
    				Proximity</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=messaging">Messaging</a></td>
    			<td class="nfo">
    				SMS, MMS, Email, Push Email, IM</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=browser">Browser</a></td>
    			<td class="nfo">
    				HTML</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=fm-radio">Radio</a></td>
    			<td class="nfo">
    				No</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=gps">GPS</a></td>
    			<td class="nfo">
    				Yes, with A-GPS support</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=java">Java</a></td>
    			<td class="nfo">
    				Yes, MIDP 2.0</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="#" onclick="helpW('h_colors.htm');">Colors</a></td>
    			<td class="nfo">
    				Black, White, Dark Orange</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				&nbsp;</td>
    			<td class="nfo">
    				- Social feeds<br />
    				- BlackBerry maps<br />
    				- Document viewer (Word, Excel, PowerPoint)<br />
    				- Media player MP3/WMA/eAAC+/FlAC/OGG player<br />
    				- Video player DivX/XviD/MP4/WMV/H.263/H.264<br />
    				- Organizer<br />
    				- Voice memo/dial</td>
    		</tr>
    	</tbody>
    </table>
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="4" scope="row">
    				Battery</th>
    			<td class="ttl">
    				&nbsp;</td>
    			<td class="nfo">
    				Standard battery, Li-Ion 1300 mAh</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=stand-by-time">Stand-by</a></td>
    			<td class="nfo">
    				Up to 432 h (2G) / Up to 336 h (3G)</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=talk-time">Talk time</a></td>
    			<td class="nfo">
    				Up to 5 h 30 min (2G) / Up to 5 h 40 min (3G)</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=music-playback-time">Music play</a></td>
    			<td class="nfo">
    				Up to 30 h</td>
    		</tr>
    	</tbody>
    </table>
    <table cellspacing="0">
    	<tbody>
    		<tr>
    			<th rowspan="3" scope="row">
    				Misc</th>
    			<td class="ttl">
    				<a href="glossary.php3?term=sar">SAR US</a></td>
    			<td class="nfo">
    				0.91 W/kg (head) &nbsp; &nbsp; 0.68 W/kg (body) &nbsp; &nbsp;</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="glossary.php3?term=sar">SAR EU</a></td>
    			<td class="nfo">
    				0.86 W/kg (head) &nbsp; &nbsp; 0.81 W/kg (body) &nbsp; &nbsp;</td>
    		</tr>
    		<tr>
    			<td class="ttl">
    				<a href="#" onclick="helpW('h_price.htm');">Price group</a></td>
    			<td class="nfo">
    				<img src="http://st2.gsmarena.com/vv/price/pg5.gif" title="About 240 EUR" /></td>
    		</tr>
    	</tbody>
    </table>
    "Persistence is the path to perfection"

  2. #2
    . shoooo... silver trophy logic_earth's Avatar
    Join Date
    Oct 2005
    Location
    CA
    Posts
    9,013
    Mentioned
    8 Post(s)
    Tagged
    0 Thread(s)
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.


  3. #3
    SitePoint Evangelist silversurfer5150's Avatar
    Join Date
    Aug 2010
    Posts
    534
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi there,
    Thanks for this. Is this built into PHP 5.2+? Could you give me a simple example of its use? The ones in the manual are difficult to follow.

    Thanks

  4. #4
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    5,129
    Mentioned
    152 Post(s)
    Tagged
    0 Thread(s)
    See the first comment on that page; it has an HTML example.
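
In case a concrete example helps, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). That the DOM extension is what was suggested above is an assumption on my part. The function names clean_text and extract_specs and the cut-down sample markup are invented for illustration; the markup only mirrors the structure of the tables in the first post. The idea is to use the <th> of each table as the section name and pair each "ttl" cell with the "nfo" cell in the same row.

```php
<?php
// Collapse whitespace (including non-breaking spaces from &nbsp;) and trim.
function clean_text($text)
{
    $text = str_replace("\xC2\xA0", ' ', $text); // UTF-8 bytes of U+00A0
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical helper: returns array(section => array(label => value)).
function extract_specs($specHtml)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep warnings about loose markup quiet
    // The XML declaration is a commonly used hint so that loadHTML() reads the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $specHtml . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $specs = array();
    foreach ($xpath->query('//table') as $table) {
        // The single <th> in each table names the section ("Memory", "Features", ...).
        $th = $xpath->query('.//th', $table)->item(0);
        $section = $th === null ? '' : clean_text($th->textContent);
        foreach ($xpath->query('.//tr', $table) as $row) {
            $ttl = $xpath->query('td[@class="ttl"]', $row)->item(0);
            $nfo = $xpath->query('td[@class="nfo"]', $row)->item(0);
            if ($ttl === null || $nfo === null) {
                continue; // not a label/value row
            }
            $specs[$section][clean_text($ttl->textContent)] = clean_text($nfo->textContent);
        }
    }
    return $specs;
}

// Cut-down sample in the same shape as the stored spec.
$spec = '<table cellspacing="0"><tbody>'
      . '<tr><th rowspan="2" scope="row">Memory</th>'
      . '<td class="ttl"><a href="glossary.php3?term=dynamic-memory">Internal</a></td>'
      . '<td class="nfo"> 4 GB storage, 512 MB RAM, 512 MB ROM</td></tr>'
      . '</tbody></table>'
      . '<table cellspacing="0"><tbody>'
      . '<tr><th rowspan="12" scope="row">Features</th>'
      . '<td class="ttl"><a href="glossary.php3?term=os">OS</a></td>'
      . '<td class="nfo"> BlackBerry OS 6.0</td></tr>'
      . '<tr><td class="ttl"><a href="glossary.php3?term=cpu">CPU</a></td>'
      . '<td class="nfo"> 624 MHz</td></tr>'
      . '</tbody></table>';

$specs = extract_specs($spec);
echo $specs['Features']['CPU'], "\n";    // 624 MHz
echo $specs['Memory']['Internal'], "\n"; // 4 GB storage, 512 MB RAM, 512 MB ROM
```

Grouping by section keeps labels that occur in more than one table (for example "Features" under Camera) from overwriting each other. Rows whose label cell is only &nbsp; end up under an empty-string key, so they would need separate handling if you want them.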

  5. #5
    SitePoint Evangelist silversurfer5150's Avatar
    Join Date
    Aug 2010
    Posts
    534
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Hi guys, I am trying to parse the string above using the simple_html_dom library, and I keep getting a "call to a member function on a non-object" error. My code is below:

    PHP Code:
    if (!empty($result['spec'])) {
        // Parse the stored spec HTML with simple_html_dom.
        $html[$result['product_id']] = str_get_html($result['spec']);
        // Inner HTML of the first <th> in the spec.
        $ret[$result['product_id']] = $html[$result['product_id']]->find('th', 0)->innertext;
        //echo $ret[$result['product_id']];
        var_dump($ret[$result['product_id']]);
    }
    This works with the code below, where I pass in part of the above table as a string, so I am guessing it has something to do with the whitespace, tabs, line breaks, etc. Is there any way to remove them all and give me the format below?

    PHP Code:
                                           $html[$result['product_id']] = str_get_html('<tr><th rowspan="4" scope="row">General</th><td class="ttl"><a href="network-bands.php3">2G Network</a><a href="network-bands.php3">2G Bogworth</a></td><td class="nfo">GSM 850 / 900 / 1800 / 1900</td></tr>'); 
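
On the question of stripping the whitespace: a minimal sketch with preg_replace, assuming the markup contains no <pre> blocks or other content where whitespace matters. The helper name squash_html_whitespace and the sample string are made up for illustration. Whitespace between tags is normally harmless to an HTML parser, so this may well not be the cause of the error.

```php
<?php
// Hypothetical helper: removes whitespace-only gaps between tags and
// collapses any remaining run of whitespace to a single space.
function squash_html_whitespace($html)
{
    $html = preg_replace('/>\s+</', '><', $html); // gaps between adjacent tags
    $html = preg_replace('/\s+/', ' ', $html);    // runs inside text or tags
    return trim($html);
}

$spec = "<tr>\n\t<th rowspan=\"4\" scope=\"row\">\n\t\tGeneral</th>\n"
      . "\t<td class=\"nfo\">\n\t\tGSM 850 / 900</td>\n</tr>";

$squashed = squash_html_whitespace($spec);
echo $squashed, "\n";
// <tr><th rowspan="4" scope="row"> General</th><td class="nfo"> GSM 850 / 900</td></tr>
```

A single space is left where a line break preceded cell text; trimming the extracted text afterwards takes care of that.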

  6. #6
    SitePoint Evangelist silversurfer5150's Avatar
    Join Date
    Aug 2010
    Posts
    534
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    If anyone else is trying to do this with a string from a database, remember to run it through htmlspecialchars_decode() first; otherwise the simple_html_dom script is trying to parse tags that look like: &lt; p &gt;

    Pretty obvious really, but if you're reading this because of a "call to a member function on a non-object" error, there's a good chance you made the same mistake as me.
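
To illustrate the difference, here is a small sketch. It uses PHP's bundled DOMDocument instead of simple_html_dom so that it runs on its own; the $stored value and the helper first_th_text are made up for the example. With simple_html_dom, the equivalent fix would presumably be str_get_html(htmlspecialchars_decode($result['spec'])).

```php
<?php
// Hypothetical stand-in for a value read from the DB: the markup was stored entity-encoded.
$stored = '&lt;table&gt;&lt;tr&gt;&lt;th rowspan=&quot;4&quot; scope=&quot;row&quot;&gt;'
        . 'General&lt;/th&gt;&lt;/tr&gt;&lt;/table&gt;';

// Hypothetical helper: returns the text of the first <th>, or null if there is none.
function first_th_text($html)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $th = $doc->getElementsByTagName('th')->item(0);
    return $th === null ? null : trim($th->textContent);
}

// Encoded: the parser sees only text, so there is no <th> to find.
var_dump(first_th_text($stored));                          // NULL
// Decoded: real tags again, so the lookup succeeds.
var_dump(first_th_text(htmlspecialchars_decode($stored))); // string(7) "General"
```

Checking the lookup result for null before dereferencing it, as the helper does, turns the fatal error into something you can handle or log.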

  7. #7
    SitePoint Enthusiast OMGCarlos's Avatar
    Join Date
    Apr 2012
    Location
    Boston, MA
    Posts
    91
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Haha good tip, I do this ALL the time...I'll probably do it in 15 minutes from now too.

  8. #8
    SitePoint Evangelist silversurfer5150's Avatar
    Join Date
    Aug 2010
    Posts
    534
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hahaha,

    Easily done, isn't it, Carlos? It wasn't until I output the string to a file that I realized what form it was being stored in in the DB. Again, it's one of the drawbacks of developing on top of someone else's CMS rather than customizing with Zend or something similar: you don't know what's going on behind the scenes in your own backyard!

