-
Mar 24, 2009, 01:40 #1
- Join Date
- Mar 2009
- Posts
- 26
Need urgent help with html parsing with php
I'm new to PHP development and a complete 'no-good' with regular expressions!
I'm hitting my head on a wall trying to parse an HTML page.
Any help in this regard would be greatly welcomed.
I need to write a PHP script which will do this:
Parse any HTML page line by line. Wherever it finds text, it will extract the text, store it in a separate variable (an array or something), and replace it with a unique token.
Say my HTML page is something like this:
Code:$page_content = "<html> <title> My Page </title> <body> <div> Hello! </div> <div> Its a beautiful world </div> </body> </html>";
It should output two things:
first, the original HTML with the text replaced by tokens, and second, an array mapping each token to its string.
Code:$new_page_content = "<html> <title> TOK_TITLE_1 </title> <body> <div> TOK_DIV_1 </div> <div> TOK_DIV_2 </div> </body> </html>";
Code:$token_strings_array = array( 'TOK_TITLE_1' => "My Page", 'TOK_DIV_1' => "Hello!", 'TOK_DIV_2' => "Its a beautiful world" );
Are there any standard libraries or classes that I could possibly use?
I need help on this ASAP!
-
Mar 24, 2009, 02:34 #2
- Join Date
- Apr 2008
- Location
- North-East, UK.
- Posts
- 6,111
Not as simple as I first thought, but fun nonetheless.
PHP Code:<?php
$aTokens = array();
$sOriginalHTML = '
<html>
<title>
My Page
</title>
<body>
<div>
Hello!
</div>
<div>
Its a beautiful world
</div>
</body>
</html>
';
// Match each run of text that sits between two tags (or at the start/end of the string).
$sParsedHTML = preg_replace_callback(
    '~(?<=^|>)[^><]+?(?=<|$)~',
    function ($aMatches) use (&$aTokens) {
        static $iCounter = 0;
        $sText = trim($aMatches[0]);
        if (strlen($sText) > 0) {
            // Real text: store it under a new token and put the token in its place.
            $sKey = 'TOKEN_' . $iCounter++;
            $aTokens[$sKey] = $sText;
            return $sKey;
        }
        // Whitespace-only run between tags: drop it.
        return '';
    },
    $sOriginalHTML
);
#Tokens
print_r($aTokens);
/*
Array
(
    [TOKEN_0] => My Page
    [TOKEN_1] => Hello!
    [TOKEN_2] => Its a beautiful world
)
*/
#Templated HTML
echo $sParsedHTML;
/*
<html><title>TOKEN_0</title><body><div>TOKEN_1</div><div>TOKEN_2</div></body></html>
*/
@AnthonySterling: I'm a PHP developer, a consultant for oopnorth.com and the organiser of @phpne, a PHP User Group covering the North-East of England.
-
Mar 24, 2009, 03:11 #3
- Join Date
- Mar 2009
- Posts
- 26
Hey... Thanks for your quick reply. I'll check this out and let you know.
-
Mar 24, 2009, 05:10 #4
- Join Date
- May 2005
- Location
- UK
- Posts
- 65
I just had a look at this and thought that a simple foreach loop replacing the tokens in the text with str_replace would do the job? As long as each key in the array is the same as the token in the HTML, it will work efficiently.
PHP Code:<?php
$new_page_content = "<html>
<title>
TOK_TITLE_1
</title>
<body>
<div>
TOK_DIV_1
</div>
<div>
TOK_DIV_2
</div>
</body>
</html>";
$token_strings_array = array(
    'TOK_TITLE_1' => "My Page",
    'TOK_DIV_1' => "Hello",
    'TOK_DIV_2' => "Its a beautiful world"
);
foreach ($token_strings_array as $k => $v)
{
    $new_page_content = str_replace($k, $v, $new_page_content);
}
echo $new_page_content;
?>
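One thing to watch with the loop above is that one token can be a prefix of another once the numbers reach two digits (TOK_DIV_1 and TOK_DIV_10, say). A minimal sketch of that case follows, using made-up sample data; the variable names $template, $tokens, $looped and $restored are only for illustration. PHP's strtr(), given an array, tries the longest keys first and never rescans text it has already replaced, so it may be a safer fit here.

```php
<?php
// Hypothetical template in which one token is a prefix of another.
$template = '<div>TOK_DIV_1</div><div>TOK_DIV_10</div>';
$tokens = array(
    'TOK_DIV_1'  => 'Hello',
    'TOK_DIV_10' => 'Goodbye',
);

// Sequential str_replace: TOK_DIV_1 also matches the start of TOK_DIV_10.
$looped = $template;
foreach ($tokens as $k => $v) {
    $looped = str_replace($k, $v, $looped);
}
echo $looped, "\n";   // <div>Hello</div><div>Hello0</div>

// strtr with an array tries the longest keys first, so each token maps cleanly.
$restored = strtr($template, $tokens);
echo $restored, "\n"; // <div>Hello</div><div>Goodbye</div>
```

If the token names can never overlap, the foreach version behaves the same; strtr() just removes the need to think about ordering.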
-
Mar 24, 2009, 05:13 #5
- Join Date
- Apr 2008
- Location
- North-East, UK.
- Posts
- 6,111
The problem lies in the fact he needs to substitute HTML values for tokens first, then replace the tokens.
-
Mar 24, 2009, 07:05 #6
- Join Date
- Mar 2009
- Posts
- 26
@SilverBulletUK:
You are the man!! The script worked exactly as I wanted.
@alig4321: Thanks pal... I would have to do that too eventually. So you really solved my future query...
-
Mar 24, 2009, 07:15 #7
- Join Date
- May 2005
- Location
- UK
- Posts
- 65
-
Mar 24, 2009, 08:45 #8
Another way is to use DOM, which gives you real text nodes. It would most likely support nested elements too.
PHP Code:<?php
$html = '<html><head><title>My Page</title></head><body><div>Hello!</div><div>Its a beautiful world</div></body></html>';
$tokens = array();
header( 'Content-type: text/plain' );
// Parse the markup into a DOM tree.
$doc = new DOMDocument();
$doc->loadHTML( $html );
// Collect every text node in the document.
$xp = new DOMXPath( $doc );
$textNodes = $xp->query( '*//text()' );
foreach ( $textNodes as $elm ) {
    // The token prefix is built from the parent tag name, e.g. TOK_DIV_.
    $str = 'TOK_' . strtoupper( $elm->parentNode->nodeName ) . '_';
    // Find the first unused number for this prefix.
    $int = 0;
    while ( isset( $tokens[ $str . ++$int ] ) );
    $tokens[ $str . $int ] = $elm->nodeValue;
    // Overwrite the text node's content with the token.
    $elm->replaceData( 0, strlen( $elm->nodeValue ), $str . $int );
}
var_dump( $doc->saveHTML(), $tokens );
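As a rough sketch of how the DOM approach might cope with nested elements and with pages that contain whitespace between tags or <script>/<style> blocks, the XPath query can filter the text nodes before they are tokenised. The sample markup, the $counters array and the $templated variable below are assumptions made up for illustration, not part of the code above.

```php
<?php
// Hypothetical sample input with a nested <b>, a <style> block and a <script> block.
$html = '<html><head><title>My Page</title><style>p { color: red; }</style></head>'
      . '<body><div>Hello! <b>Big</b> world</div><script>var x = 1;</script></body></html>';

// Collect parser warnings quietly instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$xp = new DOMXPath($doc);
// Keep only text nodes that are not whitespace-only and not inside <script> or <style>.
$nodes = $xp->query(
    '//text()[normalize-space() and not(ancestor::script) and not(ancestor::style)]'
);

$tokens = array();
$counters = array(); // per-tag running numbers, e.g. DIV => 2
foreach ($nodes as $node) {
    $tag = strtoupper($node->parentNode->nodeName);
    $counters[$tag] = isset($counters[$tag]) ? $counters[$tag] + 1 : 1;
    $key = 'TOK_' . $tag . '_' . $counters[$tag];
    $tokens[$key] = trim($node->nodeValue);
    // Replace the whole text node (including surrounding spaces) with the token.
    $node->nodeValue = $key;
}

$templated = $doc->saveHTML();
print_r($tokens);
// TOK_TITLE_1 => My Page, TOK_DIV_1 => Hello!, TOK_B_1 => Big, TOK_DIV_2 => world
```

Note that the two pieces of text around the nested <b> become separate tokens (TOK_DIV_1 and TOK_DIV_2), and that trimming drops the spaces next to the <b>, so the templated page is not whitespace-identical to the input.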
-
Mar 24, 2009, 08:51 #9
- Join Date
- Apr 2008
- Location
- North-East, UK.
- Posts
- 6,111
Nice work Logic, I much prefer yours; it's much more concise and its intent is quite clear.
-
Mar 24, 2009, 23:07 #10
- Join Date
- Mar 2009
- Posts
- 26
-
Apr 13, 2009, 03:34 #11
- Join Date
- Mar 2009
- Posts
- 26
In this regular expression
'~(?<=^|>)[^><]+?(?=<|$)~'
how can I prevent the text between <script ...></script> and <style ...></style> from getting matched?
I tried some modifications to this regex but, much to my frustration, was not able to achieve this.
Here are my attempts:
(?<=^|>)(?!style$|script$)[^><]+?(?=<|$)
(?<=^|>)[^><(?!style$|script$)]+?(?=<|$)
Neither of these works.
Where can I find a good tutorial for learning to write smart regular expressions, so I don't have to ask dumb questions ;(
I visited some sites but ended up getting more and more confused. Please help!
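For what it's worth, the two attempts fail for different reasons: in the first, the lookahead (?!style$|script$) inspects the text that follows the current position, not the tag that came before it; in the second, the lookahead sits inside a character class, where its characters are treated as literals. One possible approach, sketched here under the assumption that the PCRE backtracking verbs (*SKIP)(*F) are available (they are in the PCRE bundled with recent PHP versions), is to let the pattern consume whole script/style blocks first and then discard them. The sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical sample input: the <style> and <script> contents must stay untouched.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><div>Hello!</div><script>var x = "<b>";</script><div>Bye</div></body></html>';

$tokens = array();
$counter = 0;

// First alternative: match a whole <script>/<style> block, then (*SKIP)(*F)
// throws that match away and resumes scanning after it.
// Second alternative: the original "text between tags" pattern.
$pattern = '~<(script|style)\b[^>]*>.*?</\1\s*>(*SKIP)(*F)|(?<=^|>)[^><]+?(?=<|$)~is';

$templated = preg_replace_callback(
    $pattern,
    function ($m) use (&$tokens, &$counter) {
        $text = trim($m[0]);
        if ($text === '') {
            return $m[0]; // leave whitespace-only runs as they are
        }
        $key = 'TOKEN_' . $counter++;
        $tokens[$key] = $text;
        return $key;
    },
    $html
);

print_r($tokens);       // TOKEN_0 => Hello!, TOKEN_1 => Bye
echo $templated, "\n";
```

This is still a regex over HTML, so it will likely trip on HTML comments, attribute values containing ">" and similar edge cases; for arbitrary pages a DOM parser is probably the more robust choice.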
-
Apr 23, 2009, 15:31 #12
- Join Date
- Apr 2009
- Posts
- 8
Hi asacool,
I had a similar need: extracting plain text from our web pages. (I am a project manager and needed to do that for our legal department.) I started with the biterscripting sample script WebPageToText and modified it to suit my requirements. I am not a programmer, but it was easy. Perhaps you can take the same approach?
The best way to try that script out is to download biterscripting; it is free. Follow the installation instructions at their web site, biterscripting . com . The script is open source, so you can look at the code and modify it as necessary (it sounds like you are a software person). They also have other sample scripts and documentation on that web site that you may find useful.
Since my situation was very similar, I thought I should also make you aware of some other things you may not have considered when extracting plain text from web pages.
- Special character such as
- Code enclosed in {}
- etc
Jenni