  1. #1
    SitePoint Enthusiast
    Join Date
    Mar 2009
    Posts
    26

    Need urgent help with html parsing with php

    I'm new to PHP development and hopeless with regular expressions!

    I'm banging my head against a wall trying to parse an HTML page.
    Any help would be greatly appreciated.

    I need to write a PHP script that does the following:

    Parse any HTML page line by line. Wherever it finds text, it should extract the text, store it in a separate variable (an array or something similar), and replace it with a unique token.

    Say my HTML page is something like this:

    Code:
    $page_content = "<html>
    <title>
     My Page
    </title>
    <body>
      <div>
        Hello!
      </div>
      <div>
        Its a beautiful world
      </div>
    </body>
    </html>";

    It should give me two things:
    first, the original HTML with the text replaced by tokens, and second, an array mapping each token to its string.


    Code:
    $new_page_content = "<html>
    <title>
     TOK_TITLE_1
    </title>
    <body>
      <div>
        TOK_DIV_1
      </div>
      <div>
        TOK_DIV_2
      </div>
    </body>
    </html>";
    Code:
    $token_strings_array = array(
    'TOK_TITLE_1' => "My Page",
    'TOK_DIV_1' => "Hello!",
    'TOK_DIV_2' => "Its a beautiful world"
    );
    What would be the best way to do this?

    Are there any standard libraries or classes that I could possibly use?

    I need help with this ASAP!

  2. #2
    AnthonySterling (Twitter: @AnthonySterling)
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Not as simple as I first thought, but fun none-the-less.

    PHP Code:
    <?php
    $aTokens = array();

    $sOriginalHTML = '
    <html>
        <title>
            My Page
        </title>
        <body>
            <div>
                Hello!
            </div>
            <div>
                Its a beautiful world
            </div>
        </body>
    </html>
    ';

    // Match every run of text that sits between two tags (or at either end of the string).
    $sParsedHTML = preg_replace_callback(
        '~(?<=^|>)[^><]+?(?=<|$)~',
        // The closure writes into $aTokens by reference.
        function ($aMatches) use (&$aTokens) {
            static $iCounter = 0;
            $sText = trim($aMatches[0]);
            if ($sText === '') {
                // Whitespace-only run: return it unchanged so the layout survives.
                return $aMatches[0];
            }
            $sKey = 'TOKEN_' . $iCounter++;
            $aTokens[$sKey] = $sText;
            // Swap only the trimmed text, keeping the surrounding indentation.
            return str_replace($sText, $sKey, $aMatches[0]);
        },
        $sOriginalHTML
    );

    #Tokens
    print_r($aTokens);
    /*
    Array
    (
        [TOKEN_0] => My Page
        [TOKEN_1] => Hello!
        [TOKEN_2] => Its a beautiful world
    )
    */

    #Templated HTML
    echo $sParsedHTML;
    /*
    <html>
        <title>
            TOKEN_0
        </title>
        <body>
            <div>
                TOKEN_1
            </div>
            <div>
                TOKEN_2
            </div>
        </body>
    </html>
    */
    @AnthonySterling: I'm a PHP developer, a consultant for oopnorth.com and the organiser of @phpne, a PHP User Group covering the North-East of England.

  3. #3
    SitePoint Enthusiast
    Join Date
    Mar 2009
    Posts
    26
    Hey, thanks for your quick reply. I'll check this out and let you know.

  4. #4
    SitePoint Enthusiast
    Join Date
    May 2005
    Location
    UK
    Posts
    65
    I just had a look at this and thought a simple foreach loop that replaces the tokens in the text with str_replace would do the job. As long as each key in the array is the same as the token in the HTML, it will work perfectly well.

    PHP Code:
    <?php
    $new_page_content = "<html>
    <title>
     TOK_TITLE_1
    </title>
    <body>
      <div>
        TOK_DIV_1
      </div>
      <div>
        TOK_DIV_2
      </div>
    </body>
    </html>";

    $token_strings_array = array(
        'TOK_TITLE_1' => "My Page",
        'TOK_DIV_1' => "Hello",
        'TOK_DIV_2' => "Its a beautiful world"
    );

    // Swap each token in the template back for its original string.
    foreach ($token_strings_array as $k => $v)
    {
        $new_page_content = str_replace($k, $v, $new_page_content);
    }

    echo $new_page_content;

    ?>
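
One caveat with a str_replace loop like the one above, which shows up once a page has ten or more tokens of the same kind: str_replace() handles one key at a time, so replacing TOK_DIV_1 also rewrites the start of TOK_DIV_10. A possible alternative is strtr(), which takes the whole map at once and tries the longest keys first. This is only a sketch; restoreTokens() is a hypothetical helper name, not something from the thread.

```php
<?php
// Hypothetical helper (not from the thread): put the original strings
// back into a tokenised template in a single pass.
function restoreTokens(string $template, array $tokens): string
{
    // strtr() with an array tries the longest keys first and never
    // rescans text it has already substituted, so TOK_DIV_10 is not
    // mistaken for TOK_DIV_1 followed by a "0".
    return strtr($template, $tokens);
}
```

For example, restoreTokens('<div>TOK_DIV_1</div><div>TOK_DIV_10</div>', ['TOK_DIV_1' => 'Hello', 'TOK_DIV_10' => 'World']) gives '<div>Hello</div><div>World</div>', whereas the loop would leave 'Hello0' in the second div.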

  5. #5
    AnthonySterling (Twitter: @AnthonySterling)
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    The problem lies in the fact he needs to substitute HTML values for tokens first, then replace the tokens.


  6. #6
    SitePoint Enthusiast
    Join Date
    Mar 2009
    Posts
    26
    @SilverBulletUK:

    You are the man! The script worked exactly as I wanted.


    @alig4321: Thanks, pal. I would have had to do that too eventually, so you've solved my future query as well.

  7. #7
    SitePoint Enthusiast
    Join Date
    May 2005
    Location
    UK
    Posts
    65
    Quote Originally Posted by SilverBulletUK View Post
    The problem lies in the fact he needs to substitute HTML values for tokens first, then replace the tokens.

    Too true. I got ahead of myself and assumed a database structure where the tag and content already existed and could easily be extracted into the array before running the foreach loop. In the above case, your solution was just right.

  8. #8
    logic_earth
    Join Date
    Oct 2005
    Location
    CA
    Posts
    9,013
    Another way using DOM, gets real text nodes. Which would most likely support nested elements too.
    PHP Code:
    <?php

    $html = '<html><head><title>My Page</title></head><body><div>Hello!</div><div>Its a beautiful world</div></body></html>';
    $tokens = array();

    header( 'Content-type: text/plain' );

    $doc = new DOMDocument();
    $doc->loadHTML( $html );

    // Select every text node in the document.
    $xp = new DOMXPath( $doc );
    $xp = $xp->query( '*//text()' );

    foreach ( $xp as $elm ) {
        // Token prefix is built from the parent tag name, e.g. TOK_DIV_.
        $str = 'TOK_' . strtoupper( $elm->parentNode->nodeName ) . '_';
        $int = 0;

        // Advance $int to the first number not yet used for this prefix.
        while ( isset( $tokens[ $str . ++$int ] ) );

        $tokens[ $str . $int ] = $elm->nodeValue;
        $elm->replaceData( 0, strlen( $elm->nodeValue ), $str . $int );
    }

    var_dump( $doc->saveHTML(), $tokens );
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.
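
The DOM approach above can be sketched for indented markup with nested elements, where the whitespace between tags also forms text nodes. This is a minimal variation, not logic_earth's code: tokenizeTextNodes() is a hypothetical helper name, and the normalize-space() predicate is an added assumption that whitespace-only nodes should be left alone.

```php
<?php
// Hypothetical helper (not from the thread): tokenise every non-blank
// text node and return [tokenised HTML, token => text map].
function tokenizeTextNodes(string $html): array
{
    $tokens = [];
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them,
    // since real-world markup is rarely perfect.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // normalize-space() is an empty string for whitespace-only nodes,
    // so the predicate filters them out.
    foreach ($xpath->query('//text()[normalize-space()]') as $node) {
        $prefix = 'TOK_' . strtoupper($node->parentNode->nodeName) . '_';
        $n = 1;
        while (isset($tokens[$prefix . $n])) {
            $n++;
        }
        $tokens[$prefix . $n] = trim($node->nodeValue);
        // Replace the whole text node, surrounding whitespace included.
        $node->nodeValue = $prefix . $n;
    }

    return [$doc->saveHTML(), $tokens];
}
```

With a div containing "Hello <b>big</b> world", the text before and after the nested b element becomes two separate div tokens and the b element gets its own token.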


  9. #9
    AnthonySterling (Twitter: @AnthonySterling)
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Nice work, Logic. I much prefer yours: it's much more concise and its intent is quite clear.

  10. #10
    SitePoint Enthusiast
    Join Date
    Mar 2009
    Posts
    26
    Quote Originally Posted by logic_earth View Post
    Another way using DOM, gets real text nodes. Which would most likely support nested elements too.

    Hey, thanks! I'll give this a try. The HTML that I need to parse will be dynamic, so I'll try out both of these scripts.

  11. #11
    SitePoint Enthusiast
    Join Date
    Mar 2009
    Posts
    26
    In this regular expression

    '~(?<=^|>)[^><]+?(?=<|$)~'

    how can I prevent the text between <script ...></script> and <style ...></style> from getting matched?


    I tried modifying the regex but, much to my frustration, could not get it to work.

    Here are my attempts:

    (?<=^|>)(?!style$|script$)[^><]+?(?=<|$)

    (?<=^|>)[^><(?!style$|script$)]+?(?=<|$)


    Neither of these does the job.

    Where can I find a good tutorial for learning to write smart regular expressions, so I can stop asking dumb questions? ;(

    I visited some sites but ended up more and more confused. Please help!
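
One possible way to keep script and style contents out of the match, sketched here rather than taken from the thread, is to let the pattern consume a whole script or style element first and then discard it with PCRE's (*SKIP)(*FAIL) backtracking verbs, so that only the second alternative ever produces a match. tokenizeSkippingScripts() is a hypothetical helper name, and the pattern stays a heuristic: it assumes no ">" inside attribute values and does not handle HTML comments.

```php
<?php
// Hypothetical helper (not from the thread): tokenise text between tags
// while leaving <script> and <style> elements untouched.
function tokenizeSkippingScripts(string $html): array
{
    $tokens = [];
    $pattern = '~<(script|style)\b[^>]*>.*?</\1\s*>(*SKIP)(*FAIL)' // consume and reject
             . '|(?<=^|>)[^><]+?(?=<|$)~is';                       // the original text matcher

    $out = preg_replace_callback(
        $pattern,
        function (array $m) use (&$tokens) {
            $text = trim($m[0]);
            if ($text === '') {
                return $m[0]; // whitespace only: leave as it is
            }
            $key = 'TOKEN_' . count($tokens);
            $tokens[$key] = $text;
            // Swap only the trimmed text so surrounding whitespace is kept.
            return str_replace($text, $key, $m[0]);
        },
        $html
    );

    return [$out, $tokens];
}
```

The first alternative never returns a match itself; it only moves the search position past the closing tag, which is why neither lookahead attempt above could do the same job from inside the text matcher.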

  12. #12
    SitePoint Member
    Join Date
    Apr 2009
    Posts
    8
    Hi asacool:

    I had a similar need: extracting plain text from our web pages. (I am a project manager and needed to do that for our legal department.) I started with the biterscripting sample script WebPageToText and modified it to suit my requirements. I am not a programmer, but it was easy. Perhaps you can take the same approach?

    The best way to try that script out is to download biterscripting; it is free. Follow the installation instructions on their web site, biterscripting . com . The script is open source, so you can look at the code and modify it as necessary (it sounds like you are a software person). They also have other sample scripts and documentation on that web site that you may find useful.

    Since my situation was very similar, I thought I should also make you aware of some other things you may not have considered when extracting plain text from web pages.

    • Special characters such as &nbsp;
    • Code enclosed in {}
    • etc


    Jenni
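
For the special-character point above, a minimal PHP sketch of one way to handle entities such as &nbsp; once the text has been pulled out of the page. plainText() is a hypothetical helper name, and treating a non-breaking space as an ordinary space is an assumption that may not suit every page.

```php
<?php
// Hypothetical helper (not from the thread): normalise extracted text.
function plainText(string $fragment): string
{
    // Turn entities such as &nbsp; and &amp; into real characters.
    $text = html_entity_decode($fragment, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // &nbsp; decodes to U+00A0, which trim() does not strip by default,
    // so map it to an ordinary space first.
    $text = str_replace("\u{00A0}", ' ', $text);
    // Collapse runs of whitespace into a single space.
    return trim(preg_replace('/\s+/u', ' ', $text));
}
```

Running the token strings through a helper like this before storing them keeps the array free of markup-level escapes.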

