  1. #1
    SitePoint Guru Husain's Avatar
    Join Date
    Sep 2001
    Posts
    620

    Build keywords from marked-up text

    I am storing articles, marked up with HTML, in a database. To make the articles easier to search, I have another column that contains keywords. I build the keyword list simply by stripping out all HTML, punctuation, and words that are fewer than 3 characters long.

    This stripping business is where I am stuck.

    This is what I have at the moment. Any help will be greatly appreciated.

    PHP Code:
    <?php

    $string = 'Replacement may &copy;contain references of the form <em>\\n</em> or (since PHP 4.0.4) <strong>$n</strong>, with the latter form being the preferred one.   Every such reference will be replaced by the text captured by the n\'th parenthesized pattern.

        <em>n</em> can be from 0 to 99 &mdash; and \\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern.'
    ;

    // ---------------------------------------------------------------------

    $search = array(
        "'\n'",
        "'\t'",
        "/\s\s+/",
    );

    $replace = array(
        "",
        "",
        " ",
        " ",
    );

    $keywords = strip_tags($string);
    $keywords = preg_replace($search, $replace, $keywords);
    //$keywords = preg_replace("[\D]", '', $keywords);

    $keywords = preg_replace("/(.*?)([A-Za-z0-9\s]*)(.*?)/", "$2", $keywords);

    echo("<textarea rows=\"10\" cols=\"20\" style=\"width: 90%\">$string</textarea>");
    echo("<textarea rows=\"10\" cols=\"20\" style=\"width: 90%\">$keywords</textarea>");

    ?>
    The output I am expecting would look something like this:

    Replacement may contain references the form since PHP 404 with the latter form being the preferred one Every such reference will replaced the text captured the nth parenthesized pattern can from 0 99 and refers the text matched the whole pattern Opening parentheses are counted from left right starting from 1 obtain the number the capturing subpattern
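
    A minimal sketch of the stripping stage that leads toward output like the above, assuming only ASCII letters and digits need to be kept. The function name `strip_to_words` is hypothetical and not from the thread, and the sketch does not yet remove short words.

    ```php
    <?php
    // Hypothetical helper (not from the thread): reduce marked-up text to plain words.
    function strip_to_words($html)
    {
        $text = strip_tags($html);                             // drop HTML tags
        $text = preg_replace('/&[A-Za-z0-9#]+;/', '', $text);  // drop entities such as &copy; or &mdash;
        $text = preg_replace('/[^A-Za-z0-9\s]+/', '', $text);  // drop punctuation and other symbols
        $text = preg_replace('/\s+/', ' ', $text);             // collapse newlines, tabs and repeated spaces
        return trim($text);
    }

    echo strip_to_words('Replacement may &copy;contain <em>references</em> (since PHP 4.0.4).');
    // prints: Replacement may contain references since PHP 404
    ```

    Entities are removed before punctuation here; otherwise `&copy;` would lose its `&` and `;` and leave the stray word "copy" behind.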

  2. #2
    SitePoint Guru dbevfat's Avatar
    Join Date
    Dec 2004
    Location
    ljubljana, slovenia
    Posts
    684
    **DELETED**

    (I suggested strip_tags(), and obviously hadn't even read the post ...)

  3. #3
    SitePoint Guru Husain's Avatar
    Join Date
    Sep 2001
    Posts
    620
    An update to the code I gave in the initial example:
    PHP Code:
    <?php

    $string = 'Replacement may &copy;contain references of the form <em>\\n</em> or (since PHP 4.0.4) <strong>$n</strong>, with the latter form being the preferred one.   Every such reference will be replaced by the text captured by the n\'th parenthesized pattern.

        <em>n</em> can be from 0 to 99 &mdash; and \\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern.
        
        <div align="center">
           <img src="/images/fashion/yda2005/image0004.jpg" width="199" height="300" alt="" class="imageborder" />
           <img src="/images/fashion/yda2005/image0005.jpg" width="199" height="300" alt="" class="imageborder" />
        </div>'
    ;

    // ---------------------------------------------------------------------



    $search = array(
        "'\n'",
        "'\t'",
        "/\s\s+/",
    );

    $replace = array(
        "",  /* \n */
        "",  /* \t */
        " ", /* spaces */
    );

    $keywords = strtolower($string);
    $keywords = strip_tags($keywords);

    $keywords = preg_replace("/\&(.*?)\;/", "", $keywords);

    $keywords = preg_replace("/(.*?)([A-Za-z0-9\s]*)(.*?)/", "$2", $keywords);
    $keywords = preg_replace($search, $replace, $keywords);

    echo("<textarea rows=\"10\" cols=\"20\" style=\"width: 90%\">$string</textarea>");
    echo("<textarea rows=\"10\" cols=\"20\" style=\"width: 90%\">$keywords</textarea>");

    ?>
    This code removes all HTML markup and punctuation. What I cannot figure out is how to remove words that are shorter than 3 characters.

  4. #4
    SitePoint Wizard REMIYA's Avatar
    Join Date
    May 2005
    Posts
    1,351
    Split the string on whitespace into an array. Then loop over it with foreach and remove any element that is shorter than three characters.
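
    The split-and-loop idea could be sketched roughly as follows, assuming the text has already been stripped of markup and punctuation. The function name `drop_short_words` and its `$minLength` parameter are hypothetical, chosen only for illustration.

    ```php
    <?php
    // Hypothetical helper; assumes $text contains only words separated by whitespace.
    function drop_short_words($text, $minLength = 3)
    {
        // Split on any run of whitespace; PREG_SPLIT_NO_EMPTY discards empty pieces.
        $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

        $kept = array();
        foreach ($words as $word) {
            // strlen() counts bytes, which equals characters for plain ASCII text.
            if (strlen($word) >= $minLength) {
                $kept[] = $word;
            }
        }
        return implode(' ', $kept);
    }

    echo drop_short_words('can be from 0 to 99 and refers to the text');
    // prints: can from and refers the text
    ```

    Short numbers such as 0 and 99 are dropped as well under a pure length rule, whereas the expected output in the first post keeps them. Keeping them would need an extra condition, for example a `ctype_digit($word)` check.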

  5. #5
    SitePoint Guru Husain's Avatar
    Join Date
    Sep 2001
    Posts
    620
    Quote Originally Posted by REMIYA
    Split the string on whitespace into an array. Then loop over it with foreach and remove any element that is shorter than three characters.
    Is there a regular expression that gives the same result? And which would be faster: a regular expression or looping over an array? (On average, articles are around 1000 words, not counting HTML markup.)
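
    One regular expression that should give the same result is sketched below, assuming `$keywords` already contains only letters, digits and whitespace. The sample string is made up for illustration.

    ```php
    <?php
    // Assumed sample input: markup and punctuation have already been removed.
    $keywords = 'can be from 0 to 99 and refers to the text';

    // \b\w{1,2}\b matches a whole word that is one or two characters long.
    $keywords = preg_replace('/\b\w{1,2}\b/', '', $keywords);

    // Removing words leaves gaps behind, so collapse whitespace again and trim.
    $keywords = trim(preg_replace('/\s+/', ' ', $keywords));

    echo $keywords;
    // prints: can from and refers the text
    ```

    The thread does not settle which approach is faster. For articles of about 1000 words both are likely to be quick, and timing each with `microtime(true)` on real articles would be the way to find out.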

