  1. #1
    SitePoint Wizard
    Join Date
    Nov 2003
    Location
    United Kingdom
    Posts
    2,120
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Help! I have a problem with doing a little stop words list

    Hi,

    I have searched these forums and Google and can't find anything that helps. All I want to do is stop words of 3 letters or fewer from showing up. I also want a stop-word list for other words of 4 letters or more that I don't want in my text.

    How can I do this?

    So far I have written the following, which replaces parts of the text, strips all the punctuation and makes the text lower case.

    It also lists each word only once.

    Now I need to add a stop-word list to this script so that it shows only the allowed words from the text.

    PHP Code:
    $text = "nothing about it most that most don't know which ones don and just discovered normally thinks they can make quick buck two well's not";

    $desc = str_replace(' uk ', 'uk32n3', $text);
    $desc = str_replace('>', '> ', $desc);
    $desc = strip_tags($desc);
    $desc = str_replace('http', ' ', $desc);
    $desc = str_replace(' ', '  ', $desc);
    $desc = preg_replace('/[^a-z0-9]/i', ' ', $desc);
    $desc = preg_replace('[A-Z]', '[a-z]', $desc);
    $desc = strtolower($desc);
    $desc = explode(" ", $desc);

    $test = array_unique($desc);

    foreach ($test as $key => $desc) {
        echo "$desc ";
    }

    Please help me. Thanks!

  2. #2
    Raffles
    Join Date
    Sep 2005
    Location
    Tanzania
    Posts
    4,662
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    I'm not very good at regex, but you have to add something like {4-20} at the end of your expression to indicate that you only want words that are 4 letters or more (unlikely the word will have more than 20 letters). Just use if() to determine whether it fits your expression (i.e. if it has 4 letters or more).

  3. #3
    SitePoint Wizard
    That sounds interesting. This may be something that will work. Does anybody know how I can add it to my script above? I know hardly anything about it either. It took me nearly 8 hours to come up with the little script above.

  4. #4
    Raffles
    I think what you have to do is something like this: (^[a-z0-9]{4,20})$ assuming you're using eregi or something that's case insensitive.

    It's {4,20} with a comma, not a hyphen like I said previously.

    This might help you
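
    A minimal sketch of how the {4,20} quantifier suggested above could be applied to one word at a time. It assumes PHP 7 or later, where eregi is no longer available, so preg_match with the /i flag stands in for it; the helper name isLongWord is made up for illustration.

    ```php
    <?php
    // Hypothetical helper: true when $word is 4 to 20 letters or digits.
    function isLongWord(string $word): bool
    {
        // ^ and $ anchor the pattern to the whole word; /i ignores case.
        return preg_match('/^[a-z0-9]{4,20}$/i', $word) === 1;
    }

    $text = "nothing about it most that most don't know";

    // Split on whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Keep only the words that pass the length check.
    $kept = array_values(array_filter($words, 'isLongWord'));

    echo implode(' ', $kept), "\n";
    ```

    Note that "don't" is rejected here because the apostrophe is not in the character class; whether that is wanted depends on how punctuation is stripped beforehand.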

  5. #5
    SitePoint Wizard
    I have now tried the following, but I get no output at all once my text has been parsed.

    Can you help? Here's the script:

    PHP Code:
    $text = "nothing about it most that most don't know which ones don and just discovered normally thinks they can make quick buck two well's not";

    $desc = str_replace(' uk ', 'uk32n3', $text);
    $desc = str_replace('>', '> ', $desc);
    $desc = strip_tags($desc);
    $desc = str_replace('http', ' ', $desc);
    $desc = str_replace(' ', '  ', $desc);
    $desc = eregi("^[a-zA-Z0-9]{4,20}$", " ", $desc);
    $desc = strtolower($desc);
    $desc = explode(" ", $desc);

    $test = array_unique($desc);

    foreach ($test as $key => $desc) {
        echo "$desc ";
    }

    Thanks!

  6. #6
    Raffles
    I'm not sure, but don't you have to do something like
    PHP Code:
    if (eregi('^[a-zA-Z0-9]{4,20}$', $desc)) { /* do something if it passed the regex */ } else { die(); }
    ? I don't think you should have the empty quotes in the middle there.
    eregi doesn't change the string; it only reports whether the pattern matched. But that might not be the problem. It could be the regex itself, and since I don't know much about regexes, I'm afraid I can't help you much more than this.
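
    To illustrate the difference between matching and replacing, here is a small sketch. Since eregi was removed in PHP 7.0, preg_match and preg_replace stand in for it; the sample text is shortened from the thread and the variable names are only for illustration.

    ```php
    <?php
    $text = "nothing about it most that";

    // Matching returns 1 or 0, never the text, so assigning the result
    // back to the text variable would throw the original string away.
    // The ^...$ anchors also test the whole string, spaces included.
    $matched = preg_match('/^[a-z0-9]{4,20}$/i', $text);
    var_dump($matched); // int(0): the sentence as a whole is not one word

    // Replacing does return a string: every run of 1 to 3 letters or
    // digits that sits between word boundaries is removed.
    $longOnly = preg_replace('/\b[a-z0-9]{1,3}\b/i', '', $text);

    // Collapse the double spaces left behind by the removal.
    $longOnly = trim(preg_replace('/\s+/', ' ', $longOnly));

    echo $longOnly, "\n";
    ```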

  7. #7
    SitePoint Wizard
    No, it doesn't seem to work. Does anybody know any other ways of doing this?

    Thanks for your help.

  8. #8
    SitePoint Wizard
    I have come up with a solution which seems to work just fine. I now use the following snippet to check the length of each word and include it only if it has 4 or more characters.

    Here is the snippet in case anybody else needs it:

    PHP Code:
    if (strlen($desc) >= 4) {
        echo "$desc ";
    }

    For the other words of 4 or more letters that I don't want to include, I guess I could just use a string replace on them.
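
    A string replace on each unwanted word can miss a word at the very start or end of the text, because the surrounding spaces are part of the search string. One possible alternative, sketched here with an illustrative stop-word list that is not from the thread, is to compare whole words with array_diff once the text has been split.

    ```php
    <?php
    // Example stop words; this list is illustrative only.
    $stopWords = array('most', 'that', 'which');

    $text  = "most that most know which ones";
    $words = explode(' ', $text);

    // str_replace(' most ', ' ', $text) would miss "most" at the very
    // start of the string; comparing whole words avoids that.
    $kept = array_values(array_diff($words, $stopWords));

    echo implode(' ', $kept), "\n";
    ```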

  9. #9
    SitePoint Wizard Ren
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    PHP Code:
    $stopWords = array(
        'most'
    );

    function wordFilter($word)
    {
        global $stopWords;
        return strlen($word) > 3 && !in_array($word, $stopWords);
    }

    $text = "nothing about it most that most don't know which ones don and just discovered normally thinks they can make quick buck two well's not";

    $words = str_word_count(strtolower($text), 1);
    $words = array_filter(array_unique($words), 'wordFilter');

    foreach ($words as $word)
    {
        echo $word, "\n";
    }


  10. #10
    SitePoint Wizard
    Thanks!

    I have it all sorted now as far as I know. If I have more problems then I will let you know.

  11. #11
    SitePoint Wizard
    I have another problem. I have done the following, but I can't seem to get it to work.

    What I am trying to do is to get text from one table, parse the information using my script and then insert it into another table.

    My script is as follows:

    PHP Code:
    $sql = mysql_query("select Descrip, Con, title from articles order by ided asc limit 0, 20");
    $list = mysql_num_rows($sql);
    {
    while ($i < $list) {
        $lines = mysql_fetch_array($sql);
        $desc = $lines["Con"]." ".$lines["title"]." ".$lines["Descrip"];
        $desc = str_replace(' uk ', 'uk32n3', $desc);
        $desc = str_replace('>', '> ', $desc);
        $desc = strip_tags($desc);
        $desc = str_replace('http', ' ', $desc);
        $desc = str_replace(' accordingly ', ' ', $desc);
        $desc = str_replace(' again ', ' ', $desc);
        $desc = str_replace(' allows ', ' ', $desc);
        $desc = str_replace(' also ', ' ', $desc);
        $desc = str_replace(' with ', ' ', $desc);
        $desc = str_replace(' would ', ' ', $desc);
        $desc = str_replace(' your ', ' ', $desc);
        $desc = str_replace(' ', '  ', $desc);
        $desc = strtolower($desc);
        $desc = preg_replace('/[^a-z0-9]/i', ' ', $desc);
        $desc = explode(" ", $desc);

        $test = array_unique($desc);

        foreach ($test as $key => $desc) {
            if (strlen($desc) >= "4") {
                $intq = "INSERT INTO artdatabase ('description') VALUES ('$desc')";
                echo "$intq";
            }
        }

        $i++;
    }
    }
    How can I do this? My script above doesn't seem to work properly.

    The parsing works fine, but the insert section produces a separate insert for each word of a row instead of one insert containing all of the words for the row.

    How can I join all of the words together into one row again and then insert that new row into my database? Once that is done, the script should move on to the next row.

    Thanks!
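
    One way to get a single value per row is to collect the words in an array inside the loop and join them with implode afterwards. The sketch below assumes the same cleaning steps as the script above; the function name buildDescription and its $minLength parameter are hypothetical.

    ```php
    <?php
    // Hypothetical helper: reduce one row's text to a single string of
    // unique words with at least $minLength characters.
    function buildDescription(string $text, int $minLength = 4): string
    {
        $text = strtolower(strip_tags($text));

        // Replace anything that is not a letter or digit with a space.
        $text  = preg_replace('/[^a-z0-9]/', ' ', $text);
        $words = array_unique(explode(' ', $text));

        $kept = array();
        foreach ($words as $word) {
            if (strlen($word) >= $minLength) {
                $kept[] = $word; // collect here instead of inserting
            }
        }

        // One string per row, ready for a single INSERT after the loop.
        return implode(' ', $kept);
    }

    echo buildDescription("Most people don't know which ones work"), "\n";
    ```

    Calling this once per fetched row and inserting its return value would give one insert per article rather than one per word.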

  12. #12
    SitePoint Wizard
    Hi,

    Thanks for the script that you gave me. I am now trying it and it seems to work. The only problem is that I now want to insert the parsed information into MySQL using an insert. How can I do that? I have tried it, and it does many inserts with only one of the words in each value. I just want one insert with one value for the whole of the information.

    I think it is because I have the insert inside the loop, but how can I join up all of the words from the loop and then carry the result to the insert statement?

    Please help.

    Here is the script that I have so far:

    PHP Code:
    $stopWords = array(
        'most', 'about'
    );

    function wordFilter($word)
    {
        global $stopWords;
        return strlen($word) > 3 && !in_array($word, $stopWords);
    }

    $text = "nothing about it most that most don't know which ones don and just discovered normally thinks they can make quick buck two well's not";
    $desc = str_replace(' ', '  ', $text);
    $desc = strtolower($desc);
    $desc = preg_replace('/[^a-z0-9]/i', ' ', $desc);
    $desc = explode(" ", $desc);

    $test = array_unique($desc);

    $test = implode(" ", $test);

    $words = str_word_count(strtolower($test), 1);
    $words = array_filter(array_unique($words), 'wordFilter');

    foreach ($words as $word)
    {
        $intq = "INSERT INTO artd ('description') VALUES ('$word')";
        echo "$intq";
    }





    Thanks!
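
    A hedged sketch of the single insert asked about above. It assumes a PDO connection rather than the mysql_* functions, which were removed in PHP 7.0; the helper name insertDescription and the $pdo variable are hypothetical, while the table and column names come from the script above. The column name is left unquoted because MySQL treats text in single quotes as a string literal, not an identifier.

    ```php
    <?php
    // Hypothetical helper: store one description per article row.
    // $pdo is assumed to be an already connected PDO instance.
    function insertDescription(PDO $pdo, string $description): void
    {
        // The ? placeholder lets the driver handle quoting of the value.
        $stmt = $pdo->prepare('INSERT INTO artd (description) VALUES (?)');
        $stmt->execute(array($description));
    }

    $words = array('nothing', 'know', 'which', 'ones');

    // Join first, outside any loop over the words, then insert once.
    $description = implode(' ', $words);

    echo $description, "\n";
    // insertDescription($pdo, $description); // needs a live connection
    ```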

  13. #13
    SitePoint Wizard
    Hi,

    Thanks for all your help. After hours of doing all this I have now sorted it out.

    Thanks!

