  1. #1
    SitePoint Wizard Dean C's Avatar
    Join Date
    Mar 2003
    Location
    England, UK
    Posts
    2,906
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Regex to extract whole sentences that contain a certain word

    As per the title this is my failed attempt.

    Target string:
    Start of sentence one. This is a wordmatch one two three four. Another, sentence here.
    Regular expression:
    Code:
    \b[A-Z].*?(wordmatch).*?\b
    Expected match:
    Code:
    This is a wordmatch one two three four
    Actual match:
    Code:
    Start of sentence one. This is a wordmatch
    I'm a bit stumped on this one. Is there a nice punctuation escape character I'm missing out on? I understand why the word boundary won't work, but am I really going to have to create a character class to try and guess punctuation rules?
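To see why the attempt above returns more than one sentence, here is a minimal sketch that runs the same pattern on the same target string. The comments describe general regex engine behaviour and are not taken from the thread.

```php
<?php
$str = 'Start of sentence one. This is a wordmatch one two three four. Another, sentence here.';

// The engine tries the leftmost starting position first: the "S" of "Start".
// The lazy .*? then grows one character at a time (full stops included)
// until "wordmatch" can match, so the first sentence is swallowed as well.
// The trailing .*?\b stops at once, because a word boundary already sits
// right after "wordmatch".
preg_match('/\b[A-Z].*?(wordmatch).*?\b/', $str, $match);

echo $match[0], "\n"; // Start of sentence one. This is a wordmatch
```

A lazy quantifier only limits how far a match extends to the right; it does not move the starting point closer to the keyword.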

  2. #2
    messing with my mind fristi's Avatar
    Join Date
    Feb 2009
    Posts
    292
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    You have to imagine that you are a PC. What defines a sentence for a machine?
    A machine won't analyze the context of a sentence to find out whether it is a sentence or not, so we need boundaries. The classic boundaries for a sentence are punctuation marks, so unfortunately you will have to go down that road.

    PHP Code:
    $str = 'Start of sentence one. This is a wordmatch one two three four. Another, sentence here.';
    // A capital letter, then anything except "." or ";" on either side of the word.
    $regex = '/[A-Z][^\.;]*(wordmatch)[^\.;]*/';

    if (preg_match($regex, $str, $match)) {
        echo $match[0];
    }
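If the search word comes from user input, the same idea can be wrapped in a small function. This is only a sketch: the function name `sentencesContaining` is hypothetical, and the use of `preg_quote()` is an addition that the post above does not mention. The character class is the one from the post.

```php
<?php
// Hypothetical helper: returns every "sentence" (capital letter up to the
// next "." or ";") that contains $word.
function sentencesContaining(string $text, string $word): array
{
    // preg_quote() stops characters such as "." or "+" in $word from being
    // interpreted as regex syntax.
    $quoted = preg_quote($word, '/');
    $regex  = '/[A-Z][^.;]*' . $quoted . '[^.;]*/';

    preg_match_all($regex, $text, $matches);
    return $matches[0];
}

$text = 'Start of sentence one. This is a wordmatch one two three four. Another, sentence here.';
print_r(sentencesContaining($text, 'wordmatch'));
```

For the target string from the first post this returns a single element, `This is a wordmatch one two three four`.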
    To PHP or to Perl, that is the question!
    (Bucket - simpletest) User

  3. #3
    SitePoint Wizard Dean C's Avatar
    Thanks Fristi. Now try this:

    PHP Code:
    $string = 'Having using Kaspersky Antivirus in the past, and been highly impressed, I found myself looking for a new antivirus for a freshly built PC. I had been using AVG 7.5 for the last year, and after becoming fed up of being nagged to use the paid version of AVG8 I decided to try the latest offering from Kaspersky Labs, in the form of the 2009 version of their anti-virus software.';
    $regex = '/[A-Z][^\.;]*(virus)[^\.;]*/';

    // The subject must be $string, the variable defined above.
    if (preg_match_all($regex, $string, $match)) {
        print_r($match);
    }
    See how it's incorrectly matching the second sentence.

  4. #4
    messing with my mind fristi's Avatar
    Do you want the exact word virus matched or any word that contains the word virus?

    What do you want the regex to match in this example?

  5. #5
    SitePoint Wizard Dean C's Avatar
    Hi fristi,

    I want it to match any whole sentence that begins with, ends with or contains a string. In this case the string is virus. So it should match any whole sentence that contains the word virus. In this case it will be:

    Having using Kaspersky Antivirus in the past, and been highly impressed, I found myself looking for a new antivirus for a freshly built PC
    I had been using AVG 7.5 for the last year, and after becoming fed up of being nagged to use the paid version of AVG8 I decided to try the latest offering from Kaspersky Labs, in the form of the 2009 version of their anti-virus software
    I.e. both sentences, because they both contain the word virus.

  6. #6
    messing with my mind fristi's Avatar
    The problem with the second one is that it contains 7.5, so the regex ends the previous sentence at the 7 and can't start the next one there, because the next character is a 5 instead of a capital letter. This is a tricky one, since a . doesn't always mean a sentence boundary.

    I don't know if it can be done, I'll look into it.
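The effect of the decimal point can be reproduced with a shorter string. This is a sketch only; the sample text below is a shortened, made-up variant of the one in the thread, and the pattern is the one under discussion.

```php
<?php
// Shortened sample text (an assumption, not the original string).
$string = 'I bought a new antivirus for my PC. I had been using AVG 7.5 '
        . 'for a year before I tried their anti-virus software.';
$regex  = '/[A-Z][^\.;]*(virus)[^\.;]*/';

preg_match_all($regex, $string, $match);

// The second sentence is cut at the "." inside "7.5". After that point the
// next capital letter from which the pattern can still reach "virus" is the
// later "I", so the second match starts in the middle of the sentence.
print_r($match[0]);
```

The second element is `I tried their anti-virus software` rather than the whole sentence beginning `I had been using AVG 7.5`.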

  7. #7
    SitePoint Wizard Dean C's Avatar
    Just to give you a bit of background on what I'm trying to do, so you don't think it's a fruitless exercise: I'm doing a MySQL fulltext search, but I want to show a helpful search snippet in my results, preferably highlighting the match in bold like Google does. It'd be easy to find the match using strpos and then just pick 50 chars either side, but I want the search snippet to have some context. If you look at Google, their snippets all start at the beginning of a sentence rather than mid-sentence. They also don't cut off parts of any words when they cut off the snippet.

    Therefore, this is what I'm attempting to do, and this regex is the start of it.
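Once a sentence has been extracted, the remaining two requirements (no words cut in half, match shown in bold) could be handled along these lines. This is a rough sketch, not a finished implementation: the function name `makeSnippet`, the 160-character default, and the trailing ` ...` marker are all assumptions.

```php
<?php
// Hypothetical helper: shortens an already-extracted sentence at a word
// boundary and wraps each occurrence of $keyword in <b> tags.
function makeSnippet(string $sentence, string $keyword, int $maxLen = 160): string
{
    $snippet = trim($sentence);

    if (strlen($snippet) > $maxLen) {
        // Take one character more than the limit, then back up to the last
        // space so the snippet never ends in the middle of a word.
        $cut       = substr($snippet, 0, $maxLen + 1);
        $lastSpace = strrpos($cut, ' ');
        $snippet   = rtrim(substr($cut, 0, $lastSpace === false ? $maxLen : $lastSpace)) . ' ...';
    }

    // Escape for HTML first, then highlight case-insensitively.
    $safe    = htmlspecialchars($snippet);
    $pattern = '/' . preg_quote(htmlspecialchars($keyword), '/') . '/i';

    return preg_replace($pattern, '<b>$0</b>', $safe);
}

echo makeSnippet('I found a new antivirus for a freshly built PC', 'virus', 25), "\n";
// I found a new anti<b>virus</b> ...
```

The sketch counts bytes with `strlen()` and `substr()`; for multibyte text the `mb_*` equivalents would be the safer choice.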

  8. #8
    messing with my mind fristi's Avatar
    Let's keep our fingers crossed

    PHP Code:
    $string = 'Having using Kaspersky Antivirus in the past, and been highly impressed, 
               I found myself looking for a new antivirus for a freshly built PC. 
               I had been using AVG 7.5 for the last year, and after becoming fed up of 
               being nagged to use the paid version of AVG8 I decided to try the latest 
               offering from Kaspersky Labs, in the form of the 2009 version of their anti-virus software.';

    $regex = '/[A-Z][^\.;\?\!]*(virus)[^\.;\?\!]*/';
    // Temporarily turn decimal points into commas so "7.5" is not read as a sentence end.
    $string = preg_replace('/(\d+)\.(\d+)/', '$1,$2', $string);

    if (preg_match_all($regex, $string, $match)) {
        foreach ($match[0] as &$str) {
            // Put the decimal points back in each matched sentence.
            $str = preg_replace('/(\d+),(\d+)/', '$1.$2', $str);
        }
        unset($str); // break the reference left over from the loop

        print_r($match);
    }


  9. #9
    SitePoint Wizard Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    The problem remains that until you can describe exactly what a "sentence" is, you cannot hope to instruct the computer to isolate it for you.

    Say you came across a badly written sentence with 2000 chars: you wouldn't want to display it all, would you?

    If you came across a sentence which was 2 words long, that would not convey much information to the user either, would it?

    What you are trying to do is also described as creating a "Document Surrogate", one of the things I found out about in search patterns.

    There seem to be two ways to go on this. You either:

    a) try and constrain your MySQL full-text search in the database first, so you only bring back x chars from the table

    b) bring back everything from the table

    If you decide on a), explode on . and choose the array item which contains the word.

    If you elect for b), you are potentially able to show EACH full-text scoring term. E.g. if you have 2 articles mentioning virus, and the first says virus once but the second says it twice, then MySQL will score the second article higher than the first - so shouldn't you display BOTH words in some context? E.g.

    Your results:

    1) "I was going to buy a virus checker and thought, hell why bother? Just connect to the web when they are asleep. Take that virus."

    2) "I think I caught the virus when working in the laundry, all those sleeves."

    It's less about sentences and more about what conveys the most information to your users while staying manageable.
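The "explode on . and choose the array item which contains the word" suggestion above can be sketched as follows. The function name is hypothetical, and the sketch shares the weakness already discussed in the thread: a decimal point such as the one in 7.5 also splits the text.

```php
<?php
// Hypothetical helper: split on full stops and return the first chunk
// that contains $word (case-insensitive), or null if none does.
function firstChunkContaining(string $text, string $word): ?string
{
    foreach (explode('.', $text) as $chunk) {
        if (stripos($chunk, $word) !== false) {
            return trim($chunk);
        }
    }
    return null;
}

$text = 'Start of sentence one. This is a wordmatch one two three four. Another, sentence here.';
echo firstChunkContaining($text, 'wordmatch'), "\n";
// This is a wordmatch one two three four
```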

  10. #10
    Floridiot joebert's Avatar
    Join Date
    Mar 2004
    Location
    Kenneth City, FL
    Posts
    823
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    PHP Code:
    <?php

    $str = 'Having used Kaspersky Antivirus in the past, and been highly impressed, 
               I found myself looking for a new antivirus for a freshly built PC!!! 
               This is a sentence that should not match&hellip;
               I\'d been using AVG 7.5 for the last year, and after becoming fed up of 
               being nagged to use the paid version of AVG8 I decided to try the latest 
               offering from Kaspersky Labs, in the form of the 2009 version of their anti-virus software.';

    // Sentence boundary: a run of ! ? . ; or the HTML ellipsis entity.
    $bound = '(?:[!?.;]+|&hellip;)';
    // Filler: any non-boundary, non-digit character, or a number such as 7.5 or 2009.
    $filler = '(?:[^!?.;\d]|\d*\.?\d+)*';
    $keyword = 'virus';
    // "!" is prepended so the first sentence also has a boundary in front of it.
    preg_match_all("#{$bound}({$filler}{$keyword}{$filler})(?={$bound})#si", "!$str", $matches);
    echo $str, '<hr/><pre>', print_r($matches, true), '</pre>';

    ?>
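For anyone who wants to try the pattern above on a smaller input, here is a sketch using a shortened, made-up sample string. The pattern pieces are copied from the post; `preg_quote()` and the `array_map('trim', ...)` clean-up are additions that are not part of it.

```php
<?php
$bound   = '(?:[!?.;]+|&hellip;)';          // run of ! ? . ; or the ellipsis entity
$filler  = '(?:[^!?.;\d]|\d*\.?\d+)*';      // ordinary characters, or whole numbers like 7.5
$keyword = preg_quote('virus', '#');        // assumption: escape the keyword for the # delimiter

// Shortened sample text (an assumption, not the string from the post).
$text = "I need a new antivirus for my PC!!! This should not match&hellip; "
      . "I'd used AVG 7.5 before trying their anti-virus software.";

// The leading "!" gives the first sentence a boundary to start from.
preg_match_all("#{$bound}({$filler}{$keyword}{$filler})(?={$bound})#si", '!' . $text, $matches);

// Group 1 holds each sentence without its surrounding punctuation.
$sentences = array_map('trim', $matches[1]);
print_r($sentences);
```

Because numbers are consumed as a unit by the second alternative in `$filler`, the full stop in 7.5 is never tested as a boundary, so the second sentence comes back whole.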

