  1. #1
    SitePoint Addict
    Join Date
    Aug 2002
    Posts
    385

    Logic Behind Article Uniqueness?

    Hello, does anybody know the logic behind determining the uniqueness of an article as a percentage?

    For example sentence A:
    The quick brown fox jumped over the lazy dog.

    When compared to sentence B:
    The fast white fox leaped above the stubborn dog.

    You could say that sentence B is about 45% unique. But how do you compute that mathematically, perhaps using PHP? Any ideas? Thanks.

  2. #2
    SitePoint Wizard (Cups)
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    PHP Code:
    //For example sentence A:
    $a = "The quick brown fox jumped over the lazy dog.";

    //When compared to sentence B:
    $b = "The fast white fox leaped above the stubborn dog.";

    $base = explode(" ", $a);

    $compare = explode(" ", $b);

    $diffs = count(array_diff($compare, $base));

    $origin = count($base);

    echo ($origin * $diffs);
    = 45

    I left it verbose so you can work out this way of doing it; there could be better ways.

  3. #3
    SitePoint Evangelist
    Join Date
    Jun 2006
    Location
    Wigan, Lancashire. UK
    Posts
    523
    Doesn't consider position of words, but:
    PHP Code:
    $sentenceA = 'The quick brown fox jumped over the lazy dog.';
    $sA = str_word_count($sentenceA, 1);

    $sentenceB = 'The fast white fox leaped above the stubborn dog.';
    $sB = str_word_count($sentenceB, 1);

    $xRef = array_intersect($sA, $sB);

    $xResult = (count($xRef) / count($sA)) * 100;

    print_r($xRef);

    echo '<br />' . $xResult . '%';
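If word position does matter, one possible variation is to compare the two sentences slot by slot. This is only a rough sketch: the function name `positionalMatchPercent` is made up for illustration, and it assumes that lowercasing and dropping punctuation via `str_word_count()` is acceptable for your text.

```php
<?php
// Hypothetical helper (not from the thread): percentage of word positions
// that hold the same word in both sentences.
function positionalMatchPercent(string $a, string $b): float
{
    $wordsA = str_word_count(strtolower($a), 1);
    $wordsB = str_word_count(strtolower($b), 1);

    $longest = max(count($wordsA), count($wordsB));
    if ($longest === 0) {
        return 100.0; // two empty strings are treated as identical
    }

    // Only positions that exist in both sentences can match.
    $shared  = min(count($wordsA), count($wordsB));
    $matches = 0;
    for ($i = 0; $i < $shared; $i++) {
        if ($wordsA[$i] === $wordsB[$i]) {
            $matches++;
        }
    }

    return $matches / $longest * 100;
}

$sentenceA = 'The quick brown fox jumped over the lazy dog.';
$sentenceB = 'The fast white fox leaped above the stubborn dog.';

// "the", "fox", "the" and "dog" sit in the same slots: 4 of 9 positions.
echo round(positionalMatchPercent($sentenceA, $sentenceB), 1) . '%';
```

Note that a single inserted or deleted word shifts every later position, so this is stricter than the `array_intersect()` approach above.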
    ---
    Development Projects:
    PHPExcel
    PHPPowerPoint

  4. #4
    SitePoint Addict (Trent Reimer)
    Join Date
    Sep 2005
    Location
    Canada
    Posts
    228
    PHP has some built-in comparison functions you might be able to take advantage of.

    e.g. http://www.php.net/manual/en/function.similar-text.php
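As a minimal sketch of how that built-in might be used here: `similar_text()` fills its optional third argument (passed by reference) with a similarity percentage. Treating "uniqueness" as 100 minus that figure is an assumption made for this example, and keep in mind the function compares characters rather than whole words.

```php
<?php
$sentenceA = 'The quick brown fox jumped over the lazy dog.';
$sentenceB = 'The fast white fox leaped above the stubborn dog.';

// Returns the number of matching characters; $percentSimilar is set by reference.
$matchingChars = similar_text($sentenceA, $sentenceB, $percentSimilar);

// Assumption for this sketch: uniqueness is whatever similarity leaves over.
$percentUnique = 100 - $percentSimilar;

printf("similar: %.1f%%, unique: %.1f%%\n", $percentSimilar, $percentUnique);
```

Because it works on characters, two sentences with different words but many shared letters can still score as fairly similar, so the result will not line up with the word-based counts earlier in the thread.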

  5. #5
    SitePoint Wizard (Cups)
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Quote Originally Posted by me
    there could be better ways.
    ... and there were. I think mine worked by accident ...

  6. #6
    SitePoint Addict
    Join Date
    Aug 2002
    Posts
    385
    Thanks guys. I'll work on those examples.

  7. #7
    SitePoint Wizard (Cups)
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Should be:
    PHP Code:
    echo round($diffs / $origin * 100);
    Math brain fart: string B has 5 of its 9 words different from string A = ~55%, no?

    Coincidentally, this also seems to work if you mix the words round too, though maybe that is not what you want:
    PHP Code:
    //For example sentence A:
    $a = "The quick brown fox jumped over the lazy dog.";

    //When compared to sentence B:
    $b = "The fast white dog leaped above the stubborn fox.";
    = 78% wrong
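The jump to 78% appears to come from `explode(" ")` keeping the full stop attached to the last word, so "fox." and "dog" no longer match "fox" and "dog." (7 of 9 tokens differ). Here is a rough sketch that normalises the words first. The function name `uniquenessPercent` is made up, and dividing by the word count of sentence B (rather than A) is a choice made for this example.

```php
<?php
// Hypothetical helper (not from the thread): percentage of words in $b
// that do not occur anywhere in $a, ignoring case and punctuation.
function uniquenessPercent(string $a, string $b): int
{
    $base    = str_word_count(strtolower($a), 1);
    $compare = str_word_count(strtolower($b), 1);

    if (count($compare) === 0) {
        return 0; // nothing to compare
    }

    // Words of B that are missing from A.
    $diffs = count(array_diff($compare, $base));

    return (int) round($diffs / count($compare) * 100);
}

$a = "The quick brown fox jumped over the lazy dog.";
$b = "The fast white dog leaped above the stubborn fox.";

echo uniquenessPercent($a, $b) . '%'; // 5 of 9 words differ: prints 56%
```

With the punctuation stripped, swapping "fox" and "dog" gives the same result as the original sentence B, which is what you would expect from a comparison that ignores word order.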

  8. #8
    SitePoint Wizard
    Join Date
    Jul 2008
    Posts
    5,757
    http://tartarus.org/~martin/PorterStemmer/
    There's a PHP implementation.

    Using that would allow "jump", "jumped", and "jumping" to compare as the same word (but not "jumper"!). It's pretty cool.

    It won't stem jumped and leaped to the same word, but it's probably a move in the right direction still.
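To show where stemming would slot into the comparison, here is a toy sketch. Be warned that `naiveStem` is a made-up stand-in that only strips a few suffixes; it is not the Porter algorithm, and a real implementation such as the one linked above would replace it.

```php
<?php
// Toy stand-in for a real stemmer, NOT the Porter algorithm: it only strips
// a few common suffixes so the idea can be shown without external code.
function naiveStem(string $word): string
{
    $word = strtolower($word);
    foreach (['ing', 'ed', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Keep at least three letters so short words such as "red" survive.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

// Split a sentence into words and stem each one.
function stemmedWords(string $sentence): array
{
    return array_map('naiveStem', str_word_count($sentence, 1));
}

$a = stemmedWords('The fox jumped over the dog.');
$b = stemmedWords('The fox is jumping over the dogs.');

// Words of $b whose stem does not appear in $a: only "is" is left.
print_r(array_values(array_diff($b, $a)));
```

Without the stemming step, "jumping" and "dogs" would both count as different words.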

  9. #9
    SitePoint Evangelist
    Join Date
    Jun 2006
    Location
    Wigan, Lancashire. UK
    Posts
    523
    Quote Originally Posted by crmalibu
    It won't stem jumped and leaped to the same word, but it's probably a move in the right direction still.
    For that you'd need a porterstemasaurus.

    The PHP implementation of the Porter stemming algorithm works pretty well for English-language texts: I tested it out with some nice chunks of prose over my Christmas break.
    Now you've tempted me to try integrating it with a thesaurus to identify commonality of meaning even when the words used vary. I'm not sure of any practical applications, but it would be a pleasant academic exercise.
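A very rough sketch of what the thesaurus idea might look like, purely as an illustration: the `$synonyms` table below is hand-made and hypothetical, and `canonicalWords` is a made-up helper. A real thesaurus lookup would be far larger and much less clear-cut about which words are interchangeable.

```php
<?php
// Hypothetical, hand-made synonym table: each word maps to one canonical
// token. This is an assumption for the example, not a real thesaurus.
$synonyms = [
    'fast'   => 'quick',
    'leaped' => 'jumped',
    'above'  => 'over',
];

// Lowercase the sentence, split it into words, and swap known synonyms
// for their canonical form.
function canonicalWords(string $sentence, array $synonyms): array
{
    $words = str_word_count(strtolower($sentence), 1);
    return array_map(fn ($word) => $synonyms[$word] ?? $word, $words);
}

$a = canonicalWords('The quick brown fox jumped over the lazy dog.', $synonyms);
$b = canonicalWords('The fast white fox leaped above the stubborn dog.', $synonyms);

$diffs = array_diff($b, $a); // words of B with no equivalent in A

// Only "white" and "stubborn" remain: 2 of 9 words, so this prints 22%.
echo round(count($diffs) / count($b) * 100) . '%';
```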

    I've heard rumours that similar stemming techniques have been developed for other languages (French has been mentioned several times), but I haven't managed to find anything on the web about this other than speculation.