Thread: Logic Behind Article Uniqueness?
-
Mar 12, 2009, 07:52 #1
- Join Date
- Aug 2002
- Posts
- 385
- Mentioned
- 0 Post(s)
- Tagged
- 0 Thread(s)
Logic Behind Article Uniqueness?
Hello, does anybody know the logic behind determining the uniqueness of an article as a percentage?
For example sentence A:
The quick brown fox jumped over the lazy dog.
When compared to sentence B:
The fast white fox leaped above the stubborn dog.
You could say that sentence B is about 45% unique. But how do you compute that mathematically, perhaps using PHP? Any ideas? Thanks.
-
Mar 12, 2009, 09:02 #2
- Join Date
- Oct 2006
- Location
- France, deep rural.
- Posts
- 6,869
- Mentioned
- 17 Post(s)
- Tagged
- 1 Thread(s)
PHP Code:
// For example sentence A:
$a = "The quick brown fox jumped over the lazy dog.";
// When compared to sentence B:
$b = "The fast white fox leaped above the stubborn dog.";
// Split both sentences into words on spaces
$base = explode(" ", $a);
$compare = explode(" ", $b);
// Count the words in B that do not appear anywhere in A
$diffs = count(array_diff($compare, $base));
$origin = count($base);
// Prints 45 (9 * 5), a product rather than a percentage; post #7 gives the corrected line
echo ($origin * $diffs);
I left it verbose so you can follow how it works; there could be better ways.
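The same idea can be wrapped in a function that returns a percentage directly. This is only a sketch: the name `uniqueness_percent` is made up for illustration, and dividing by the word count of the base sentence simply follows the post above (dividing by the candidate's word count would be an equally defensible choice).

```php
<?php
// Hypothetical helper; the thread itself only shows inline code.
function uniqueness_percent(string $base, string $candidate): float
{
    // Split on single spaces, exactly as in the post above.
    $baseWords = explode(" ", $base);
    $candidateWords = explode(" ", $candidate);

    // Words of the candidate that never occur in the base sentence.
    $diffs = count(array_diff($candidateWords, $baseWords));

    // explode() always returns at least one element, so no division by zero.
    return $diffs / count($baseWords) * 100;
}

$a = "The quick brown fox jumped over the lazy dog.";
$b = "The fast white fox leaped above the stubborn dog.";

echo round(uniqueness_percent($a, $b)), "%\n"; // 5 of 9 words differ: prints 56%
```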
-
Mar 12, 2009, 09:04 #3
- Join Date
- Jun 2006
- Location
- Wigan, Lancashire. UK
- Posts
- 523
- Mentioned
- 0 Post(s)
- Tagged
- 0 Thread(s)
Doesn't consider position of words, but:
PHP Code:
$sentenceA = 'The quick brown fox jumped over the lazy dog.';
$sA = str_word_count($sentenceA,1);
$sentenceB = 'The fast white fox leaped above the stubborn dog.';
$sB = str_word_count($sentenceB,1);
$xRef = array_intersect($sA,$sB);
$xResult = (count($xRef) / count($sA)) * 100;
print_r($xRef);
echo '<br />'.$xResult.'%';
-
Mar 12, 2009, 09:05 #4
- Join Date
- Sep 2005
- Location
- Canada
- Posts
- 228
- Mentioned
- 0 Post(s)
- Tagged
- 0 Thread(s)
PHP has some built-in comparison functions you might be able to take advantage of.
e.g. http://www.php.net/manual/en/function.similar-text.php
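A minimal sketch of how `similar_text()` could be applied to the two example sentences. Note that it compares characters rather than whole words, so its figure will not match the word-based approaches above; treating `100 - similarity` as "uniqueness" is an assumption made here for illustration.

```php
<?php
$a = "The quick brown fox jumped over the lazy dog.";
$b = "The fast white fox leaped above the stubborn dog.";

// similar_text() returns the number of matching characters and writes a
// similarity percentage into the third, by-reference argument.
$matchingChars = similar_text($a, $b, $similarity);

// Assumption: read "uniqueness" as the complement of similarity.
$uniqueness = 100 - $similarity;

printf("similar: %.1f%%, unique: %.1f%%\n", $similarity, $uniqueness);
```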
-
Mar 12, 2009, 09:08 #5
- Join Date
- Oct 2006
- Location
- France, deep rural.
- Posts
- 6,869
- Mentioned
- 17 Post(s)
- Tagged
- 1 Thread(s)
Originally Posted by me
-
Mar 12, 2009, 09:13 #6
- Join Date
- Aug 2002
- Posts
- 385
- Mentioned
- 0 Post(s)
- Tagged
- 0 Thread(s)
Thanks guys. I'll work on those examples.
-
Mar 12, 2009, 09:33 #7
- Join Date
- Oct 2006
- Location
- France, deep rural.
- Posts
- 6,869
- Mentioned
- 17 Post(s)
- Tagged
- 1 Thread(s)
Should be:
PHP Code:
echo round($diffs / $origin * 100);
Coincidentally, this also seems to work well if you mix the words round too, though maybe that is not what you want:
PHP Code:
// For example sentence A:
$a = "The quick brown fox jumped over the lazy dog.";
//When compared to sentence B:
$b = "The fast white dog leaped above the stubborn fox.";
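One likely reason the swapped sentence scores differently is punctuation: splitting on spaces leaves the full stop attached to the last word, so "fox." and "fox" count as different words. The sketch below compares that with `str_word_count()`, which drops the punctuation; which behaviour is wanted depends on whether word order should matter at all.

```php
<?php
$a = "The quick brown fox jumped over the lazy dog.";
$b = "The fast white dog leaped above the stubborn fox.";

// Raw split on spaces: "dog" vs "dog." and "fox." vs "fox" do not match.
$rawDiff = array_diff(explode(" ", $b), explode(" ", $a));

// str_word_count(..., 1) returns the words without the trailing period,
// so the swapped "dog" and "fox" match again.
$cleanDiff = array_diff(str_word_count($b, 1), str_word_count($a, 1));

echo count($rawDiff), "\n";   // prints 7
echo count($cleanDiff), "\n"; // prints 5
```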
-
Mar 12, 2009, 10:07 #8
- Join Date
- Jul 2008
- Posts
- 5,757
- Mentioned
- 0 Post(s)
- Tagged
- 0 Thread(s)
http://tartarus.org/~martin/PorterStemmer/
There's a PHP implementation.
Using that would allow "jump", "jumped", and "jumping" to compare as the same word (but not "jumper"!). It's pretty cool.
It won't stem jumped and leaped to the same word, but it's probably a move in the right direction still.
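To give a feel for what stemming does before comparison, here is a deliberately crude suffix stripper. It is not the Porter algorithm and not the linked implementation; `crude_stem` is a made-up name, and the three suffixes and the minimum stem length of three letters are arbitrary choices for this sketch.

```php
<?php
// Toy stand-in for a real stemmer: strips one of a few common suffixes.
// NOT the Porter algorithm; illustration only.
function crude_stem(string $word): string
{
    $word = strtolower($word);
    foreach (['ing', 'ed', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Keep at least three letters so short words like "red" survive.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

foreach (['jump', 'jumped', 'jumping', 'jumper', 'leaped'] as $word) {
    echo $word, ' => ', crude_stem($word), "\n";
}
```

Running each word of both sentences through such a function before `array_diff()` would let "jumped" and "jumping" match, while "jumped" and "leaped" would still count as different.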
-
Mar 12, 2009, 11:49 #9
- Join Date
- Jun 2006
- Location
- Wigan, Lancashire. UK
- Posts
- 523
- Mentioned
- 0 Post(s)
- Tagged
- 0 Thread(s)
For that you'd need a porterstemasaurus.
The PHP implementation of the Porter stemming algorithm works pretty well for English language texts: I tested it out with some nice chunks of prose over my Christmas break.
Now you've tempted me to try integrating it with a thesaurus, to identify commonality of meaning even when the words used vary. Not sure of any practical applications, but a pleasant academic exercise.
I've heard rumours that there have been similar techniques developed for stemming in other languages (French has been mentioned several times) but haven't managed to find anything on the web about this other than speculations.
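The thesaurus idea could be sketched roughly as below. The synonym table is hand-written and hypothetical, covering only the example sentences; a real thesaurus would have to come from an external data source, and `canonical_words` is a name invented for this illustration.

```php
<?php
// Hypothetical synonym table: maps each word to one canonical form.
$synonyms = [
    'fast'   => 'quick',
    'leaped' => 'jumped',
    'above'  => 'over',
];

// Lowercase the sentence, split it into words, and replace known synonyms.
function canonical_words(string $sentence, array $synonyms): array
{
    $words = str_word_count(strtolower($sentence), 1);
    return array_map(
        function ($word) use ($synonyms) {
            return $synonyms[$word] ?? $word;
        },
        $words
    );
}

$a = canonical_words("The quick brown fox jumped over the lazy dog.", $synonyms);
$b = canonical_words("The fast white fox leaped above the stubborn dog.", $synonyms);

// Words of B with no counterpart in A once synonyms are folded together.
$unmatched = array_values(array_diff($b, $a));
print_r($unmatched); // leaves "white" and "stubborn"
```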