# Thread: Logic Behind Article Uniqueness?

1. Hello, does anybody know the logic behind determining the uniqueness of an article as a percentage?

For example sentence A:
The quick brown fox jumped over the lazy dog.

When compared to sentence B:
The fast white fox leaped above the stubborn dog.

You could say that sentence B is about 45% unique. But how do you compute that mathematically, perhaps using PHP? Any ideas? Thanks.

2. PHP Code:
```
// For example, sentence A:
$a = "The quick brown fox jumped over the lazy dog.";
// When compared to sentence B:
$b = "The fast white fox leaped above the stubborn dog.";

$base = explode(" ", $a);
$compare = explode(" ", $b);
$diffs = count(array_diff($compare, $base));
$origin = count($base);
echo ($origin * $diffs);
```
= 45

I left it verbose so you can work out this way of doing it; there could be better ways.

3. Doesn't consider position of words, but:
PHP Code:
```
$sentenceA = 'The quick brown fox jumped over the lazy dog.';
$sA = str_word_count($sentenceA, 1);
$sentenceB = 'The fast white fox leaped above the stubborn dog.';
$sB = str_word_count($sentenceB, 1);
$xRef = array_intersect($sA, $sB);
$xResult = (count($xRef) / count($sA)) * 100;
print_r($xRef);
echo '<br />' . $xResult . '%';
```
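If word position should count, one option is to compare the two sentences slot by slot. This is only a sketch of that idea, not something posted in the thread; the function name `positionalUniqueness` is made up for illustration, and it assumes lowercasing and `str_word_count()` tokenizing are acceptable.

```php
<?php
// Hypothetical helper: percentage of word positions at which the sentences differ.
function positionalUniqueness(string $base, string $compare): float
{
    // Lowercase first so "The" and "the" match; str_word_count drops punctuation.
    $a = str_word_count(strtolower($base), 1);
    $b = str_word_count(strtolower($compare), 1);

    $length = max(count($a), count($b));
    if ($length === 0) {
        return 0.0; // nothing to compare
    }

    $differing = 0;
    for ($i = 0; $i < $length; $i++) {
        // A missing word in the shorter sentence counts as a difference.
        if (($a[$i] ?? null) !== ($b[$i] ?? null)) {
            $differing++;
        }
    }

    return $differing / $length * 100;
}

$sentenceA = 'The quick brown fox jumped over the lazy dog.';
$sentenceB = 'The fast white fox leaped above the stubborn dog.';
echo round(positionalUniqueness($sentenceA, $sentenceB)) . '%'; // 56%
```

The trade-off is that a single inserted word shifts every later position, so this measure is much stricter than the set-based comparison.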

4. PHP has some built-in comparison functions you might be able to take advantage of.

e.g. http://www.php.net/manual/en/function.similar-text.php
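As a rough illustration of that route, the sketch below uses `similar_text()` and, for comparison, `levenshtein()`. Both work on characters rather than words, so the numbers will not match the word-based approaches above; treating "100 minus similarity" as uniqueness is an assumption made here, not something the manual prescribes.

```php
<?php
$a = 'The quick brown fox jumped over the lazy dog.';
$b = 'The fast white fox leaped above the stubborn dog.';

// similar_text() writes the similarity percentage into its third argument.
similar_text($a, $b, $similarPercent);
$uniqueBySimilarText = 100 - $similarPercent;

// levenshtein() returns the number of single-character edits needed.
// Dividing by the longer length gives a rough 0-100 scale (an assumption).
// Note: before PHP 8.0, levenshtein() only accepts strings up to 255 characters.
$distance = levenshtein($a, $b);
$uniqueByLevenshtein = $distance / max(strlen($a), strlen($b)) * 100;

printf("similar_text: %.1f%% unique\n", $uniqueBySimilarText);
printf("levenshtein:  %.1f%% unique\n", $uniqueByLevenshtein);
```

For whole articles, `similar_text()` can get slow, so it is probably best suited to short strings.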

5. Originally Posted by me
there could be better ways.
... and there were. I think mine worked by accident ...

6. Thanks guys. I'll work on those examples.

7. Should be:
PHP Code:
```
echo round($diffs / $origin * 100);
```
Math brain fart: string b has 5 of its 9 words different from string a, so about 56%, no?
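Putting the ratio-based formula together with the earlier setup, a complete version might look like the sketch below. The function name `uniquenessPercent` is invented for illustration; the logic is the same `explode()` and `array_diff()` approach, divided rather than multiplied.

```php
<?php
// Hypothetical wrapper: share of words in $compare that do not appear in $base,
// expressed as a percentage of the number of words in $base.
function uniquenessPercent(string $base, string $compare): int
{
    $baseWords = explode(' ', $base);
    $compareWords = explode(' ', $compare);

    // Words present in $compare but absent from $base.
    $diffs = count(array_diff($compareWords, $baseWords));

    // explode() always returns at least one element, so this never divides by zero.
    return (int) round($diffs / count($baseWords) * 100);
}

$a = 'The quick brown fox jumped over the lazy dog.';
$b = 'The fast white fox leaped above the stubborn dog.';
echo uniquenessPercent($a, $b); // 56
```

The original `$origin * $diffs` only produced a plausible figure because 9 * 5 happens to equal 45; with sentences of other lengths it would not stay within 0 to 100.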

Coincidentally, this also seems to work if you mix the words around, though maybe that is not what you want:
PHP Code:
```
// For example, sentence A:
$a = "The quick brown fox jumped over the lazy dog.";
// When compared to sentence B:
$b = "The fast white dog leaped above the stubborn fox.";
```
= 78% wrong
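One likely reason the shuffled sentence scores 78: `explode(" ", ...)` keeps the trailing period attached, so `dog.` and `dog` (and `fox` and `fox.`) are treated as different words. The sketch below compares that behaviour with a tokenizer that drops punctuation; both function names are made up for illustration.

```php
<?php
// Same approach as above: split on spaces, punctuation stays attached.
function explodeUniqueness(string $base, string $compare): int
{
    $baseWords = explode(' ', $base);
    $diffs = count(array_diff(explode(' ', $compare), $baseWords));
    return (int) round($diffs / count($baseWords) * 100);
}

// Variant: lowercase and let str_word_count() strip punctuation first.
function normalizedUniqueness(string $base, string $compare): int
{
    $baseWords = str_word_count(strtolower($base), 1);
    if (count($baseWords) === 0) {
        return 0; // avoid dividing by zero on empty input
    }
    $compareWords = str_word_count(strtolower($compare), 1);
    $diffs = count(array_diff($compareWords, $baseWords));
    return (int) round($diffs / count($baseWords) * 100);
}

$a = 'The quick brown fox jumped over the lazy dog.';
$b = 'The fast white dog leaped above the stubborn fox.';
echo explodeUniqueness($a, $b) . "\n";    // 78
echo normalizedUniqueness($a, $b) . "\n"; // 56
```

With punctuation stripped, word order no longer affects the result, which may or may not be what you want.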

8. http://tartarus.org/~martin/PorterStemmer/
There's a PHP implementation.

Using that would allow "jump", "jumped", and "jumping" to compare as the same word (but not "jumper"!). It's pretty cool.

It won't stem jumped and leaped to the same word, but it's probably a move in the right direction still.
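To show where stemming would slot in, here is a deliberately crude stand-in. It is not the Porter algorithm: `crudeStem` is a hypothetical helper that only strips "ing" and "ed", whereas a real implementation handles many more suffixes and edge cases.

```php
<?php
// Hypothetical toy stemmer, NOT the Porter algorithm: strips "ing" or "ed"
// when at least three characters would remain.
function crudeStem(string $word): string
{
    $word = strtolower($word);
    foreach (['ing', 'ed'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

// Stem every word before comparing, so inflected forms count as the same word.
$words = ['jump', 'jumped', 'jumping', 'jumper'];
print_r(array_map('crudeStem', $words)); // jump, jump, jump, jumper
```

In practice you would apply the stemmer to both word arrays before calling `array_diff()` or `array_intersect()`.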

9. Originally Posted by crmalibu
It won't stem jumped and leaped to the same word, but it's probably a move in the right direction still.
For that you'd need a porterstemasaurus.

The PHP implementation of the Porter stemming algorithm works pretty well for English language texts: I tested it out with some nice chunks of prose over my Christmas break.
Now you've tempted me to try integrating it with a thesaurus to identify commonality of meaning even when the words used vary. I'm not sure of any practical applications, but it would be a pleasant academic exercise.
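A very rough sketch of the thesaurus idea, assuming a tiny hand-written synonym map in place of a real thesaurus. The map contents and both function names are invented purely for illustration.

```php
<?php
// Hypothetical synonym map: each word is rewritten to one canonical form.
$synonyms = ['fast' => 'quick', 'leaped' => 'jumped', 'above' => 'over'];

// Tokenize, lowercase, and replace known synonyms with their canonical word.
function canonicalWords(string $sentence, array $synonyms): array
{
    $words = str_word_count(strtolower($sentence), 1);
    return array_map(function ($word) use ($synonyms) {
        return $synonyms[$word] ?? $word;
    }, $words);
}

function synonymAwareUniqueness(string $base, string $compare, array $synonyms): int
{
    $baseWords = canonicalWords($base, $synonyms);
    if (count($baseWords) === 0) {
        return 0; // avoid dividing by zero on empty input
    }
    $compareWords = canonicalWords($compare, $synonyms);
    $diffs = count(array_diff($compareWords, $baseWords));
    return (int) round($diffs / count($baseWords) * 100);
}

$a = 'The quick brown fox jumped over the lazy dog.';
$b = 'The fast white fox leaped above the stubborn dog.';
echo synonymAwareUniqueness($a, $b, $synonyms); // 22
```

Only "white" and "stubborn" remain as differences here, since the other three changed words map back to words in sentence A.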

I've heard rumours that similar stemming techniques have been developed for other languages (French has been mentioned several times), but I haven't managed to find anything on the web about this other than speculation.
