  1. #1
    SitePoint Member
    Join Date
    Mar 2010
    Posts
    18

    Google News - How does it do it?

    I've posted this in here mainly because PHP is the only language I know well.

    I have news on my site, which is pretty basic: I added a couple of RSS feeds, and a script loops through them and matches each story against a list of team names. What I'm finding is that I end up with a load of duplicated stories. As you might expect, all the major sites post a similar story about the same thing. I want to group these together. I've been thinking about it for a while and can't really get a picture of the best way to go about it.

    The simplest method I came up with was this: if a story matches the same team and the same player, and was published within an hour of the original, then the two may well be about the same thing. That seemed a bit crude, to be honest.
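
As a rough sketch of that rule, assuming each story is an array with hypothetical 'team', 'player' and 'published' (Unix timestamp) keys; these names and the sample values are made up for illustration and are not taken from the actual script:

```php
<?php
// Hypothetical story shape: 'team', 'player' and 'published' are assumed keys.
function probablySameStory(array $a, array $b, int $windowSeconds = 3600): bool
{
    return $a['team'] === $b['team']
        && $a['player'] === $b['player']
        && abs($a['published'] - $b['published']) <= $windowSeconds;
}

// Made-up sample data.
$first  = ['team' => 'Leeds', 'player' => 'Smith', 'published' => 1000000];
$second = ['team' => 'Leeds', 'player' => 'Smith', 'published' => 1000000 + 1800];
$third  = ['team' => 'Leeds', 'player' => 'Smith', 'published' => 1000000 + 7200];

var_dump(probablySameStory($first, $second)); // bool(true): 30 minutes apart
var_dump(probablySameStory($first, $third));  // bool(false): 2 hours apart
```

The weakness is visible straight away: two unrelated stories about the same player in the same hour would be merged, and a follow-up published 61 minutes later would not.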

    I wondered if anyone could explain how Google News groups its stories. I've done a bit of searching on the net, and several sites explain the principle, so I can see that it finds stories that are the same, but nothing goes into how it is done technically. I'm not looking for anyone to give me code or anything like that, as I'm quite looking forward to having a go at coding it myself. I was more hoping to get a bit of a discussion going about how it can be done.

    Thanks

  2. #2
    paul_wilkins
    Join Date
    Jan 2007
    Location
    Christchurch, New Zealand
    Posts
    14,696
    You may want to start by having a look at PHP's similar_text() function.
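
A minimal sketch of what that might look like, using two made-up headlines. similar_text() returns the number of matching characters and, through its optional third argument, a similarity percentage; what percentage counts as "the same story" is something you would have to tune yourself:

```php
<?php
$a = 'Free mugs with every purchase';
$b = 'Mugs, free with every purchase';

// $matching is the count of matching characters; $percent is filled in
// by reference with a 0-100 similarity score.
$matching = similar_text($a, $b, $percent);

printf("%d matching characters, %.1f%% similar\n", $matching, $percent);
```

Note that the comparison is case-sensitive, so lowercasing both strings first may be worth trying.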

  3. #3
    AnthonySterling
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Hm, interesting...
    PHP Code:
    $headlines = array(
        'Anthony likes M&Ms',
        'Anthony likes sweets',
        'Free mugs with every purchase',
        'Sport, its a mugs game',
        'Mugs, free with every purchase',
    );

    // Maximum edit distance at which two headlines count as related.
    $tolerance = 17;

    foreach ($headlines as $headline) {
        echo '<h1>', $headline, '</h1>';
        echo '<p>Possibly related stories:-</p>';
        echo '<ul>';
        foreach ($headlines as $related) {
            // Skip the headline itself; list any other headline within tolerance.
            if ($headline !== $related && $tolerance >= levenshtein($headline, $related)) {
                echo '<li>', $related, '</li>';
            }
        }
        echo '</ul>';
    }

    Code:
    Anthony likes M&Ms
    Possibly related stories:-
        * Anthony likes sweets

    Anthony likes sweets
    Possibly related stories:-
        * Anthony likes M&Ms

    Free mugs with every purchase
    Possibly related stories:-
        * Mugs, free with every purchase

    Sport, its a mugs game
    Possibly related stories:-
        (none)

    Mugs, free with every purchase
    Possibly related stories:-
        * Free mugs with every purchase
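
To go from "possibly related" lists to actual groups, one option is a simple greedy pass. The following is only a sketch: groupHeadlines() is a hypothetical helper name, and the tolerance of 10 is an arbitrary value chosen for these sample headlines, so it would need tuning on real feeds. Each headline joins the first group whose first member is within the tolerance, otherwise it starts a new group.

```php
<?php
// Greedy grouping by edit distance (hypothetical helper, not a library function).
function groupHeadlines(array $headlines, int $tolerance): array
{
    $groups = [];
    foreach ($headlines as $headline) {
        foreach ($groups as $i => $group) {
            // Compare against the group's first headline only.
            if (levenshtein($group[0], $headline) <= $tolerance) {
                $groups[$i][] = $headline;
                continue 2; // placed; move on to the next headline
            }
        }
        $groups[] = [$headline]; // no close match: start a new group
    }
    return $groups;
}

$headlines = [
    'Anthony likes M&Ms',
    'Anthony likes sweets',
    'Free mugs with every purchase',
    'Sport, its a mugs game',
    'Mugs, free with every purchase',
];

$groups = groupHeadlines($headlines, 10);
print_r($groups);
```

With these five headlines that gives three groups: the two "Anthony" stories, the two "free mugs" stories, and the "Sport" story on its own. Because only the first member of each group is compared, the result depends on the order the stories arrive in, which may or may not matter for a news feed.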

  4. #4
    SitePoint Member
    Join Date
    Mar 2010
    Posts
    18
    Quote Originally Posted by AnthonySterling
    Hm, interesting... [levenshtein() example and output from post #3]

    That's really interesting. I had tried the levenshtein() function a while back on something different and dismissed it, but I had not considered testing the distance against a tolerance limit. From that experiment it definitely seems to offer some potential.

    I'll take a look at the similar_text() function as well. I've been looking at a couple of sites that do news comparison, and they could definitely be using some form of that levenshtein() function.

    If anybody else has any suggestions, please share them. I'll post back on the current ones in a day or two, once I've added them to my script. I know Google are notoriously secretive, but has anyone read anything about how Google News works?