  1. #1
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)

    Intelligently parse out repeating text phrase?

    I fetch an RSS feed with 10 headlines and corresponding links.

    Each title starts with the same phrase, eg:

    $arr[0]['title'] = "Daily Reporter - News - Stop press! - This is story number one, the good stuff";
    $arr[1]['title'] = "Daily Reporter - News - Stop press! - This is story number two, just the heading";
    $arr[2]['title'] = "Daily Reporter - News - Stop press! - Daily milk rounds stop";

    And I just want to teach it to remove the marketing and give me:

    $arr[0]['title'] = "This is story number one, the good stuff";
    $arr[1]['title'] = "This is story number two, just the heading";
    $arr[2]['title'] = "Daily milk rounds stop";

    Edit:

    Samples displayed as arrays

    This RSS feed will only be one of very many, so the marketing-cruft will differ.

    For the sake of argument, the marketing-cruft will always be at the start of the string.

    The marketing-cruft will sometimes change without warning.

    The RSS is only fetched once a day and then cached, so I am not concerned about optimizing the algorithm.

    Any users clicking the text will be taken to the press site in any case, so I don't feel I am ripping them off.

    I envisage a first loop through the headlines doing some kind of diff that identifies the marketing string, then a second loop that removes the string and writes to the file.

    Any pointers on how to get PHP to cleverly identify just the marketing-cruft?
    Last edited by Cups; Jun 8, 2009 at 04:04. Reason: Made the examples arrays instead of objects, easier to play with ...

  2. #2
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadelphia, PA
    Posts
    20,580
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Select any two headlines and check, character by character, whether the strings match. The point at which they differ is the end of the common prefix. You can repeat this with more headlines until you're confident you've figured out what the prefix is (3 is probably enough).
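
A minimal sketch of the pairwise comparison described above, assuming plain single-byte text (multibyte titles would need the mb_* functions instead). The helper name common_prefix() is hypothetical, not something from the thread. Folding it over the sample headlines shows why a third headline matters: the first two alone share "This is story number " beyond the marketing prefix.

```php
<?php
// Hypothetical helper: longest common prefix of two strings,
// compared byte by byte.
function common_prefix(string $a, string $b): string
{
    $max = min(strlen($a), strlen($b));
    $i = 0;
    while ($i < $max && $a[$i] === $b[$i]) {
        $i++;
    }
    return substr($a, 0, $i);
}

$titles = [
    "Daily Reporter - News - Stop press! - This is story number one, the good stuff",
    "Daily Reporter - News - Stop press! - This is story number two, just the heading",
    "Daily Reporter - News - Stop press! - Daily milk rounds stop",
];

// Fold the pairwise comparison over the remaining headlines.
$prefix = array_reduce(array_slice($titles, 1), 'common_prefix', $titles[0]);

var_dump($prefix); // string(38) "Daily Reporter - News - Stop press! - "
```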

  3. #3
    SitePoint Wizard
    Join Date
    Nov 2005
    Posts
    1,191
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I'd say that's a pretty difficult task you've set yourself.

    Character by character won't work:
    "News - Stop - The queen declares war"
    "News - Stop - The king is defeated"

    (shoddy example, but plenty of titles on a news site will start with the same word: "Top ten" "Obama" "Police")

    I'd imagine it's a frequency thing though, eg
    100% of articles from domain.com start with "Domain News -"
    50% start with "Domain News - Sports"
    10% start with "Domain News - Sports - The All Blacks"

    So you could store frequencies and assign weights to let the script guess as to what is or isn't part of a unique title. Since you mention "teach it", how many feeds would there be, and how many different "crufts"? i.e. it could be worth it to write some code that lets you tell it absolutely that a certain string is not part of a title. Separators would probably help as well, eg '-' might only appear on average in 5% of sentences, so: 2 '-' within less than 5 words at the start etc.
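
The frequency idea above could be sketched roughly as follows. It assumes " - " is the separator and that the last segment of each title is the headline itself; prefix_frequencies() and the 100% threshold are illustrative assumptions, not part of the thread. A headline that itself contains " - " would be miscounted, which is one reason real use would need weighting.

```php
<?php
// Hypothetical helper: for every leading run of " - "-separated segments,
// report the fraction of titles that start with it.
function prefix_frequencies(array $titles, string $sep = ' - '): array
{
    $counts = [];
    foreach ($titles as $title) {
        $segments = explode($sep, $title);
        array_pop($segments); // assume the last segment is the headline
        $lead = '';
        foreach ($segments as $segment) {
            $lead .= $segment . $sep;
            $counts[$lead] = ($counts[$lead] ?? 0) + 1;
        }
    }
    $freq = [];
    foreach ($counts as $lead => $n) {
        $freq[$lead] = $n / count($titles);
    }
    return $freq;
}

$titles = [
    "Domain News - Sports - The All Blacks win",
    "Domain News - Sports - Cup final moved",
    "Domain News - Weather - Rain expected",
    "Domain News - Mayor resigns",
];
$freq = prefix_frequencies($titles);

// Keep the longest lead that every title shares (threshold is an assumption).
$cruft = '';
foreach ($freq as $lead => $share) {
    if ($share >= 1.0 && strlen($lead) > strlen($cruft)) {
        $cruft = $lead;
    }
}
var_dump($cruft); // string(14) "Domain News - "
```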

  4. #4
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    @dan Thanks, yes, that is just what I imagined I would have to do, and yes evidence of the third match would be the clincher for me.

    Thinks: maybe word-for-word comparisons would work, since headlines tend to be reused as slugs for SEO, e.g. domain.com/articles/200906/daily-milk-round-stops

    @hash Probably many scores, if not hundreds. I am only looking at your first case - 100% of them would have "Domain News - " which would be the target string to remove.

    re: Separators, I don't see them as being relevant.

    Thanks for the comments though.

    I am going to see how far I get by stripping out non-alnum chars and spaces, explode()ing each title into words and then running array_diff() on the word collections.

    Might work, I will likely have to come up with many more use cases though.

  5. #5
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    This works, in case anyone has the same problem - it's pretty hackish but should serve to show the idea.
    PHP Code:
    // your array of titles, all prefixed with a marketing splurge
    $arr = array();
    $arr[] = "The Daily Record - Sport - red shorts";
    $arr[] = "The Daily Record - Sport - blue shirts";
    $arr[] = "The Daily Record - Gardening - green fingers";

    $first  = explode(" ", $arr[0]);
    $second = explode(" ", $arr[1]);
    $third  = explode(" ", $arr[2]);

    $prefix = ''; // create a var to capture each word

    foreach ($first as $k => $v) {
        // isset() guards against a shorter title raising an undefined-offset notice
        if (isset($second[$k], $third[$k]) && $second[$k] === $v && $third[$k] === $v) {
            $prefix .= $v . ' '; // add the space back after you exploded it
        } else {
            break; // no match found, end of loop
        }
    }

    var_dump($prefix); // "The Daily Record - "

    foreach ($arr as $v) {
        // note: str_replace() removes every occurrence of $prefix, not only the leading one
        echo str_replace($prefix, '', $v), "\n";
    }
    result:

    Sport - red shorts
    Sport - blue shirts
    Gardening - green fingers
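
The same word-by-word idea can be generalised to any number of titles. The sketch below is one possible shape, not the poster's code: common_word_prefix() and strip_prefix() are hypothetical names, and strip_prefix() only removes the prefix at the start of the string, where str_replace() would remove it anywhere.

```php
<?php
// Hypothetical helper: leading words shared by every title.
function common_word_prefix(array $titles): string
{
    $titles = array_values($titles);
    if (count($titles) < 2) {
        return ''; // nothing to compare against
    }
    $rows = array_map(function ($t) {
        return explode(' ', $t);
    }, $titles);

    $prefix = '';
    foreach ($rows[0] as $k => $word) {
        foreach ($rows as $row) {
            if (!isset($row[$k]) || $row[$k] !== $word) {
                return $prefix; // first mismatch ends the prefix
            }
        }
        $prefix .= $word . ' ';
    }
    return $prefix;
}

// Hypothetical helper: remove $prefix only when it is at the very start.
function strip_prefix(string $title, string $prefix): string
{
    if ($prefix !== '' && strncmp($title, $prefix, strlen($prefix)) === 0) {
        return substr($title, strlen($prefix));
    }
    return $title;
}

$arr = [
    "The Daily Record - Sport - red shorts",
    "The Daily Record - Sport - blue shirts",
    "The Daily Record - Gardening - green fingers",
];
$prefix = common_word_prefix($arr);
foreach ($arr as $title) {
    echo strip_prefix($title, $prefix), "\n";
}
```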

  6. #6
    @php.net Salathe's Avatar
    Join Date
    Dec 2004
    Location
    Edinburgh
    Posts
    1,396
    Mentioned
    61 Post(s)
    Tagged
    0 Thread(s)
    Hi Cups,

    I'm not sure if you're actually still looking for any more comments to this thread but I had a quick bash at your problem and came up with a solution, different to yours but maybe useful—if not to you, perhaps to someone else.

    The idea is to loop through the array checking to see if the first n characters of each array item are the same throughout the entire array. Keep going until the character sequences are not identical, at which point you know the length of any common prefix strings. Then just chomp the original array's items down to length.


    PHP Code:
    $arr[] = "The Daily Record - Sport - red shorts";
    $arr[] = "The Daily Record - Sport - blue shirts";
    $arr[] = "The Daily Record - Gardening - green fingers";

    var_dump(remove_prefix($arr));

    function remove_prefix($items)
    {
        // Number of items in our array
        $items_len = count($items);

        // No common prefix if zero or 1 items in the array!
        if ($items_len <= 1)
        {
            return $items;
        }

        // Determine length of shortest item
        $limit = 9999999999;
        foreach ($items as $item)
        {
            if (($len = strlen($item)) < $limit)
            {
                $limit = $len;
            }
        }

        // Starting offsets for the mapped substr calls
        $starts = array_fill(0, $items_len, 0);

        // Algorithm to find common prefixes
        $offset = 0;
        do
        {
            // Generate array of stubs: the first ($offset + 1) characters of each item
            $stubs = array_map('substr', $items, $starts, array_fill(0, $items_len, $offset + 1));

            // If all stubs are identical, count will equal 1
            if (count(array_count_values($stubs)) !== 1)
            {
                // Found a difference, break out of loop
                break;
            }

        } while ($offset++ < $limit);

        // Generate array with common prefix removed
        return array_map('substr', $items, array_fill(0, $items_len, $offset));
    }

    Salathe
    Software Developer and PHP Manual Author.

  7. #7
    SitePoint Wizard
    heh, perhaps I was over complicating things, but what happens with this:

    PHP Code:
    $arr[] = "The Daily Record - Sport - red shorts";
    $arr[] = "The Daily Record - Sport - blue shirts";
    $arr[] = "The Daily Record - Sport - white sox";
    $arr[] = "The Daily Record - Gardening - green fingers";
    Or

    PHP Code:
    $arr[] = "The Daily Record - The bank did something";
    $arr[] = "The Daily Record - The government did something";
    $arr[] = "The Daily Record - Theatre burns down";
    $arr[] = "The Daily Record - Themes - Free Themes";
    Last edited by hash; Jun 8, 2009 at 18:21.

  8. #8
    @php.net Salathe's Avatar
    With my code, the latter would snip off "The Daily Record - The" from each string as expected (its job is to find and remove any common prefix after all).
    Salathe
    Software Developer and PHP Manual Author.
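
If cutting into the headline is not wanted, one possible refinement is to trim the character-level prefix back to the last separator. This is only a sketch under the assumption that the cruft always ends with " - "; trim_to_separator() is a hypothetical name, and feeds using other separators would need a different rule.

```php
<?php
// Hypothetical helper: cut a character-level common prefix back to the
// last separator so it never ends in the middle of a headline word.
function trim_to_separator(string $prefix, string $sep = ' - '): string
{
    $pos = strrpos($prefix, $sep);
    if ($pos === false) {
        return ''; // no separator at all: treat as "no marketing prefix"
    }
    return substr($prefix, 0, $pos + strlen($sep));
}

var_dump(trim_to_separator("The Daily Record - The")); // string(19) "The Daily Record - "
var_dump(trim_to_separator("The Daily Record - "));    // string(19) "The Daily Record - "
```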

  9. #9
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Thanks for taking the time to work that out and show another way of doing it.

    The prefix was just too long in the case I had in mind - the prefix was an incredible 58 chars long and the $title was truncated to 63, so when I had parsed out the prefix I was left with a useless 5 chars.

    The next few use cases did not even prefix their feeds - so I have back-burnered this issue for a few days.

    I am musing that if the prefix (the marketing cruft someone prefixes all their rss feed titles with at source) is longer than say, 45 chars, I can optionally go on and substitute a truncated version of the description instead.

    Take this lifelike example ( hard returns added for readability)
    PHP Code:
    $arr[0]['title'] = "The Daily Record - News - All Stop Press - 
    We Never Sleep - Cat gets ...";

    $arr[0]['description'] = "It emerged today that a cat 
    found up a tree on the high street had once belonged 
    to the muppet Kermit the frog";

    So if I work out that the prefix used on the first x feeds is "The Daily Record - News - All Stop Press - We Never Sleep - " ( ie > 45 chars) use a truncated version of the description instead;

    "It emerged today that a cat found up a tree on the high street ... ".
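
The fallback described above might look something like this. The 45-character threshold and the 63-character width come from this post; the function name display_title(), the " ..." suffix and the exact truncation rule are assumptions made for illustration.

```php
<?php
// Hypothetical helper: pick the text to display for one feed item.
// If the detected prefix is longer than $maxPrefix, fall back to the
// description; otherwise strip the prefix from the start of the title.
function display_title(array $item, string $prefix, int $maxPrefix = 45, int $width = 63): string
{
    if (strlen($prefix) > $maxPrefix && !empty($item['description'])) {
        $text = $item['description'];
    } elseif ($prefix !== '' && strncmp($item['title'], $prefix, strlen($prefix)) === 0) {
        $text = substr($item['title'], strlen($prefix));
    } else {
        $text = $item['title'];
    }

    // Truncate to the display width, leaving room for the " ..." suffix.
    if (strlen($text) > $width) {
        $text = rtrim(substr($text, 0, $width - 4)) . ' ...';
    }
    return $text;
}

$item = [
    'title'       => "The Daily Record - News - All Stop Press - We Never Sleep - Cat gets ...",
    'description' => "It emerged today that a cat found up a tree on the high street had once belonged to the muppet Kermit the frog",
];
$prefix = "The Daily Record - News - All Stop Press - We Never Sleep - ";

echo display_title($item, $prefix), "\n";
```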

    Just in case you weren't following this thread, the alternative is too ridiculous to think about, having a list of headings derived from an RSS feed which reads like this;

    The Daily Record - News - All Stop Press - We Never Sleep - Cat gets ...
    The Daily Record - News - All Stop Press - We Never Sleep - Man sho ...
    The Daily Record - News - All Stop Press - We Never Sleep - Mayor s....

    If I get round to rolling it into a function/class I'll post it back on here.

    Thanks again!

    ps If you are responsible for creating RSS feeds for others to "consume", please don't prefix your titles with your marketing nonsense, or at least keep it to a few characters; otherwise you just reduce the likelihood that someone will be tempted to click the link to your site. #seofail
    Last edited by Cups; Jun 9, 2009 at 00:39. Reason: added postscript

