  1. #1
    SitePoint Member
    Join Date
    Apr 2005
    Posts
    1

    String Intersections?

    As a pet project I recently started developing a search engine for a single domain. I do not expect it to be very complex in the end, but I would like it to have some of the features of a full search engine. I recently finished most of the project and, ironically, got stuck on what I originally figured would be a trivial part of the engine: generating dynamic summaries.

    Over the weekend I decided to start working on the summaries. It did not take long to write a piece of code that extracted each keyword along with the four words before and after it, and then concatenated the resulting nine-word strings with ellipses between them. I was going to expand the functionality to find all occurrences of the keywords and build the summary from the ones that were closest together, but then I ran into a little problem.


    PHP Code:
    <?php

    function summarize( $text, $keywords ) {
        // Strip HTML tags before matching.
        $text = preg_replace( '/\<(.*?)\>/', '', $text );

        $match = array();
        $sentences = array();
        $i = 0;
        foreach( $keywords as $word ) {
            // Grab up to four words on each side of the keyword.
            // $match is passed without "&": call-time pass-by-reference is an error since PHP 5.4.
            if ( preg_match( '/(\w*\s*){0,4}' . $word . '(\s*\w*){0,4}/i', $text, $match ) ) {
                $sentences[$i] = preg_replace( '/' . $word . '/i', "<b>$word</b>", $match[0] );
                $i++;
            }
        }

        $summary = "";
        foreach( $sentences as $sentence ) {
            $summary .= $sentence . '... ';
        }
        //$summary = substr( $summary, 0, -4 );
        return $summary;
    }

    //Example use
    /* echo summarize( "The town of Tabuko was located near the corner of a river and the lake of Ba-i which was made bancas or raft as the common means of transportation going to the town of Tabuko. There were many trees of kabuyaw growing around the area. The fruit of kabuyaw was used as shampoo. So, when the priest asked for the name of the place", array( 'located', 'near', 'corner' ) ); */


    ?> 
    The problem is that if the keywords are too close together, the segments will overlap. I am looking for advice on a way to find the intersection of two strings and merge them if one exists, with a fairly decent runtime. I am open to any ideas that would help achieve this.
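To make the question concrete, here is a minimal sketch of one way to merge two segments when the end of the first repeats the start of the second. The helper name `mergeOverlap` is hypothetical and not part of the code above. It compares whole words rather than characters, so a partial word never counts as an overlap.

```php
<?php
// Hypothetical helper, not from the original post.
// Returns the merged string when the last k words of $a equal the
// first k words of $b (largest k wins), or null when they share nothing.
function mergeOverlap(string $a, string $b): ?string
{
    $wa = preg_split('/\s+/', trim($a));
    $wb = preg_split('/\s+/', trim($b));

    // Try the longest possible overlap first and shrink from there.
    for ($k = min(count($wa), count($wb)); $k > 0; $k--) {
        if (array_slice($wa, -$k) === array_slice($wb, 0, $k)) {
            // Keep all of $a, then append the part of $b after the overlap.
            return implode(' ', array_merge($wa, array_slice($wb, $k)));
        }
    }
    return null;
}

echo mergeOverlap('was located near the corner of', 'near the corner of a river');
// prints "was located near the corner of a river"
```

This is quadratic in the number of words, which should be acceptable for segments of about nine words. It does not detect a segment that sits entirely inside the other, and it can merge two segments that merely happen to share words at the seam, so treat it as a starting point only.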

  2. #2
    SitePoint Member
    Join Date
    Jan 2005
    Location
    chennai
    Posts
    14
    Hi,
    Could you please post the expected output for the example you gave?

  3. #3
    SitePoint Wizard stereofrog
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Well, let's try this:

    PHP Code:

    function summarize( $text, $keywords, $len ) {
        $s_buf = array();
        // Word boundaries wrap the whole alternation, not just the last keyword.
        $s_kw  = "\b(" . implode( "|", $keywords ) . ")\b";

        $text = preg_replace_callback(
            // Up to $len words on each side of a keyword; matches never overlap.
            "~(\w+\W+){0,$len}$s_kw(\W+\w+){0,$len}~",
            // A closure replaces create_function(), which was removed in PHP 8.
            function ( $m ) use ( &$s_buf, $s_kw ) {
                $p = preg_replace(
                    "~$s_kw~",
                    "<em>$0</em>",
                    $m[0]
                );
                $s_buf[] = $p;
                return "<span>$p</span>";
            },
            $text
        );
        echo "<style>span {background:#e0e0e0}</style>";
        echo implode( '...', $s_buf );
        echo "<hr>";
        echo $text;
    }

    $text = "
    Among other public buildings in a certain town, which for many reasons it will be prudent to refrain from mentioning, and to which I will assign no fictitious name, there is one anciently common to most towns, great or small: to wit, a workhouse; and in this workhouse was born; on a day and date which I need not trouble myself to repeat, inasmuch as it can be of no possible consequence to the reader, in this stage of the business at all events; the item of mortality whose name is prefixed to the head of this chapter.
    For a long time after it was ushered into this world of sorrow and trouble, by the parish surgeon, it remained a matter of considerable doubt whether the child would survive to bear any name at all; in which case it is somewhat more than probable that these memoirs would never have appeared; or, if they had, that being comprised within a couple of pages, they would have possessed the inestimable merit of being the most concise and faithful specimen of biography, extant in the literature of any age or country.
    "
    ;

    summarize(
        $text,
        array( 'that', 'and', 'the' ),
        4 // words of context per side; an assumed value, since the call as posted left $len out
    );
    Looks quite ugly, though.
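Another way to sidestep the overlap problem is to work with word positions instead of strings: record a window of word indexes around each keyword hit, merge windows that overlap or touch, and only then turn them back into text. The sketch below illustrates that idea. The name `summarizeMerged` and the `$context` parameter are assumptions made for this example and are not taken from either post.

```php
<?php
// Illustrative sketch; function and parameter names are hypothetical.
function summarizeMerged(string $text, array $keywords, int $context = 4): string
{
    $words  = preg_split('/\s+/', trim(strip_tags($text)), -1, PREG_SPLIT_NO_EMPTY);
    $wanted = array_flip(array_map('strtolower', $keywords));

    // Lower-case a word and drop surrounding punctuation before comparing.
    $bare = function (string $w): string {
        return strtolower(trim($w, ".,;:!?\"'()"));
    };

    // 1. One [start, end] window of word indexes per keyword hit, in text order.
    $windows = [];
    foreach ($words as $i => $word) {
        if (isset($wanted[$bare($word)])) {
            $windows[] = [max(0, $i - $context), min(count($words) - 1, $i + $context)];
        }
    }

    // 2. Merge windows that overlap or touch; they are already sorted by start.
    $merged = [];
    foreach ($windows as [$start, $end]) {
        $last = count($merged) - 1;
        if ($last >= 0 && $start <= $merged[$last][1] + 1) {
            $merged[$last][1] = max($merged[$last][1], $end);
        } else {
            $merged[] = [$start, $end];
        }
    }

    // 3. Turn each merged window back into text, highlighting the keywords.
    $parts = [];
    foreach ($merged as [$start, $end]) {
        $slice = array_slice($words, $start, $end - $start + 1);
        $slice = array_map(function ($w) use ($wanted, $bare) {
            return isset($wanted[$bare($w)]) ? "<b>$w</b>" : $w;
        }, $slice);
        $parts[] = implode(' ', $slice);
    }
    return implode(' ... ', $parts);
}

echo summarizeMerged(
    'The town of Tabuko was located near the corner of a river and the lake',
    ['located', 'near', 'corner'],
    2
);
// prints "Tabuko was <b>located</b> <b>near</b> the <b>corner</b> of a"
```

Because every hit is an index range, merging is a single linear pass and no string comparison is needed. The trade-off is that the text is split on whitespace, so the original spacing and line breaks are not preserved.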

