Echo a paragraph containing a string

Hey, I’m looking for a bit of help. I’m starting out learning PHP and am trying to find a string in a bunch of paragraphs on a page, then echo the complete paragraph that string was found in.

Can anyone help me out with this? Links to tutorials or ideas of how I would go about it?

Thanks a lot!

I’m sure someone’s about to spew a code block out here, but I like to make people think. :stuck_out_tongue:

Consider your steps.
You’re looking for a string, so you’re going to be doing some string searching function.
You’re trying to do it on a paragraph basis, so you need to find a way to split up the paragraphs.

Logical pattern:
Divide the text into paragraphs.
Search each paragraph. (Until found? Find all?)
Output results.

Haha well that would be nice if they explained the lines of code!

I was thinking using regEx but I’m not sure if it’s the right function for the job. Also I’m not sure how to say to “get” the content between an opening and closing p tag or the content between break tags if a string is in that paragraph or block of text.

Well, I assume you’re here to learn, which is why I find it better to do it this way :stuck_out_tongue:

You could, in fact, use regex to look this up. Let me ask this: what output do you want from your script? Just the text of the paragraph the string is contained in? Or… the paragraph #? Or…?

Well I’d like to learn how to do it anyway!

I want to get just the text of the paragraph the string is contained in.

This is just a test I’m working on. I want to get the paragraphs that contain his first name, but I haven’t got much past outputting an array of any text containing his name. I guess I just need to limit it to when his name is mentioned within p tags.

<?php

$source = "http://en.wikipedia.org/wiki/Conan_O'Brien";
$html_in_site = file_get_contents($source);

$start = strpos($html_in_site, '<body>');
$end = strpos($html_in_site, '</body>', $start);
$paragraph = substr($html_in_site, $start, $end-$start+4);
$paragraph = html_entity_decode(strip_tags($paragraph, ''));

$matches = array(); 
preg_match_all('/^.*?Conan.*?$/imU', $paragraph, $matches); 


echo '<pre>' . print_r($matches, true) . '</pre>'; 
?>

Fancy seeing something ridiculously cool ?

SELECT * FROM html WHERE url = "http://en.wikipedia.org/wiki/Conan_O'Brien" AND xpath = '//p'

*Sorry StarLion :frowning:

Well, from what I see, you’ve pretty much got it. Consider that all paragraphs start with <p> and end in </p>.

Try splitting the text into an array before stripping the tags out. (or going the other way - replace the paragraph closing tags with a placeholder, strip, then split over your placeholder) Then loop over the array looking for the elements. Any time you find one, you know what the paragraph’s text was.

Thanks for that. I know what you’re talking about, but I’m not quite sure how to code that part of it. Are there any tutorials or examples you could link me to?

Well, I know Anthony’s probably gonna show me up on this one, but I’d do it this way.
And yes, I know this line is incredibly long, but if you broke it down into individual lines it’s long and repetitive :stuck_out_tongue:


$html_in_site = explode('#SNURFY#', html_entity_decode(strip_tags(str_replace("<p>", "#SNURFY#", array_shift(explode('<!-- /bodytext -->', array_pop(explode('<!-- bodytext -->', $html_in_site))))))));

PS: Yay for wikipedia putting comments in their code to help parsers.

See if you can detangle that mess enough to understand it… it’s the opposite of a fancy dinner: work from the inside out.

Cool thanks! What exactly is that #SNURFY# business? Also where in the code I posted would I place that line? Thanks again.

#SNURFY# is just ‘Here is a placeholder I don’t think will be in the text anywhere ever’ (because if your placeholder occurred in the text, you’d have an extra break at that point).

It would replace everything between the get_contents and the preg; you’d also need to add a foreach to walk across the array of paragraphs we’ve just created, run the match on each in turn, and if match(es) are found, echo the paragraph.
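To give a rough idea of the shape of that loop, here is a minimal sketch. It assumes a small hard-coded sample string in place of the downloaded page, and the names $html, $marked, $paragraphs and $found are made up for illustration:

```php
<?php
// Hypothetical sample input standing in for the downloaded page.
$html = '<p>Conan hosts a show.</p><p>Unrelated text.</p><p>More about <b>Conan</b> &amp; friends.</p>';

// Mark where each paragraph starts, strip the remaining tags, then split on the marker.
$marked = str_replace('<p>', '#SNURFY#', $html);
$paragraphs = explode('#SNURFY#', html_entity_decode(strip_tags($marked)));

$found = array();
foreach ($paragraphs as $paragraph) {
    // preg_match tests one string at a time, so each paragraph is checked separately.
    if (preg_match('/Conan/i', $paragraph)) {
        $found[] = trim($paragraph);
    }
}

// Print each matching paragraph on its own line.
echo implode("\n", $found), "\n";
```

The first element of $paragraphs is whatever came before the first <p> (an empty string here), which is harmless because it simply never matches.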

Not sure if I have this right at all?

Sorry about my slowness on the uptake on this!

<?php

$source = "http://en.wikipedia.org/wiki/Conan_O'Brien";
$html_in_site = file_get_contents($source);

$html_in_site = explode('#SNURFY#', html_entity_decode(strip_tags(str_replace("<p>", "#SNURFY#", array_shift(explode('<!-- /bodytext -->', array_pop(explode('<!-- bodytext -->', $html_in_site))))))));

preg_match_all('/^.*?Conan.*?$/imU', $html_in_site, $matches); 


echo '<pre>' . print_r($matches, true) . '</pre>'; 
?>

I’m more a fan of using strpos in cases like these:


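$pos=0;
while ($pos<strlen($text))
{
  // Find the next <p>, then the first </p> that comes after it
  if (($start=strpos($text,'<p>',$pos))===false || ($end=strpos($text,'</p>',$start+3))===false)
    break;
  // Everything between the two tags (3 is the length of '<p>')
  $contents=substr($text,$start+3,$end-$start-3);
  if (strpos($contents,$searchWord)!==false)
    echo $contents;
  // Carry on after the closing tag (4 is the length of '</p>')
  $pos=$end+4;
}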

Where $text is the text to search and $searchWord is the word to look for.

If you want it case insensitive:


$pos=0;
// Search in lowercased copies, but echo from the original $text so its case is preserved
$searchText=strtolower($text);
$searchWord=strtolower($searchWord);
while ($pos<strlen($searchText))
{
  // Find the next <p>, then the first </p> that comes after it
  if (($start=strpos($searchText,'<p>',$pos))===false || ($end=strpos($searchText,'</p>',$start+3))===false)
    break;
  $contents=substr($searchText,$start+3,$end-$start-3);
  if (strpos($contents,$searchWord)!==false)
    echo substr($text,$start+3,$end-$start-3);
  $pos=$end+4;
}

Looks scary, but it works :slight_smile:
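If you would rather collect the paragraphs than echo them straight away, the same loop can be wrapped in a function. This is only a sketch: the function name paragraphsContaining and the $sample string are made up for illustration, and it swaps the strtolower calls for stripos, PHP's built-in case-insensitive version of strpos.

```php
<?php
// Hypothetical helper: returns the contents of every <p>...</p> that contains $searchWord.
function paragraphsContaining($text, $searchWord)
{
    $found = array();
    $pos = 0;
    while (($start = strpos($text, '<p>', $pos)) !== false) {
        $end = strpos($text, '</p>', $start + 3);
        if ($end === false) {
            break; // unclosed paragraph, stop scanning
        }
        $contents = substr($text, $start + 3, $end - $start - 3);
        // stripos ignores case, so 'conan' also matches 'Conan' and 'CONAN'
        if (stripos($contents, $searchWord) !== false) {
            $found[] = $contents;
        }
        $pos = $end + 4; // continue after the closing tag
    }
    return $found;
}

// Made-up sample text, just to show the call.
$sample = '<p>Conan was born in 1963.</p><p>Nothing here.</p><p>Fans call him CONAN.</p>';
print_r(paragraphsContaining($sample, 'conan'));
```

Like the loops above, it only recognises a bare <p>, so paragraphs written as <p class="..."> would be skipped.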

@ScallioXTX, I gave that a whirl and it didn’t seem to work right. I think maybe I didn’t match it up properly.

Edit: and thanks!

<?php

$source = "http://en.wikipedia.org/wiki/Conan_O'Brien";
$html_in_site = file_get_contents($source);
$searchWord = "Conan";

$start = strpos($html_in_site, '<body>');
$end = strpos($html_in_site, '</body>', $start);
$paragraph = substr($html_in_site, $start, $end-$start+4);
$paragraph = html_entity_decode(strip_tags($paragraph, '<p></p>'));

$pos=0; 
while ($pos<strlen($paragraph)) 
{ 
  if (($start=strpos($paragraph,'<p>',$pos))===false || ($end=strpos($text,'</p>',$pos+3))===false) 
    break; 
  $contents=substr($paragraph,$start+3,$end-$start-3); 
  if (strpos($contents,$searchWord)!==false) 
    echo $contents; 
  $pos=$end+4; 
}  


?>

Aaah, but you’re mixing two methods. No, that won’t work :slight_smile:
Try this:

<?php
$source = "http://en.wikipedia.org/wiki/Conan_O'Brien";
$html_in_site = file_get_contents($source);
$searchWord = "Conan";

$start = strpos($html_in_site, '<body>');
$end = strpos($html_in_site, '</body>', $start+6);
// Start after the 6 characters of '<body>' so the length below covers exactly the body contents
$paragraph = substr($html_in_site, $start+6, $end-$start-6);
$paragraph = html_entity_decode($paragraph);

$pos=0; 
while ($pos<strlen($paragraph)) 
{ 
  // Find the next <p>, then the first </p> that comes after it
  if (($start=strpos($paragraph,'<p>',$pos))===false || ($end=strpos($paragraph,'</p>',$start+3))===false) 
    break; 
  $contents=substr($paragraph,$start+3,$end-$start-3); 
  if (strpos($contents,$searchWord)!==false) 
    echo $contents, "\n"; 
  $pos=$end+4; 
}  

Wow thank you! :slight_smile: That does the job, now to go through it and learn exactly how it works! Thanks again!

Hi LindenWalsh, welcome to Sitepoint. I know that you’ve found a solution that you’re happy with, but you did mention “the right function for the job”. PHP has tools available specifically for making sense out of HTML documents without needing to get down and dirty with regular expressions or looking at the HTML with basic string functions to get the information that you are looking for.

One such tool is the DOM extension (Document Object Model), which provides neat ways to move around an HTML document to find what we need. A basic example of searching through all of the paragraph elements on the page and outputting the 19 paragraphs (at the time of writing this) which contain the word Conan might look something like the following:


<?php
    
$doc = new DOMDocument;
$doc->loadHTMLFile("http://en.wikipedia.org/wiki/Conan_O'Brien");

$paras = $doc->getElementsByTagName('p');
foreach ($paras as $para) {
    if (strpos($para->textContent, "Conan") !== FALSE) {
        echo $doc->saveXML($para);
    }
}

?>
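If only the text of each paragraph is wanted rather than its markup, a variation along these lines might do. It is a sketch under a few assumptions: the $html sample string is made up (in practice it could come from file_get_contents), the $texts array is just for illustration, and libxml_use_internal_errors is used to keep warnings about imperfect markup out of the output.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = '<html><body><p>Conan <img src="x.png" alt=""> on stage.</p><p>Other text.</p></body></html>';

$doc = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings quietly instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$texts = array();
foreach ($doc->getElementsByTagName('p') as $para) {
    if (strpos($para->textContent, 'Conan') !== false) {
        // textContent holds only the text nodes, so nested tags such as <img> are left out
        $texts[] = trim($para->textContent);
    }
}

// Print each matching paragraph's text on its own line.
echo implode("\n", $texts), "\n";
```

Echoing textContent instead of saveXML($para) means images, links and other child elements never reach the output, so there is nothing left to strip afterwards.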

Cool thanks! That seems easier to understand as well and less code is better. Thanks to all of you though for the help!

Is there a way to set limits on what kind of content comes through? I know I could use strip_tags to remove images and scripts, but is there a more efficient way to do that?

Maybe, what kind of limits?