Parsing a text field into paragraphs

I am struggling to parse the contents of a text area into paragraphs.

There are various issues, such as what defines a new paragraph. The text in this text area may be hand-typed by the client or cut and pasted from a Word document or similar.

Here is my latest attempt at the code, using just a basic text area field; however, this does not give the correct result.



if (strtoupper(substr(PHP_OS,0,3)=='WIN')) {
  $eol="\r\n";
} elseif (strtoupper(substr(PHP_OS,0,3)=='MAC')) {
  $eol="\r";
} else {
  $eol="\n";
}

$article = $_POST["article"];

$article = preg_replace("/\r\n/", "\n", $article);

$paragraphs = explode($eol, $article);

$howManyParagraphs = count($paragraphs);

echo "how many paragraphs = " . $howManyParagraphs;
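One thing that may be tripping this up: PHP_OS (and PHP_EOL) describe the server PHP runs on, not the line endings in the submitted text, so $eol may not match what is in $article. Below is a minimal sketch that normalises the line endings first and then splits on blank lines. It assumes that a blank line is what separates paragraphs (a single line break stays inside a paragraph); splitParagraphs is a hypothetical helper name, not something from the original code.

```php
<?php
// Hypothetical helper: split plain text-area input into paragraphs.
// Assumption: paragraphs are separated by one or more blank lines.
function splitParagraphs(string $text): array
{
    // Normalise Windows (\r\n) and old Mac (\r) line endings to \n.
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Split wherever a line break is followed by optional whitespace
    // and another line break, i.e. on blank lines.
    $parts = preg_split('/\n\s*\n/', trim($text));

    // Trim each piece and drop any that ended up empty.
    $parts = array_map('trim', $parts);
    return array_values(array_filter($parts, 'strlen'));
}

// Sample input standing in for $_POST["article"].
$article = "First paragraph.\r\n\r\nSecond paragraph,\r\nstill second.\r\n\r\n\r\nThird.";

$paragraphs = splitParagraphs($article);

echo "how many paragraphs = " . count($paragraphs); // 3
```

If the client expects every single line break to start a new paragraph, the split pattern would need to be '/\n+/' instead; that is a content decision rather than a technical one.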


And here is my attempt when using the TinyMCE editor. This seems to work a bit better but not 100% of the time.



$article = $_POST["article"];

preg_match_all("/(<h.>.*<\/h.>)*<p>.*<\/p>/iU", $article, $paragraphs);

$howManyParagraphs = count($paragraphs);

echo "how many paragraphs = " . $howManyParagraphs;
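A possible reason the count looks wrong here: with the default flags, preg_match_all fills $paragraphs with one entry per capture group (plus one for the full match), so count($paragraphs) reports the number of groups, not the number of paragraphs. The return value of preg_match_all, or count($paragraphs[0]), gives the number of matches. A small sketch with made-up sample HTML in place of the posted article:

```php
<?php
// Sample markup standing in for $_POST["article"].
$article = '<h2>Title</h2><p>One</p><p>Two</p><p>Three</p>';

// Same pattern as above; the return value is the number of full matches.
$howManyMatches = preg_match_all("/(<h.>.*<\/h.>)*<p>.*<\/p>/iU", $article, $paragraphs);

$groupCount = count($paragraphs);     // full match + 1 capture group
$matchCount = count($paragraphs[0]);  // one entry per matched paragraph

echo "matches = $howManyMatches, groups = $groupCount, paragraphs = $matchCount";
// matches = 3, groups = 2, paragraphs = 3
```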


Any advice appreciated

Thanks

Paul

Sounds about right to me! :smiley:

Oops sorry Anthony - I missed the notification that there was a reply.

Thank you so much for the code snippet - I will give it a go.

I am coming round to the idea that I am turning this into something really complex and should simply have had 2 text fields. If there is nothing in the second text field, there is no “Read more” link. Sounds a lot simpler than all this parsing malarkey and probably easier for the client too. Hmm, I am sure you recommended such a solution :)


<?php
$string = '
<p id="bar">Foo</p>
<p>Foo</p>
<p class="bar">Foo</p>
<h4>foo</h4>
';

echo preg_match_all('~<p[^>]*>([^<]*)</p>~i', $string, $m); #3
?>

Or…


<?php
$string = '
<p id="bar">Foo</p>
<p>Foo</p>
<p class="bar">Foo</p>
<h4>foo</h4>
';

$doc = new SimpleXMLElement(sprintf('<root>%s</root>', $string));
echo count($doc->xpath('//p')); #3
?>

:slight_smile:
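One caveat with the SimpleXMLElement route is that it needs well-formed XML, and editor output such as &nbsp; is not a defined XML entity. As a variation (not what was posted above), here is a rough sketch using DOMDocument::loadHTML, which is more forgiving of real-world HTML. It assumes the DOM extension is available, and the sample string is invented for illustration.

```php
<?php
// Sample editor output, including an "empty" paragraph holding only &nbsp;.
$string = '<p id="bar">Foo</p><p>&nbsp;</p><p class="bar">Foo <b>bold</b></p><h4>foo</h4>';

$doc = new DOMDocument();

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($string);
libxml_clear_errors();

$count = 0;
foreach ($doc->getElementsByTagName('p') as $p) {
    // &nbsp; is decoded to U+00A0, which trim() does not strip,
    // so turn it into a normal space before trimming.
    $text = trim(str_replace("\xC2\xA0", ' ', $p->textContent));
    if ($text !== '') {
        $count++;
    }
}

echo $count; // 2
```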

The following is close but no cigar:



$howManyMatches = preg_match("#<p[^>]*>(.*)</p>#isU", $article,$paragraphs);



OK, there is a slight flaw in my solution: it does not detect paragraphs that have a class.

My code is as follows:



$article = $_POST["article"];

// Remove paragraphs that are just empty space
$article = str_replace("<p>&nbsp;</p>", "", $article);

$howManyParagraphs = preg_match("/<p>(.*)<\/p>/",$article,$paragraphs);


The above works fine if the paragraphs use plain <p></p> tags but if they use say

<p class="blah"></p>

It doesn’t work.

So I would like help adjusting my preg_match pattern to deal with this alternative paragraph format.

Thanks

Paul
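For what it is worth, one way the pattern could be loosened is to allow any attributes after the p and to use preg_match_all, so that every paragraph is counted rather than only the first. This is only a sketch: the sample markup is invented, and regular expressions will still stumble on nested or malformed HTML.

```php
<?php
// Sample markup standing in for $_POST["article"].
$article = '<p>One</p><p>&nbsp;</p><p class="blah">Two <em>x</em></p><p id="c">Three</p>';

// Remove paragraphs that hold only whitespace or &nbsp;, with or without attributes.
$article = preg_replace('~<p\b[^>]*>(?:\s|&nbsp;)*</p>~i', '', $article);

// <p\b[^>]*> accepts <p>, <p class="..."> and so on, but not <pre>.
// (.*?) is lazy so each match stops at the nearest </p>;
// the s flag lets . span line breaks inside a paragraph.
$howManyParagraphs = preg_match_all('~<p\b[^>]*>(.*?)</p>~is', $article, $paragraphs);

echo "how many paragraphs = " . $howManyParagraphs; // 3

// $paragraphs[1] holds the inner HTML of each paragraph.
```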

Thanks Anthony I will check that out.

Salathe - I guess it really ought to be a carriage return.

However, I have had another stab at doing this with the TinyMCE editor, and the following seems to work, though I haven’t tested it exhaustively.



$article = $_POST["content"];

$article = str_replace("<p>&nbsp;</p>", "", $article);

$howManyMatches = preg_match_all("/<p>(.*)<\/p>/",$article,$paragraphs);

echo "How many matches = " . $howManyMatches;


So, what defines a new paragraph?

Check out PHP_EOL, Paul. This simplified pattern matches text that is not part of an HTML tag.


(?<=^|>)[^><]+?(?=<|$)
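To show how that pattern might be used from PHP, here is a small sketch with preg_match_all. The sample string is made up, and note that any whitespace sitting between tags would also be matched as "text", so real input may need the results trimmed and filtered.

```php
<?php
// Sample markup with no whitespace between the tags.
$html = '<p id="bar">Foo</p><p>Bar baz</p>';

// The lookbehind requires the start of the string or a closing ">" before the text;
// the lookahead requires an opening "<" or the end of the string after it.
preg_match_all('/(?<=^|>)[^><]+?(?=<|$)/', $html, $m);

print_r($m[0]); // the two text runs: "Foo" and "Bar baz"
```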