Unicode Regex and Paragraph Returns in Form Textarea

Enver · April 23, 2013, 11:15am

I tried using this pattern to check the submitted contents of the textarea in a form:

$pattern =  '/^[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}]{2,1500}+$/u';

However it does not allow paragraph returns – despite the inclusion of ‘Z’ which, as I understand it, searches for any kind of whitespace or invisible separator (including ‘Zl’ for line separator and ‘Zp’ for paragraph separator), and the fact that the page, form and server default charset are all set to UTF-8.

So I tried this:

$pattern =  '/^[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\r\\n]{2,1500}+$/u';

And it works now, although the paragraph returns are stripped out of the email when it is delivered. Can someone explain why I should have to use '\r\n' here? Also, how can I retain the paragraph formatting when the email is delivered?

Note: I would also appreciate any comment on the regex itself, which validates the message body of an email. The input has already been filtered through htmlspecialchars(), strip_tags() and stripslashes() and, while I would like to limit the length of the content, I’m not at all sure that I should be checking the characters used.

Antnee · April 23, 2013, 11:49am

http://php.net/manual/en/reference.pcre.pattern.modifiers.php

You may be looking for the /s (DOTALL) modifier, which makes the dot match line breaks as well as every other character.
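For reference, here is a minimal sketch of what the /s modifier changes. It only affects the dot, so a character class such as the one in the pattern above still rejects a line break unless the class itself allows it. The sample string and variable names are made up for illustration.

```php
<?php
// Hypothetical sample input: two lines separated by a single LF.
$text = "line one\nline two";

// Without /s the dot stops at the newline, so the anchored pattern fails.
$dotWithoutS = preg_match('/^.+$/', $text);             // 0

// With /s the dot also matches the newline, so the whole string matches.
$dotWithS = preg_match('/^.+$/s', $text);               // 1

// /s has no effect on a character class: "\n" is still not a letter
// (\p{L}) or a separator (\p{Z}), so this fails even with /s.
$classWithS = preg_match('/^[\p{L}\p{Z}]+$/us', $text); // 0
```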

Enver · April 23, 2013, 12:06pm

Thanks for that, Antnee: I replaced '\p{Z}\r\n' with '\s' and it has indeed solved the problem.
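As to why '\r\n' (or '\s') is needed at all: carriage return and line feed belong to the Unicode control category (Cc), not to any of the separator categories covered by \p{Z}. Only U+2028 (Zl) and U+2029 (Zp) count as line and paragraph separators there, and a textarea does not normally send those. A small sketch to check this, assuming PHP 7 or later for the "\u{...}" escapes (the variable names are illustrative only):

```php
<?php
// LF and CR are control characters, so \p{Z} does not match them.
$lfIsSeparator = preg_match('/\A\p{Z}\z/u', "\n") === 1;        // false
$crIsSeparator = preg_match('/\A\p{Z}\z/u', "\r") === 1;        // false
$lfIsControl   = preg_match('/\A\p{Cc}\z/u', "\n") === 1;       // true

// \s does match them, which is why swapping it in fixed the validation.
$crlfIsWhitespace = preg_match('/\A\s+\z/u', "\r\n") === 1;     // true

// The Unicode line separator U+2028 is what \p{Zl} actually refers to.
$lsIsSeparator = preg_match('/\A\p{Z}\z/u', "\u{2028}") === 1;  // true
```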

I guess the issue with lost paragraph formatting in the delivered email should go to another thread if I can’t get to the bottom of it.

Antnee · April 23, 2013, 12:24pm

Are you using this in a preg_replace()?

Enver · April 23, 2013, 1:49pm

I was using the pattern above in a preg_match() for validation purposes but it has occurred to me that I should be looking into a preg_replace() in order to maintain formatted paragraphs in the delivered email.

The immediate problem is that I am not sure what is left to replace in the following:

$msg = "<p>".htmlspecialchars(strip_tags(stripslashes($message['Msg'])), ENT_COMPAT, 'UTF-8')."</p>";

Solution!

Running the following line after the one above has solved the issue:

$msg = preg_replace('#([\\r\\n]\\s*?[\\r\\n]){2,}#', '</p>$0<p>', $msg);
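One thing worth knowing about that pattern: as far as I can tell it needs its group to repeat at least twice, so it fires on a blank line made of CRLF pairs ("\r\n\r\n", which is what browsers normally submit from a textarea) but not on a bare "\n\n". Below is an alternative sketch that treats any blank line as a paragraph break by splitting on \R (PCRE's "any line break" escape). The function name is hypothetical and this is only one way to do it.

```php
<?php
// Hypothetical helper: escape the text, then wrap each blank-line-separated
// block in its own <p>...</p>. Single line breaks inside a block are kept.
function message_to_html_paragraphs(string $raw): string
{
    $escaped = htmlspecialchars(trim($raw), ENT_COMPAT, 'UTF-8');

    // A paragraph break is a line break, optional whitespace, and another
    // line break. \R covers CRLF, LF and CR alike.
    $paragraphs = preg_split('/\R\s*\R/u', $escaped, -1, PREG_SPLIT_NO_EMPTY);
    if ($paragraphs === false) {
        // preg_split() signals failure with false; fall back to one block.
        $paragraphs = [$escaped];
    }

    return '<p>' . implode("</p>\n<p>", $paragraphs) . '</p>';
}

$html = message_to_html_paragraphs("First paragraph\r\n\r\nSecond paragraph");
```

With that input, $html should hold two separate <p> blocks. If single line breaks also need to survive in the HTML email, nl2br() could be applied to each block before joining.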

Antnee · April 23, 2013, 1:55pm

Hmmm… off the top of my head, personally, I think I would strip slashes and tags and then preg_replace() unwanted characters (you could actually do both of these in a preg_replace() as well, if you didn’t want to call multiple functions). It’s not until you’ve removed everything that you don’t want that you should be converting to HTML entities, and even then you’d only want to do that if you’re sending HTML emails AFAIK. If you’re doing a preg_replace() though then you don’t need to preg_match() to validate. After you’ve replaced you would simply look to ensure that the remaining string is still long enough to warrant sending an email, for example.
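A rough sketch of that order of operations: strip tags, remove unwanted characters with preg_replace(), then only check that what is left is a sensible length. The function name, the character whitelist and the 2 to 1500 limits are assumptions carried over from the pattern earlier in the thread, and stripslashes() is left out on the assumption that magic quotes are off.

```php
<?php
// Hypothetical helper: returns the cleaned message, or null if it should
// not be sent (invalid UTF-8, or too short/long after cleaning).
function clean_message(string $raw, int $min = 2, int $max = 1500): ?string
{
    $text = strip_tags($raw);

    // Remove anything that is not a letter, mark, number, punctuation,
    // separator or whitespace. Symbols such as currency signs are dropped.
    $text = preg_replace('/[^\p{L}\p{M}\p{N}\p{P}\p{Z}\s]+/u', '', $text);
    if ($text === null) {
        return null; // preg_replace() returns null on invalid UTF-8
    }

    $text = trim($text);

    // Length check in characters, not bytes; /s lets the dot span line breaks.
    if (preg_match('/\A.{' . $min . ',' . $max . '}\z/us', $text) !== 1) {
        return null;
    }

    return $text;
}

$clean = clean_message("Hello <b>world</b>!");
```

Conversion to HTML entities would then happen afterwards, and only when the result goes into an HTML email, e.g. htmlspecialchars($clean, ENT_COMPAT, 'UTF-8').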

Enver · April 23, 2013, 2:04pm

Thanks again for the prompt reply, Antnee (so prompt that I think you may have missed the edit to my last post with the solution).

I have shortened the two lines above to:

$msg = "<p>".preg_replace('#([\\r\\n]\\s*?[\\r\\n]){2,}#', '</p>$0<p>', htmlspecialchars(strip_tags(stripslashes($message['Msg'])), ENT_COMPAT, 'UTF-8'))."</p>";

That still works, but I’m not sure how I can test for the validation pattern in the same function.

Antnee · April 23, 2013, 2:48pm

Haha, I knew I’d open myself to more questions by mentioning that you could do them all in a preg_replace(). Regex is very powerful, but also very difficult to manage if you go into complicated patterns. And what one person thinks is perfect, another will completely disagree with. This is a classic example: http://www.regular-expressions.info/email.html Ultimately, just have it remove what you want it to remove. There’s nothing wrong with using a couple of PHP functions if you want to, just bear in mind that they’re not foolproof. strip_tags(), for example, won’t strip any tags that are malformed.

Like I say though, you don’t need to validate if you’ve replaced already. What would you be looking for? If you’ve already removed all invalid characters then your validation would always be successful

Enver · April 23, 2013, 5:53pm

You’re absolutely right there. For beginners it’s often very hard to know where to stop. The trouble is that security issues tend to throw my basic programming knowledge into a speed wobble and I find myself slapping all sorts of filters into the code without fully understanding the implications. The current project uses the in-built properties of HTML5 but I need to reflect those in PHP validation on the server side. And of course I will also have to look into a javascript fallback for older browsers. So I do get confused, not only with the different flavours of regex but also why I’m using them.

I’ve booked myself into a course in programming security for the coming summer – hopefully that will go some way toward settling the tremors of panic I get when dealing with the issue.

Interestingly, I already looked at the page you linked in your last post – it’s a very handy site for regex brain strains in general.

Antnee · April 23, 2013, 6:04pm

Ah, I see, you’re using the HTML5 input pattern attribute? I must admit I do like this very much. Have been using it for years, since it first gained any support at all. Frankly, I tend to leave it to the browser now and handle unsupported browsers server-side, rather than adding a third layer to have to develop and support. Kudos for doing it properly and for trying out a course too. Many wouldn’t go to that much trouble. Hopefully you’ll return to share the knowledge