In BYO Database Driven Web Site (4th Edn), Kevin Yank uses ‘htmlspecialchars’ to strip (or, rather, convert) tags in user-entered data. Is this better, worse, or just different from ‘strip_tags’? The latter doesn’t display the changed tags, which might be seen as an advantage, except perhaps that one then wouldn’t know they’d ever been there.
I’ve been experimenting with ways to strip (or convert) tags when reading in the variables from $_POST/$_GET, instead of later when (or if) they are echoed to screen. Is this good practice, or are there pitfalls?
Thank you. I realise they aren’t identical, and I’ve looked them up in PHP Help.
But both will, in effect, remove HTML tags from a string; htmlspecialchars converts them (as its name implies) and strip_tags does just that.
Are they both equally effective if one wants to sanitise user-entered text?
Is there a reason why one would use one in preference to the other for that purpose?
Thank you. That provides a clear rule about when one should use one or the other function. I fully embrace the need to sanitise anything that the user might be able to enter or amend en route.
I still haven’t understood why sometimes it’s better to convert tags to HTML entities and at other times to strip them altogether. For example, if a user enters his/her name on a login form (as in Chapter 3 of BYO Database Driven Web Site) surrounded with bold tags, htmlspecialchars will convert them to entities, and they’ll display as literal text on the Welcome screen without turning the name bold. strip_tags will remove the tags altogether, and the bare name will be displayed, which seems to me to be a better result, yet it’s not the accepted method.
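To make the bold-name case concrete, here is a minimal sketch of what each function returns for the same input. The sample name and variable names are made up for illustration; the flags passed to htmlspecialchars are just one common choice.

```php
<?php
// Hypothetical login-form input: a name wrapped in bold tags.
$name = '<b>Kevin</b>';

// Convert the markup characters to entities; the browser then shows the tags as text.
$converted = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// Remove the tags entirely, leaving only the text between them.
$stripped = strip_tags($name);

echo $converted, "\n"; // &lt;b&gt;Kevin&lt;/b&gt;
echo $stripped, "\n";  // Kevin
```

Neither result renders the name in bold; the difference is whether the visitor sees the tags as literal text or doesn’t see them at all.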
If that name goes on to be stored in a database, one doesn’t want the tags attached to it for evermore, so presumably strip_tags would have to be applied later anyway? (Whether strip_tags will remove entities is another matter, which I’ve not looked into yet!)
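On the entities question: strip_tags only looks for angle-bracket tags, so text that has already been through htmlspecialchars contains nothing for it to remove. A quick sketch to check this, again with a made-up sample name:

```php
<?php
// Text that has already been converted to entities.
$escaped = htmlspecialchars('<b>Kevin</b>', ENT_QUOTES, 'UTF-8');

// There are no literal < or > characters left, so nothing is stripped.
$afterStrip = strip_tags($escaped);

var_dump($afterStrip === $escaped); // bool(true)
```

So if converted text were stored, the entities would stay attached to it, which is one reason to leave the conversion until the value is actually printed.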
In short, I can’t see why one would prefer to use htmlspecialchars in the first place. I can follow a rule, of course, but I’m trying to understand why.
These two actually address two different problems. Stripping tags from the input data is part of input sanitization, where you apply rules about what can and can’t end up in your database. It’s the same logic as with numeric inputs, where you strip all characters but 0-9 from a string. Escaping the HTML, on the other hand, is part of output sanitization, and should always be applied to the content, regardless of the input sanitization logic.
So, strip_tags if you don’t want the tags in the database, but always escape your HTML prior to output. You really don’t want an XSS-friendly website.
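A minimal sketch of that split, assuming a name field: one helper applies the input rule before storage, another escapes at the point of output. The helper names sanitise_name and e are hypothetical, and $posted stands in for a value read from $_POST.

```php
<?php
// Hypothetical input rule: decide what is allowed to reach the database.
function sanitise_name(string $raw): string
{
    return trim(strip_tags($raw));
}

// Hypothetical output helper: escape for an HTML context, every time.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$posted = '<b>Tom & Jerry</b>';      // stand-in for $_POST['name']
$stored = sanitise_name($posted);    // what would be saved: Tom & Jerry

// Escape only when printing, even though the tags were already stripped.
$html = '<p>Welcome, ' . e($stored) . '</p>';
echo $html, "\n"; // <p>Welcome, Tom &amp; Jerry</p>
```

The ampersand shows why the output step is still needed after stripping: the stored value is plain text, and it only becomes HTML-safe when it is escaped for the page.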
You use htmlspecialchars where the input is allowed to contain HTML that you want to be able to display as such in the web page (or where the content can contain > and < signs for other purposes). If you don’t want any HTML that they input to be displayed then use strip_tags to remove it.
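The point about > and < signs used for other purposes can be seen with a comparison written without spaces. This is a rough sketch with a made-up comment string; the PHP manual warns that strip_tags does not validate HTML and can remove more text than expected.

```php
<?php
// Hypothetical user comment that uses < and > as comparison operators.
$comment = 'if a<b and c>d then stop';

// Escaping keeps every character and makes it safe to print.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

// Stripping treats "<b and c>" as if it were a tag and drops it.
$stripped = strip_tags($comment);

echo $escaped, "\n";  // if a&lt;b and c&gt;d then stop
echo $stripped, "\n"; // part of the original sentence is missing
```

Where the text may legitimately contain these characters, escaping preserves what the user typed, whereas stripping can silently change its meaning.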
That book is the best I have seen on PHP security. There are apparently a few things it doesn’t cover that have surfaced since it was written but hopefully Chris will eventually produce a second edition that adds those.