Hi all, I’m trying to gather information on best practices for storing HTML from arbitrary user input in the database, bearing in mind XSS attacks and any other problems that might come up.
Regarding XSS attacks, a simple way I can think of is to build a list of allowed tags that excludes the ‘script’ tag and use the following to sanitize the content:
// strip_tags() returns the cleaned string; the array form needs PHP 7.4+
$html = strip_tags($html, ['a', 'div', 'span', 'etc...']);
That still doesn’t cover ‘onclick’ attribute exploits, though, because strip_tags() leaves the attributes of allowed tags untouched. You could perhaps, somewhat hesitantly, remove those from the content with a regular expression? I can’t think of a better way.
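One alternative to a regular expression is to parse the markup with PHP's DOM extension and drop every attribute that is not on an allowlist. The sketch below is only an illustration under assumptions: the function name `sanitize_html`, the tag and attribute allowlists, and the `href` rule are all made up for this example, and it is not a complete sanitizer. For production use, a maintained library such as HTML Purifier is probably the safer choice.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: allowlist tags with strip_tags(), then remove every
 * attribute that is not explicitly allowed, using DOM instead of a regex.
 */
function sanitize_html(string $html): string
{
    $allowedTags  = ['a', 'div', 'span', 'p', 'em', 'strong']; // assumed allowlist
    $allowedAttrs = ['href', 'title'];                          // assumed allowlist

    // Step 1: drop disallowed tags (array form needs PHP 7.4+).
    $html = strip_tags($html, $allowedTags);

    // Step 2: parse inside a wrapper <div> so the fragment can be read back.
    // The <?xml ...> prefix is a common trick to make libxml treat input as UTF-8.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    if ($root === null) {
        return '';
    }

    // Step 3: walk every element and remove attributes not on the allowlist.
    foreach ($root->getElementsByTagName('*') as $el) {
        // Copy first: the attribute map is live and changes as we remove.
        foreach (iterator_to_array($el->attributes, false) as $attr) {
            $name = strtolower($attr->nodeName);
            if (!in_array($name, $allowedAttrs, true)) {
                $el->removeAttribute($attr->nodeName); // onclick, style, ...
                continue;
            }
            if ($name === 'href') {
                // Ignore whitespace/control characters, which browsers also skip.
                $url = (string) preg_replace('/[\x00-\x20]+/', '', $attr->nodeValue);
                // Assumed rule: only http(s), mailto, or links starting with / # ?
                if (!preg_match('~^(https?:|mailto:|/|#|\?)~i', $url)) {
                    $el->removeAttribute($attr->nodeName); // e.g. javascript:
                }
            }
        }
    }

    // Step 4: serialize the wrapper's children back to an HTML string.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling `sanitize_html($userInput)` before saving (or before rendering) would remove `onclick` and similar handlers along with `javascript:` links. Note that strip_tags() keeps the text inside removed tags, so the body of a stripped `<script>` ends up as plain text.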
There will also be other exploits, pitfalls, and better ways of doing things that I’m missing right now, which is why I’m asking everyone here.
I’m kind of guessing that the best practice for safely storing arbitrary HTML input is to not store arbitrary HTML input.
If you have a choice, investigate some of the Markdown processors out there. Users can still format their input and make pretty pages without having to deal with the complexities of HTML. And you have far fewer attacks to worry about, as long as the parser is set to escape or strip raw HTML, since Markdown allows inline HTML by default.
I’ve never used a Markdown parser, but I always start my search with well-known vendors. I’ve used a lot of PHPLeague packages, and it seems they have a Markdown parser available.
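For the "don't store arbitrary HTML" route, here is a minimal sketch of the plain-text version: keep the user's text exactly as typed and escape it only when it is written into a page. The function name `render_comment` is hypothetical; with a Markdown parser, the parser's render call would take the place of the escaping step.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: $storedText is the user's raw text from the database.
// Nothing is trusted as markup; escaping happens at output time.
function render_comment(string $storedText): string
{
    // Convert <, >, &, and both quote styles into HTML entities.
    $escaped = htmlspecialchars($storedText, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Keep the user's line breaks visible without allowing any other markup.
    return '<p>' . nl2br($escaped) . '</p>';
}
```

With this approach a submitted `<script>` tag is shown to readers as literal text instead of being executed.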