I’ve been programming in various languages, building fairly stripped-down, streamlined web applications. I do a lot of custom work and UX design to ensure that users can do everything they need to do, while maximizing the amount of work someone can get done with the minimum amount of effort. The result is fairly minimal use of textboxes and user input, so I’ve been cruising on an absolutely draconian whitelist. It strips virtually everything except alphanumeric characters. Beyond those I’ve only allowed period, @, and space. The whitelist I’ve been using is…
[^.@a-zA-Z0-9 ]
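For anyone who wants to poke at it, here is a minimal sketch of how that pattern might be applied, written in Python purely for illustration (my apps are not necessarily in Python, and the names `NOT_ALLOWED` and `sanitize` are made up for this example). The pattern is a negated character class, so it matches everything that is NOT on the whitelist, and each match is removed.

```python
import re

# Negated character class: matches any character that is NOT
# a period, @, ASCII letter, ASCII digit, or space.
NOT_ALLOWED = re.compile(r"[^.@a-zA-Z0-9 ]")

def sanitize(text: str) -> str:
    """Remove every character that is not on the whitelist (hypothetical helper)."""
    return NOT_ALLOWED.sub("", text)

print(sanitize("jane.doe@example.com"))          # unchanged: every character is allowed
print(sanitize("<script>alert('x');</script>"))  # angle brackets, quotes, parens, ; and / are removed
```

Note that this strips silently. Rejecting the input outright when anything matches is the other obvious option, and which one is better probably depends on the form.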
Now, I want to do something less battle axe and more scalpel. I’ve been testing some variations, and will try to get some more stuff posted.
My question is, what IS vs. IS NOT a good idea to include in your whitelist? What do you use, and what do you think are best practices? There’s a lot of stuff about security including admonitions to use whitelists; now I’m much more interested in implementation.
I’ll open with some general stabs. I’ve worked with web security at a solid intermediate level, but I’m no guru so I’m interested in other people’s thoughts. I haven’t found a lot of discussion about what should and should not be in a good whitelist.
Punctuation looks like the most dangerous area, especially the characters that are convenient on the keyboard and built into almost every programming language, including JavaScript (and hence are an XSS concern). Semicolons and dollar signs in particular seem like they should be totally off the board.
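One quick way to see exactly which keyboard punctuation the draconian whitelist lets through is to run every ASCII punctuation character past it. This is only a sketch in Python; `string.punctuation` is the standard library's list of 32 ASCII punctuation characters, and the variable names are made up for the example.

```python
import re
import string

# The original whitelist, as a negated character class.
NOT_ALLOWED = re.compile(r"[^.@a-zA-Z0-9 ]")

# Split ASCII punctuation into what the whitelist keeps and what it removes.
allowed = [c for c in string.punctuation if not NOT_ALLOWED.fullmatch(c)]
rejected = [c for c in string.punctuation if NOT_ALLOWED.fullmatch(c)]

print("allowed: ", "".join(allowed))
print("rejected:", "".join(rejected))
```

Only period and @ survive, so semicolons, dollar signs, quotes, and angle brackets are all already off the board. Any punctuation added later can be checked the same way.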
On the other hand, letters seem relatively safe. My latest version adds much of the Latin-1 Supplement block. I’ve included À (U+00C0) through Ö (U+00D6), then Ø (U+00D8) through ö (U+00F6), and ø (U+00F8) through ÿ (U+00FF).
[^.@a-zA-Z0-9À-ÖØ-öø-ÿ ]
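Here is a hedged Python sketch of the extended pattern, with the three ranges written as code point escapes so they are easy to compare against a character chart. One thing in it is my own assumption rather than part of the whitelist itself: the NFC normalization step. An accented letter can arrive as a base letter plus a combining mark (for example `e` followed by U+0301), and without composing it first the whitelist would strip the accent and leave a bare `e`.

```python
import re
import unicodedata

# Extended whitelist: ASCII letters and digits, period, @, space, plus
# U+00C0-U+00D6, U+00D8-U+00F6, and U+00F8-U+00FF.
# U+00D7 (multiplication sign) and U+00F7 (division sign) are deliberately skipped.
NOT_ALLOWED = re.compile(
    "[^.@a-zA-Z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF ]"
)

def sanitize(text: str) -> str:
    """Hypothetical helper: compose accents (assumption), then strip non-whitelisted characters."""
    composed = unicodedata.normalize("NFC", text)  # e + U+0301 becomes U+00E9
    return NOT_ALLOWED.sub("", composed)

print(sanitize("Jos\u00e9 M\u00fcller"))   # accented letters are kept
print(sanitize("3 \u00d7 4 \u00f7 2"))     # the two math signs are removed
print(sanitize("Jose\u0301"))              # decomposed input keeps its accent after NFC
```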
This skips two mathematical-looking characters, × (U+00D7) and ÷ (U+00F7), though I’m not sure whether that makes them dangerous or not. I figure since we’re talking security, it’s better to find out that you sanitized too much than to have a sensitive application hacked.
Here’s a character reference, to make things easier
You’ll note I haven’t used any of the fancier regular expression tools. For example, I’m not eliminating (X)HTML elements etc. I’m not sure what, if anything, anyone thinks about that. I simply eliminated < and >, which takes care of that specific issue.
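To make the trade-off concrete, here is a small Python sketch comparing the two approaches. The `TAG` pattern and both function names are hypothetical, and a regex like this is only a rough stand-in for removing elements, not a real HTML parser.

```python
import re

# Rough, hypothetical tag matcher: "<", anything up to the next ">", then ">".
TAG = re.compile(r"<[^>]*>")
# The original whitelist, which already removes "<" and ">".
NOT_ALLOWED = re.compile(r"[^.@a-zA-Z0-9 ]")

def whitelist_only(text: str) -> str:
    """Strip only the non-whitelisted characters; tag names remain as plain text."""
    return NOT_ALLOWED.sub("", text)

def strip_tags_then_whitelist(text: str) -> str:
    """Remove whole tags first, then apply the whitelist."""
    return NOT_ALLOWED.sub("", TAG.sub("", text))

sample = "<b>Hello</b> world"
print(whitelist_only(sample))             # tag names are left behind as stray letters
print(strip_tags_then_whitelist(sample))  # tags are gone entirely
```

Either way no < or > survives, so no element can be formed from the output. The difference is mostly cosmetic: the whitelist alone leaves the tag names behind as leftover text.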
Please shred away!