What "safe" characters do you have in your Whitelist?

I’ve been programming in various languages, building fairly stripped-down, streamlined web applications. I do a lot of custom work and UX design to ensure that users can do everything they need to do, balanced against the need to maximize the amount of work someone can get done with the minimum amount of effort. The result is fairly minimal use of textboxes and user input, so I’ve been cruising along on an absolutely draconian whitelist. It strips virtually everything except alphanumeric characters; beyond those I’ve only included the period, @, and space. The whitelist I’ve been using is…

[^.@a-zA-Z0-9 ]

Now, I want to do something less battle axe and more scalpel. I’ve been testing some variations, and will try to get some more stuff posted.

My question is, what IS vs. IS NOT a good idea to include in your whitelist? What do you use, and what do you think are best practices? There’s a lot of stuff about security including admonitions to use whitelists; now I’m much more interested in implementation.

I’ll open with some general stabs. I’ve worked with web security at a solid intermediate level, but I’m no guru so I’m interested in other people’s thoughts. I haven’t found a lot of discussion about what should and should not be in a good whitelist.

Punctuation looks like the most dangerous area, especially the convenient characters on the keyboard that are built into nearly every programming language, including JavaScript (and hence are an XSS concern). Semicolons and dollar signs in particular seem like they should be totally off the board.

On the other hand, letters seem relatively safe. My latest version adds much of the Latin-1 Supplement block. I’ve included À (U+00C0) through Ö (U+00D6), then Ø (U+00D8) through ö (U+00F6), and ø (U+00F8) through ÿ (U+00FF).

[^.@a-zA-Z0-9À-ÖØ-öø-ÿ ]

This skips two mathematical-looking characters, × (U+00D7) and ÷ (U+00F7), though I’m not sure whether that actually makes them dangerous. I figure that since we’re talking security, it’s better to find out you sanitized too much than to have a sensitive application hacked.
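In PHP terms I’ve been applying it roughly like this (a sketch; the /u modifier is there because my input is UTF-8, otherwise the accented ranges get matched byte by byte):

// Strip everything outside the whitelist. With UTF-8 input the /u modifier
// makes À-ÿ match as characters rather than as raw bytes.
$clean = preg_replace('/[^.@a-zA-Z0-9À-ÖØ-öø-ÿ ]/u', '', $input);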

Here’s a character reference, to make things easier

You’ll note I haven’t used any of the fancier regular-expression tools. For example, I’m not stripping out (X)HTML elements as such; not sure what, if anything, anyone thinks about that. I just eliminated < and >, which takes care of that specific issue.

Please shred away! :smiley:

Well yes, but it should be obvious that binary data shouldn’t have nulls stripped from it.

There is one unsafe character that is best filtered out pretty much always, though: null.
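For ordinary text input in PHP that can be as simple as something like this (a one-line sketch):

$clean = str_replace("\0", '', $input); // drop NUL bytes from text input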

No, they are the characters you need to escape so that text content doesn’t get mixed up with HTML.

If you validate all of the input fields properly to make sure that what they contain is valid data for that field, then you eliminate most opportunities for XSS, as well as a whole load of garbage in your database, right at the start.

The only fields that should be XSS-vulnerable are those that can validly contain that type of code, and then it is just a matter of making sure that the data is kept separate from the code. If you are using PHP then htmlentities will escape everything necessary before writing text out into your HTML. The two main characters it takes care of are < and & (it might also convert >, but that isn’t essential, as a > that doesn’t have a matching < can’t be an HTML tag). If you are not outputting to HTML then you don’t need those escapes, and you certainly wouldn’t apply them before writing to a database, as that would break the text for use anywhere other than in HTML (as well as making the content bigger).
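For example, the escaping happens at the point of output, along these lines (just a sketch; $comment stands in for whatever user-supplied text you are displaying, and the ENT_QUOTES flag and UTF-8 charset are illustrative choices):

// Escape when writing into the HTML, not before storing the raw text.
echo '<p>' . htmlentities($comment, ENT_QUOTES, 'UTF-8') . '</p>';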

  1. I agree that you don’t have to sanitize/filter the dangerous items and that escaping characters is a perfectly valid alternative. However, you’ll note that this still leaves the underlying question open: which characters are safe enough that you choose NOT to escape them? I’d be very interested to see any strategies you have for escaping characters that are not on a whitelist. I’ve seen lookup-style escaping performed, but of necessity that is a blacklist technique, because if a character isn’t included in your lookup table then you can’t escape it. That’s why I decided to figure out the whitelist first; dealing with escaping seems like an awful lot to bite off for one round of testing. But if you know of something, I’d be happy to take a look at it.

  2. Absolutely. You should look at the context of the input you’re working with. In fact, you literally can’t perform validation without taking the context into account.

  3. The filters are interesting, but they don’t seem to change any of the fundamental dynamics of the issue, nor do they really provide an alternative solution. The filter simply puts an extra layer of abstraction between you and the regex; the filter is building the regex for you. My question is how to do that securely and in an internationalized (i18n) way. That’s actually why I prefer building the regex manually in this instance: it gives you more control. For example, if I employ both the high and low flags, then you are right back to [^a-zA-Z0-9]. The filter gives you an abstraction that hides that fact away, but that looks like what’s happening behind the scenes. Nor do I see any other filters or flags that jump out at me as a way to achieve what I want. Do you know how to move beyond that to support French, German, Greek, Chinese, or Russian, whether by building the regex manually, through filters, or otherwise? (There’s a tentative sketch just below.)
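For instance, a Unicode-property version of the whitelist might look something like this (just a sketch, assuming UTF-8 input and PCRE with Unicode support); I have no idea yet what its pitfalls are:

// \p{L} matches a letter in any script (Latin, Greek, Cyrillic, Han, ...),
// \p{N} matches any kind of numeric digit; /u makes PCRE treat input as UTF-8.
$clean = preg_replace('/[^\p{L}\p{N}.@ ]/u', '', $input);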

Interesting. So the < and the & are the primary items you target to prevent XSS? Are you implying > as well, or is there a reason you don’t include that?

If you’re targeting entities, would you be concerned about backslash? I know that backslash is sometimes used to escape characters, though mostly in programming constructs. Regular expressions use \d, \s, and similar. Some characters are also escaped this way: \t for tab and \r\n for a new line under Windows. In ASP.NET the same technique can be used on any Unicode character (e.g. \u0061 for the code point “a”), though I don’t know if that’s tech-specific. If the compiler is translating that into “a” during output, then it may be specific to a .NET application, but if it’s just passing it along as a string, then any browser will process it correctly too (I’ve used this technique to output tabs and new lines in cross-platform solutions when \t and \r\n didn’t work).

Also, escaping is a blacklist technique. Do you also use a whitelist to sanitize after you’ve escaped the things you want to save? What’s your philosophy or strategy regarding protecting the application from user input?

PHP now has both filter_var and filter_input (http://php.net/manual/en/function.filter-input.php), to which you can apply different types of filters (http://www.php.net/manual/en/filter.filters.php) to sanitise or validate the contents in reliable ways.

For example:


// Sanitize the posted 'name': strip tags and encode characters above ASCII 127.
$name = filter_input(INPUT_POST, 'name', FILTER_SANITIZE_STRING,
    array('flags' => FILTER_FLAG_ENCODE_HIGH)
);
// Validate the posted 'email': returns the address if valid, false otherwise.
$email = filter_input(INPUT_POST, 'email', FILTER_VALIDATE_EMAIL);

Not if the data is binary rather than text, though. In non-text data any character is just as likely and as valid as any other.

That all depends on what use you are making of the content. The characters that need to be escaped depend on just what characters would be misinterpreted if they weren’t escaped. So any character in the data that can’t be misinterpreted doesn’t need to be escaped.

For example if you are using PDO or mysqli and prepared statements for your database accesses then there is no need to escape anything when inserting it into the database because the database command is completely separate from the data and one can’t be confused for the other.
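For example, something like this (a minimal sketch; the $pdo connection and the comments table with author and body columns are only placeholders):

// The SQL command is prepared separately from the data, so the data
// can never be interpreted as part of the command.
$stmt = $pdo->prepare('INSERT INTO comments (author, body) VALUES (?, ?)');
$stmt->execute(array($author, $body));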

If you are outputting information into a web page then the < and & characters need to be escaped as &lt; and &amp; respectively so that the content doesn’t get misinterpreted as HTML tags and entities.

I happen to use semicolons often in my writing.

There is no need to make a catch-all filter. If you use the proper escape function for each context, there is no risk.

I’m looking at an appropriate whitelist for more text/narrative-based user input. I want to sanitize not an expected integer or a hex code, but a comment, a blog post, a forum thread. Text of that nature.

While I don’t “need” it currently, I would also like something that supports multiple languages, hence my interest in the Latin-1 Supplement and Latin Extended blocks as well as other scripts.

Semicolons are problematic because they’re a common statement terminator in many languages (including SQL). Allowing them opens the opportunity for an attacker to execute not just one but multiple statements. While parameterized queries are probably the most important way to protect your database, semicolons are at the heart of the textbook SQL injection attack.

Dollar signs are another common symbol in a number of languages. They’re used to declare variables in PHP, though I’m not sure how that might be used against an application. More significantly, the dollar sign is the shorthand function name in jQuery and several other popular JavaScript libraries, which makes it one more tool for XSS attacks.

Both characters are also of fairly limited usefulness for a blog or similar scenario. Semicolons are rarely used and are unlikely to cause a loss of meaning when sanitized. Dollar signs may be a bit more confusing when removed, since they’ll simply leave an integer or a decimal with two digits after the decimal point; however, I also suspect they come up even less often than semicolons. The only problem I can envision in either case is a programming code sample; then you might need a different strategy, such as running the input through a dictionary swap for the things you want to save.
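If it ever came to that, I’m picturing something like this rough sketch of the dictionary-swap idea (the placeholder tokens are made up for illustration, and a user who literally typed one of them would get it converted back):

// Swap the characters you want to keep for placeholders built only from
// whitelisted characters, sanitize, then swap them back.
$keep = array(';' => 'XSEMICOLONX', '$' => 'XDOLLARX');
$protected = strtr($input, $keep);
$sanitized = preg_replace('/[^.@a-zA-Z0-9 ]/', '', $protected);
$restored = strtr($sanitized, array_flip($keep));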

It depends on what it is you are trying to protect. Making a “catch-all” is impractical. Blocking semicolons and dollar signs? What for? Some context about what you are trying to protect would help.