Sanity with sanitizing

I am curious about the best methods for sanitizing inputs that arrive via

$_POST
$_GET
$_FILES

as they enter a server.

I have been reviewing the code of several CMSs, and each one has its own method of sanitizing these inputs to guard against breaking the PHP script.

Everything I have found on the subject relates to MySQL injection, which is not what I am concerned with. It appears to me that there is a lack of advice or guidance on this subject. If you have no access to the PHP config file on your web host, where do you stand with sanitizing those inputs against someone attempting to break the script, let alone attempting injection?

All of those input types are user enterable and so the first thing that needs to be done with each is to VALIDATE their content.

Only fields that do not come directly from the user need sanitizing (just in case someone tampered with them).

When validating inputs:

  1. use a built-in function if one exists, e.g. is_numeric($_POST['num'])
  2. use validation filters where they exist, e.g. filter_var($_POST['email'], FILTER_VALIDATE_EMAIL)
  3. use a regular expression to check that the input at least looks reasonable if neither of the above are available

Then, once you are sure that the input only contains content that might be valid, perform any other validations, such as comparing it to other fields.
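As a rough illustration of the order described above, here is a minimal sketch. The function name, the field names ('num', 'email', 'email_confirm', 'postcode') and the four-digit postcode rule are all assumptions made up for the example, not something from any particular CMS.

```php
<?php
// Sketch only: the field names, messages and postcode rule are assumptions.
function validate_form(array $input): array
{
    $errors = [];

    // 1. Built-in function.
    if (!isset($input['num']) || !is_numeric($input['num'])) {
        $errors['num'] = 'Must be a number.';
    }

    // 2. Validation filter: filter_var() returns false when the value fails.
    if (!isset($input['email'])
        || filter_var($input['email'], FILTER_VALIDATE_EMAIL) === false) {
        $errors['email'] = 'Must be an email address.';
    }

    // 3. Regular expression, for a format with no built-in check.
    if (!isset($input['postcode'])
        || !is_string($input['postcode'])
        || preg_match('/\A[0-9]{4}\z/', $input['postcode']) !== 1) {
        $errors['postcode'] = 'Must be four digits.';
    }

    // Cross-field checks only once the individual fields look reasonable.
    if ($errors === [] && ($input['email_confirm'] ?? null) !== $input['email']) {
        $errors['email_confirm'] = 'Email addresses do not match.';
    }

    return $errors;
}
```

It would be called as $errors = validate_form($_POST); an empty array means every check passed.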

For sanitising fields that are already supposed to be valid use the sanitizing filters or a regular expression to strip out characters not valid in that field. Provided it actually is valid the field should be unchanged. If it was tampered with then at least what is left will do less harm.
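A small sketch of that idea, assuming a hypothetical hidden field that the script itself wrote into the form and that should only ever hold digits. The function name is invented for the example.

```php
<?php
// Hypothetical helper: the field is supposed to contain digits only,
// so anything else can only have got there through tampering.
function sanitize_digits(string $value): string
{
    // Strip every character that is not an ASCII digit.
    return (string) preg_replace('/[^0-9]/', '', $value);
}

// An untampered value comes back unchanged.
$clean = sanitize_digits('15');

// A tampered value is reduced to whatever digits it contained.
$stripped = sanitize_digits('15; DROP TABLE users');

// The built-in sanitizing filter does a similar job; note that it also
// keeps plus and minus signs.
$filtered = filter_var('15<script>', FILTER_SANITIZE_NUMBER_INT);
```

Here $clean, $stripped and $filtered should all end up as the string '15'.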

Yeah, I looked at filter_var().

I am not really convinced that it would be effective as “protection”; it just filters to ensure that what is given meets an expected criterion.

For example…

$safe_username = filter_var( $_POST['username'] , FILTER_VALIDATE_EMAIL );

If the username were an email address, as many sites use these days, then the input would only be checked to confirm that it met the requirements of an email address.
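To make that concrete, here is a minimal sketch of what the call hands back. The helper name is made up for the example; the only real API used is filter_var() with FILTER_VALIDATE_EMAIL, which returns the original string when it passes and false when it does not.

```php
<?php
// Hypothetical wrapper: returns the address when it passes the filter,
// or null when it does not, so the caller has to handle the failure case.
function checked_email($raw): ?string
{
    $result = filter_var($raw, FILTER_VALIDATE_EMAIL);

    return $result === false ? null : $result;
}

$good = checked_email('user@example.com'); // passes, string comes back as given
$bad  = checked_email('not an email');     // fails, so null
```

So the value is only checked against the format; nothing in it is altered or escaped.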

I really don’t see how this can be an advantage above a regular expression approach.

Two advantages over a regular expression.

  1. The filter has been thoroughly tested so you know it will work. A regular expression might not quite be matching on what you expect in every case - particularly with longer ones. So you can never be quite as certain that an invalid value will be properly detected by your regular expression.

  2. Some regular expressions are really long. For example the following validates an email address properly according to the email address standards:

^[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|"[^\\\\\\x80-\\xff\
\\015"]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015"]*)*")[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:\\.[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|"[^\\\\\\x80-\\xff\
\\015"]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015"]*)*")[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*)*@[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|\\[(?:[^\\\\\\x80-\\xff\
\\015\\[\\]]|\\\\[^\\x80-\\xff])*\\])[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:\\.[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|\\[(?:[^\\\\\\x80-\\xff\
\\015\\[\\]]|\\\\[^\\x80-\\xff])*\\])[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*)*|(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|"[^\\\\\\x80-\\xff\
\\015"]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015"]*)*")[^()<>@,;:".\\\\\\[\\]\\x80-\\xff\\000-\\010\\012-\\037]*(?:(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)|"[^\\\\\\x80-\\xff\
\\015"]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015"]*)*")[^()<>@,;:".\\\\\\[\\]\\x80-\\xff\\000-\\010\\012-\\037]*)*<[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:@[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|\\[(?:[^\\\\\\x80-\\xff\
\\015\\[\\]]|\\\\[^\\x80-\\xff])*\\])[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:\\.[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|\\[(?:[^\\\\\\x80-\\xff\
\\015\\[\\]]|\\\\[^\\x80-\\xff])*\\])[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*)*(?:,[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*@[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|\\[(?:[^\\\\\\x80-\\xff\
\\015\\[\\]]|\\\\[^\\x80-\\xff])*\\])[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:\\.[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|\\[(?:[^\\\\\\x80-\\xff\
\\015\\[\\]]|\\\\[^\\x80-\\xff])*\\])[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*)*)*:[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*)?(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|"[^\\\\\\x80-\\xff\
\\015"]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015"]*)*")[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:\\.[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|"[^\\\\\\x80-\\xff\
\\015"]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015"]*)*")[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*)*@[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|\\[(?:[^\\\\\\x80-\\xff\
\\015\\[\\]]|\\\\[^\\x80-\\xff])*\\])[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:\\.[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*(?:[^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff]+(?![^(\\040)<>@,;:".\\\\\\[\\]\\000-\\037\\x80-\\xff])|\\[(?:[^\\\\\\x80-\\xff\
\\015\\[\\]]|\\\\[^\\x80-\\xff])*\\])[\\040\	]*(?:\\([^\\\\\\x80-\\xff\
\\015()]*(?:(?:\\\\[^\\x80-\\xff]|\\([^\\\\\\x80-\\xff\
\\015()]*(?:\\\\[^\\x80-\\xff][^\\\\\\x80-\\xff\
\\015()]*)*\\))[^\\\\\\x80-\\xff\
\\015()]*)*\\)[\\040\	]*)*)*>)$

Consider how much less readable that is, and how much more error-prone, compared to the filter approach (particularly when displayed here, where only a small fraction of the expression is visible at any time; you might need to press the down arrow a dozen or more times just to see the whole regular expression).

That is not what I was trying to say. I take your point, but what I am digging at is whether filter_var() is enough on its own. According to some blogs and sites that I have read recently, the filter_var() filters are not as robust as the faith people put in them would suggest.

What I need to do is rewrite some CMS code to make it secure again. The main problem is that the original developer has vanished completely from the web, website and all. Whilst the security measures in place do not meet expectations, the number of actual vulnerabilities is very low. Looking at how other blogs and CMSs do things, the vast majority do not employ any filter_var() methods, and I am wondering why seasoned projects and coders are not using them…

So whilst filter_var() may provide part of the answer by filtering on data type, it doesn’t sanitize against bad data; otherwise it would be called sanitize_var() instead.