Ultimate TidyInput function

I would like to create the ultimate TidyInput function. By “TidyInput” I mean a function that takes a raw $_POST array and makes it safe for further processing. I use this function in scripts that have to process form fields, it’s usually the first thing I do before trying to make use of the form fields.

In the past I’ve used something like this:

function TidyInput($field) {
  if (get_magic_quotes_gpc()) $field = stripslashes($field);
  $field = strip_tags(trim($field));
  return $field;
}

and then in the calling script I would use this:

foreach ($_POST as $key => $val) {
  $tidyinput = TidyInput($val);
  $_POST[$key] = $tidyinput;
}

Lately I’ve changed the script to this:

function TidyInput(&$data) {
  foreach ($data as $key => $value) {
    if (!is_array($value)) {
      if (get_magic_quotes_gpc()) $value = stripslashes($value);
      $data[$key] = trim(htmlspecialchars($value, ENT_NOQUOTES, 'UTF-8'));
    } else {
      TidyInput($value);
      $data[$key] = $value;
    }
  }
}

and called it like this:

TidyInput($_POST);

As you can see my improved function handles arrays, which the previous function didn’t do.

I use this function in scripts that then go on to do something with the form fields, usually either email the form fields somewhere or save then in a MySQL database. When saving in a database I always use mysql_real_escape_string() on the field before using it in a query.

In some cases the scripts are public (i.e. a contact form where the form fields are emailed somewhere) and other times the scripts may be private (i.e. a password protected area that only the site admin can log into and update product details that are stored in a database). After using TidyInput my scripts then do further checking for required fields, numeric fields, date fields, etc, so I don’t need TidyInput() to do any checking like this.

My question is; am I doing everything to protect against invalid input that might cause problems or even attacks? Is this a good general-purpose function for tidying up input? Have I missed anything?

One area where my TidyInput() function falls down is when I want to allow the user to enter HTML (i.e. a product description field which is a textarea field and I use something like CKEditor to allow the user to enter HTML). In this case I don’t want to pass the textarea field through TidyInput() because if I do then htmlspecialchars() will convert angle brackets to < and > and generally muck up the HTML. Would be nice if I could somehow specify fields that are allowed to hold HTML, any ideas?

I can speak from experience of deciding to do similar for a few months - and then regretting it for years later.

Take a read of this old thread and [URL=“http://www.sitepoint.com/forums/showthread.php?689776-Sanitizer-code-efficient”]this one and there might be something in [URL=“http://www.sitepoint.com/forums/showthread.php?729029-Best-way-to-Cleanse-Data&highlight=function+to+cleanse+incoming+data”]this one too and finally [URL=“http://www.sitepoint.com/forums/showthread.php?616810-Reasonable-Method-Of-Validating-Data&highlight=function+to+cleanse+incoming+data”]this one.

Heres another way of doing things too.

HTH

There are different kinds of “invalid input” depending on how the data is being used. XSS, prevented by htmlspecialchars, is one, and as you noted, SQL injection, prevented by mysql_real_escape_string, is another. Your TidyInput function handles one, very specific kind of invalid input, so it might be better to rename the function to escapeHTML so that it’s clear exactly what it will protect you from.

Also, I think it’s better to keep the raw, original content in the database, and escape HTML only just before you echo the content to a page. The reason is, although outputting to an HTML page is the most common, it may not be the only kind of document you generate. You may need to use your database data to generate JSON, or to send an email, in which case it would be bad if the database data was pre-escaped for HTML.

Thank you both for your comments.

I think that where I’ve gone wrong is that I’m not really following the FIEO (Filter Input, Escape Output) model, instead I’m mixing input and output up a bit. What I do with the input is dependant on what I’m going to use it for (output to a web page, save in a database, email it, etc). So using htmlspecialchars() makes sense when I want to echo the input on a web page but it destroys the original input when I save it in a database, so I shouldn’t be forcing all input through htmlspecialchars(), instead I should only be using that when I output to a web page.

Once again, thanks guys, I think you’ve set me on the right path now!