What's your view of sanitising user input?

Hi,

I’ve read a great deal about sanitising user input that comes into an application, yet there seem to be many interpretations of what “sanitisation” is. Some mix this with validation, some look at it as cleaning the input up ready to be outputted. The way I see it:

  1. “Sanitising” is cleaning the inputs up ready to be used by the application classes.
  2. “Validating” is taking the “sanitised” inputs and checking they fall within the criteria required for the application (e.g. is it a valid email format?)
  3. “Escaping” is cleaning the “sanitised” and “validated” input so that it is ready to be outputted

With this in mind, I’m wondering how you sanitise the input. I want to create a Request Wrapper to house all the inputs, but am not convinced it should contain sanitisation functions; I would rather deal with this as a separate operation outside the Wrapper.

So if you were to build a class to deal solely with sanitisation, what would it consist of and when would you run it in your application? Or would you have a sanitisation script at all?

Sanitizing is exactly what you said - cleaning user input to be used by the application. Examples: trimming strings, stripping HTML tags from strings, removing non-alphanumeric characters from strings, casting to int, converting strings to lowercase/uppercase, matching strings against regular expressions etc. Basically, sanitizing (I think it’s more often called filtering) means making sure the input contains only allowed characters.
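To make a few of those examples concrete, here is a minimal sketch. The helper names (cleanText, alnumOnly, lowerSlug) are made up for illustration; each one only wraps standard PHP functions.

```php
<?php
// Hypothetical helper names; each wraps a built-in PHP function.

// Strip HTML tags, then trim surrounding whitespace.
function cleanText(string $raw): string
{
    return trim(strip_tags($raw));
}

// Keep only ASCII letters and digits.
function alnumOnly(string $raw): string
{
    return preg_replace('/[^A-Za-z0-9]/', '', $raw);
}

// Reduce to letters and digits, then lower-case the result.
function lowerSlug(string $raw): string
{
    return strtolower(alnumOnly($raw));
}

echo cleanText("  <b>Hello</b> world \n"), "\n"; // Hello world
echo alnumOnly("ab-12_!c"), "\n";                // ab12c
echo lowerSlug("My-Page 01"), "\n";              // mypage01
echo (int) "42abc", "\n";                        // 42 (explicit cast keeps the leading digits)
```

Which of these you apply depends on the field; stripping tags from a password, for example, would silently change it.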

You got the validation right, too. Validation means you make sure the input has the correct format or, to put it differently, that the input has a meaning in its context. You already mentioned email address validation. Other examples are validating dates, times, IP addresses, credit card numbers. If the input is a number, you can validate that it belongs to a narrower set of numbers (defined as an array, or just by inequalities - e.g. greater than 5 and less than 200). If the input is a file, you can validate its size, dimensions, file format, MIME type etc.
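A rough sketch of a few of these checks, using PHP's built-in filter extension and DateTime. The function names are hypothetical, and the date check assumes the input is expected in Y-m-d form.

```php
<?php
// Hypothetical wrappers around filter_var(); each returns true only for well-formed input.
function isValidEmail(string $raw): bool
{
    return filter_var($raw, FILTER_VALIDATE_EMAIL) !== false;
}

function isValidIp(string $raw): bool
{
    return filter_var($raw, FILTER_VALIDATE_IP) !== false;
}

// Assumption: dates arrive as Y-m-d. Re-formatting the parsed date and comparing
// it with the input rejects overflow dates such as 2023-02-30.
function isValidDate(string $raw): bool
{
    $date = DateTime::createFromFormat('Y-m-d', $raw);
    return $date !== false && $date->format('Y-m-d') === $raw;
}

var_dump(isValidEmail('user@example.com')); // bool(true)
var_dump(isValidEmail('not-an-email'));     // bool(false)
var_dump(isValidIp('192.168.0.1'));         // bool(true)
var_dump(isValidDate('2023-02-30'));        // bool(false)
```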

You got the escaping right as well.

To answer your question: there is a filter_var() function in PHP you can use to sanitize inputs. If you were to build a class for that purpose, it would probably make sense to make only an abstract class and then create subclasses extending the abstract class for different cases. I know this is how it’s done in Zend Framework, which I use. There is an abstract class and then you have classes called “filters” that extend that class. There are more than 20 filters for different cases, I believe.

Also take a look at this: http://php.net/manual/en/function.filter-input.php
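The abstract-class idea could look something like the sketch below. To be clear, the class and function names here are invented for illustration and are not Zend Framework's actual API; the point is only the shape: one small class per filter, all sharing a filter() method so they can be chained.

```php
<?php
// Hypothetical names; this is NOT Zend Framework's API, just the same general pattern.
abstract class InputFilter
{
    abstract public function filter($value);
}

class TrimFilter extends InputFilter
{
    public function filter($value)
    {
        return trim((string) $value);
    }
}

class StripTagsFilter extends InputFilter
{
    public function filter($value)
    {
        return strip_tags((string) $value);
    }
}

class DigitsFilter extends InputFilter
{
    public function filter($value)
    {
        // Remove everything that is not a digit.
        return preg_replace('/\D/', '', (string) $value);
    }
}

// Run a value through a list of filters, in order.
function applyFilters($value, array $filters)
{
    foreach ($filters as $filter) {
        $value = $filter->filter($value);
    }
    return $value;
}

echo applyFilters(' <i>12-34</i> ', [new StripTagsFilter(), new TrimFilter(), new DigitsFilter()]); // 1234
```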

Pretty good question and a great answer.

All I can add is the acronym FIEO - Filter Input, Escape Output - and that is something you can google for.

However it is another “touchstone” you can keep in your pocket and rub when things have got really complex, and you want to do a final check.

Filter incoming data against patterns (e.g. email), bounds (Months are between 1 and 12), and white-lists (permitted options from a drop down list).

If it fails you decide what to do. Filter as much as you can.
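As a small sketch of the bounds and white-list checks, including the "decide what to do" part: the option list and function names below are hypothetical. One function falls back to a default on failure, the other returns null so the caller can reject the request.

```php
<?php
// Hypothetical list of options offered by a drop-down.
const ALLOWED_COLOURS = ['red', 'green', 'blue'];

// White-list: accept only known options, otherwise fall back to a default.
function pickColour($raw, string $default = 'red'): string
{
    // The third argument forces strict comparison, avoiding loose-typing surprises.
    return in_array($raw, ALLOWED_COLOURS, true) ? $raw : $default;
}

// Bounds: months are between 1 and 12; null signals "invalid, caller decides".
function parseMonth($raw): ?int
{
    $month = filter_var($raw, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => 1, 'max_range' => 12],
    ]);
    return $month === false ? null : $month;
}

var_dump(pickColour('green'));  // string(5) "green"
var_dump(pickColour('purple')); // string(3) "red"
var_dump(parseMonth('7'));      // int(7)
var_dump(parseMonth('13'));     // NULL
```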

Escaping is making sure the data is safe for the next environment you are sending it to, e.g.

-outputting it to the user's browser NOW (htmlentities) and/or
-storing it in a database NEXT (mysql_real_escape_string)

Don’t display anything that the user has input without escaping it, and use MySQLi or PDO prepared statements for EVERYTHING you store in your database, so that values are bound as data rather than spliced into the SQL - and then sleep properly at night.
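A minimal sketch of both halves of that advice. It assumes the pdo_sqlite extension is available and uses an in-memory database with a made-up comments table, purely so the example is self-contained; with MySQL only the connection string would differ.

```php
<?php
$comment = '<script>alert("x")</script>';

// Escape for HTML at the point of output, not when the data comes in.
$safeHtml = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
echo $safeHtml, "\n"; // &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;

// Assumption: pdo_sqlite is installed; the table and column names are hypothetical.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)');

// The value is bound to a placeholder, so it never becomes part of the SQL text.
$stmt = $pdo->prepare('INSERT INTO comments (body) VALUES (:body)');
$stmt->execute([':body' => $comment]);

// The database holds the raw value; it gets escaped again whenever it is displayed.
$stored = $pdo->query('SELECT body FROM comments')->fetchColumn();
```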

Understand FIEO and you will have covered the top 2 attack vectors (SQL injection and XSS).

What can I say. Wasn’t expecting pretty much everything to be covered in 3 posts, thanks a lot guys.

OK, now it’s clearer to me how sanitisation can be used, I’d be intrigued to know if you think it’s possible to have a sanitisation script that runs before the application starts running. In other words, can you visualise a way that a wrapper could clean all requests automatically before the script starts to run?

I’m guessing no, as the filter functions rely on knowing the data type. Short of prefixing variable names to identify data types … hang on, could that work?
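One possible alternative to prefixing variable names would be to declare the expected type of each field in a map and hand the whole request to filter_var_array() (or filter_input_array(INPUT_POST, ...) for live request data). A rough sketch, with hypothetical field names and a plain array standing in for $_POST:

```php
<?php
// Hypothetical field names: one filter definition per expected input.
$definition = [
    'email' => FILTER_VALIDATE_EMAIL,
    'age'   => [
        'filter'  => FILTER_VALIDATE_INT,
        'options' => ['min_range' => 0, 'max_range' => 150],
    ],
    'name'  => [
        'filter'  => FILTER_CALLBACK,
        'options' => function ($value) {
            return trim(strip_tags($value));
        },
    ],
];

// Stand-in for $_POST.
$request = [
    'email' => 'a@example.com',
    'age'   => '42',
    'name'  => ' <b>Ann</b> ',
    'extra' => 'not in the definition',
];

// Only keys named in $definition come back; values that fail validation become false.
$clean = filter_var_array($request, $definition);

var_dump($clean['age']);  // int(42)
var_dump($clean['name']); // string(3) "Ann"
```

The map still has to be written per form, so this does not remove the need to know the data types; it just moves that knowledge into one place.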

I use a method, defined in my BaseObject class (which everything extends from), which looks like this:

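public function sanitizeVar($var) {
    // Recurse into arrays so nested inputs are cleaned as well.
    if (is_array($var)) {
        foreach ($var as $key => $value) {
            $var[$key] = $this->sanitizeVar($value);
        }
        return $var;
    }

    // Objects are passed through untouched.
    if (is_object($var)) {
        return $var;
    }

    // Note: request data arrives as strings, so this branch only matches
    // values that are already integers. The cast applies to the filtered result.
    if (is_int($var)) {
        return (int) filter_var($var, FILTER_SANITIZE_NUMBER_INT);
    }

    // Anything that validates as an email address gets the email sanitiser.
    if (filter_var($var, FILTER_VALIDATE_EMAIL)) {
        return filter_var($var, FILTER_SANITIZE_EMAIL);
    }

    if (is_string($var)) {
        // FILTER_SANITIZE_STRING is deprecated as of PHP 8.1.
        return filter_var($var, FILTER_SANITIZE_STRING);
    }

    // Let the caller handle the failure rather than echoing a stack trace to the page.
    throw new Exception("No data type could be determined for an entry of type " . gettype($var) . ". Please check your input and try again.");
}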

Then, I have a session class which holds data from the Superglobal arrays ($_POST, $_GET, $_SESSION, etc). Instantiation of the Session class instantiates objects for each of the superglobals, like so:

class PostObject extends BaseObject {
    private $postarray = array();

    public function __construct(){
        foreach($_POST as $key => $value){
            $this->postarray[$key] = $this->sanitizeVar($value);
        }
    }

    public function getArray(){
        return $this->postarray;
    }

}

From there, the only thing necessary is to remember to only access your Superglobals via the Session class, and never via the arrays. There are simpler ways to do this, but this is personally the one I’ve implemented. If anyone would like to offer feedback on my implementation, or if you have any questions, please ask.

Just make sure to wash yer hands afterwards!

Interesting. So what method do you use in your applications to retrieve the data, or an individual input from POST, let’s say?

Almost certainly the code posted above the comment you quoted which uses the session class to retrieve the data from the POST array and sanitise it.

Each of the Superglobals is saved as an object like the PostObject you saw above. Whenever I need to retrieve data from one of those objects, I usually use the following method from the session object:

public function getVar($array, $key) {
    // $array is the name of the property holding the wrapped superglobal object.
    $result = $this->$array->getArray();

    if (!isset($result[$key])) {
        // Let the caller decide how to handle a missing key instead of echoing a stack trace.
        throw new Exception("The session variable you're searching for doesn't exist");
    }

    return $result[$key];
}

I also have a findVar method which does nearly the same thing, except it doesn’t accept an array parameter, and simply loops through each of the superglobals and returns the first key that matches from the array (I obviously don’t use that one as often).