Simplifying Test Data Generation with Faker

Testing is an iterative part of the development process that we carry out to ensure the quality of our code. A large portion of this entails writing test cases and testing each unit of our application using random test data.

Actual data for our application comes in when we release it to production, but during the development process we need fake data similar to real data for testing purposes. The popular open source library Faker provides us with the ability to generate different data suitable for a wide range of scenarios.

Here we’ll focus on using Faker to generate random data for our test cases.

How Faker Works

Faker comes with a set of built-in data providers which can be easily accessed to generate test data. Additionally, we can define our own test data types making it highly extensible. But first, let’s look at a basic example that shows how Faker works:

<?php
require "vendor/autoload.php";

$faker = Faker\Factory::create();

// generate data by accessing properties
for ($i = 0; $i < 10; $i++) {
    echo "<p>" . $faker->name . "</p>";
    echo "<p>" . $faker->address . "</p>";
}

The example assumes Faker was installed using Composer and uses the Composer autoloader to make the class definitions available. You can also use Faker by cloning it from its GitHub repository and using its included autoloader if you’re not using Composer.

To use Faker, we first need to obtain an instance from Faker\Factory. All of the default data providers are loaded automatically into the $faker object. Then we generate random data just by calling a formatter by name. The above code outputs ten random person names and addresses drawn from the available data sources.

Providers are classes that hold the data and the formatter methods needed to generate it. Formatters are methods inside provider classes that generate test data either directly from a source or by combining other formatters. Faker comes with the following built-in providers: Person, Address, PhoneNumber, Company, Lorem, Internet, DateTime, Miscellaneous, and UserAgent.
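As a rough illustration of how formatters from several built-in providers are reached through a single generator, here is a minimal sketch. It assumes the same Composer setup as the first example, and the $sample array and its keys are only illustrative names.

```php
<?php
// Assumes Faker was installed with Composer, as in the first example.
require "vendor/autoload.php";

$faker = Faker\Factory::create();

// One formatter from each of several built-in providers.
$sample = array(
    "name"      => $faker->name(),        // Person
    "city"      => $faker->city(),        // Address
    "phone"     => $faker->phoneNumber(), // PhoneNumber
    "company"   => $faker->company(),     // Company
    "sentence"  => $faker->sentence(),    // Lorem
    "email"     => $faker->email(),       // Internet
    "date"      => $faker->date(),        // DateTime
    "userAgent" => $faker->userAgent(),   // UserAgent
);

foreach ($sample as $field => $value) {
    echo $field . ": " . $value . PHP_EOL;
}
```

The values change on every run, since nothing here fixes the random sequence.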

Let’s take a look at the Person class to get a better understanding of what the structure of a Faker provider looks like.

<?php
namespace Faker\Provider;

class Person extends \Faker\Provider\Base
{
    protected static $formats = array(
        "{{firstName}} {{lastName}}",
    );
    protected static $firstName = array("John", "Jane");
    protected static $lastName = array("Doe");

    // combines other formatters according to a randomly chosen format
    public function name() {
        $format = static::randomElement(static::$formats);
        return $this->generator->parse($format);
    }

    // returns a random element directly from the data array
    public static function firstName() {
        return static::randomElement(static::$firstName);
    }

    public static function lastName() {
        return static::randomElement(static::$lastName);
    }
}

Person acts as the provider, extending the base provider class Faker\Provider\Base. firstName() is a formatter which retrieves a random element directly from the internal $firstName data array. Formatters may also combine other formatters and return the data in a specific format, which is what name() does. All of the providers and formatters follow this structure.

The built-in providers contain basic formatters with very limited data. If you are using Faker to automate the process of generating test data, you may need to create your own data sets and formatter implementations by extending the base providers.

<?php
namespace Faker\Provider;

class Student extends \Faker\Provider\Person
{
    protected static $formats = array(
        "{{lastName}} {{firstName}}",
        "{{firstName}} {{lastName}}"
    );
    protected static $firstName = array("Mark", "Adam");
    protected static $lastName = array("Clark", "Stewart");
    protected static $prefix = array("Mr.", "Mrs.", "Ms.", "Miss", "Dr.");

    public static function prefix() {
        return static::randomElement(static::$prefix);
    }

    public static function firstName() {
        return static::prefix() . " " .
            static::randomElement(static::$firstName);
    }
}

Since Student is not a default provider, we have to add it to the Faker generator manually. If the same method is defined on more than one provider, the most recently added provider takes precedence over the others.

<?php
$faker = new Faker\Generator();
$faker->addProvider(new Faker\Provider\Student($faker));

echo $faker->firstName; // invokes Student::firstName()
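The precedence rule can be checked with a minimal sketch. FirstProvider, SecondProvider, and the greeting() formatter are hypothetical names made up for this illustration and are not part of Faker; the sketch again assumes a Composer install.

```php
<?php
require "vendor/autoload.php";

// Two hypothetical providers that both define a greeting() formatter.
class FirstProvider extends \Faker\Provider\Base
{
    public function greeting() {
        return "hello from FirstProvider";
    }
}

class SecondProvider extends \Faker\Provider\Base
{
    public function greeting() {
        return "hello from SecondProvider";
    }
}

$faker = new Faker\Generator();
$faker->addProvider(new FirstProvider($faker));
$faker->addProvider(new SecondProvider($faker));

// SecondProvider was added last, so its formatter should be the one used.
$greeting = $faker->greeting();
echo $greeting . PHP_EOL;
```

Swapping the two addProvider() calls should make FirstProvider's formatter win instead.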

A More Complex Example

The built-in providers contain basic data types for testing, but real-world use cases often require more complexity. In such situations we need to create our own data providers and custom data sets to automate the testing procedure. Let’s build a Faker provider from scratch for a real-world scenario.

Assume we’re developing an email marketing service which sends thousands of emails containing various kinds of advertisements from clients. What data fields will we need for testing? To test an email we need a recipient ("to") address, a subject, a name, and the content.

Let’s also assume there are three types of email templates:

  • advertisements with text/HTML-based content
  • advertisements with a single full-size image
  • advertisements containing links to other sites

The content field will be one of these templates, so we’ll also need the testing fields text content, image, and links.

Having understood the main requirements, we can create the provider as follows:

<?php
namespace Faker\Provider;

class EmailTemplate extends \Faker\Provider\Base
{
    protected static $formats = array(
        '<p>Hello {{name}} </p>
        <p>{{text}}</p>
        <p>Newsletter by Example</p>',

        '<p>{{adImage}}</p>
        <p>Newsletter by Example</p>',

        '<p>Hello {{name}} </p>
        <p>{{link}}</p>
        <p>{{link}}</p>
        <p>{{link}}</p>
        <p>Newsletter by Example</p>'
    );
    protected static $toEmail = array(
        "test@example.com",
        "test1@example.com"
    );
    protected static $name = array("Mark", "Adam");
    protected static $subject = array("Subject 1", "Subject 2");
    protected static $adImage = array("img1.png", "img2.jpg");
    protected static $link = array("link1", "link2");
    protected static $text = array("text1", "text2");

    public static function toEmail() {
        return static::randomElement(static::$toEmail);
    }

    public static function name() {
        return static::randomElement(static::$name);
    }

    // The remaining fields follow the same pattern.
    public static function subject() {
        return static::randomElement(static::$subject);
    }

    public static function adImage() {
        return static::randomElement(static::$adImage);
    }

    public static function link() {
        return static::randomElement(static::$link);
    }

    public static function text() {
        return static::randomElement(static::$text);
    }

    // picks one of the formats at random and fills in its placeholders
    public function template() {
        $format = static::randomElement(static::$formats);
        return $this->generator->parse($format);
    }
}

We have defined three formats to match the three different templates, and then we created data sets for each of the fields we are using in the test data generation process. All the fields should contain formatter methods similar to toEmail() and name() in the above code. The template() method takes one of the formats randomly and fills the necessary data using formatters.

We can get the test data using the code below and pass it to our email application.

<?php
$faker = new Faker\Generator();
$faker->addProvider(new Faker\Provider\EmailTemplate($faker));

$email = $faker->toEmail;
$subject = $faker->subject;
$template = $faker->template;

The advantage of the above technique is that we can test all three formats randomly using a single provider with direct formatter calls. But what if one of these formats is broken, or we have a scenario where we need to test only one of the formats continuously? Commenting out or removing the formats manually isn’t an appealing option.

In this case I would recommend creating separate implementations for each format. We can define a base EmailTemplate class with one format and all of the formatter methods, and then create three different child implementations by extending it. Each child class contains only its own format; the formatters are inherited from the parent class. We can then test each email template on its own by adding only that provider to the Faker generator.
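A minimal sketch of that split might look like the following. The class names BaseEmailTemplate and LinkEmailTemplate are hypothetical, the data sets are trimmed down for brevity, and only one child is shown; the others would differ only in their $formats array.

```php
<?php
require "vendor/autoload.php";

// Hypothetical base provider: one default format plus the shared formatters.
class BaseEmailTemplate extends \Faker\Provider\Base
{
    protected static $formats = array('<p>Hello {{name}}</p><p>{{text}}</p>');
    protected static $name = array("Mark", "Adam");
    protected static $text = array("text1", "text2");
    protected static $link = array("link1", "link2");

    public static function name() {
        return static::randomElement(static::$name);
    }

    public static function text() {
        return static::randomElement(static::$text);
    }

    public static function link() {
        return static::randomElement(static::$link);
    }

    public function template() {
        // static:: resolves to the child's $formats when a child is loaded
        $format = static::randomElement(static::$formats);
        return $this->generator->parse($format);
    }
}

// Hypothetical child: overrides only the format, inherits every formatter.
class LinkEmailTemplate extends BaseEmailTemplate
{
    protected static $formats = array('<p>Hello {{name}}</p><p>{{link}}</p>');
}

// Load only the template we want to exercise in this test run.
$textFaker = new Faker\Generator();
$textFaker->addProvider(new BaseEmailTemplate($textFaker));
$textEmail = $textFaker->template();

$linkFaker = new Faker\Generator();
$linkFaker->addProvider(new LinkEmailTemplate($linkFaker));
$linkEmail = $linkFaker->template();

echo $textEmail . PHP_EOL . $linkEmail . PHP_EOL;
```

Because template() reads static::$formats, the inherited method picks up whichever format array the loaded class declares, so no formatter code is duplicated.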

Consistency of Test Data

Generally we’ll run tests many times and record the data and results. When an error is encountered, we check the database or log files to figure out what the respective data was. Once we’ve fixed the error, it is important to rerun the test cases with the same data that caused the error. Faker supports this: we can replicate the previous data by seeding its random number generator.

Consider the following code:

<?php
$faker = Faker\Factory::create();
$faker->seed(1000);
$faker->name;

We’ve assigned a seed value of 1000. Now, no matter how many times we execute the above script, it will generate the same sequence of random values.

In application testing you should assign a seed to each test case and record it in your logs. Once the errors are fixed, you can look up the seeds of the test cases that failed and run them again with the same data by reusing those seeds.
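One way this workflow could look is sketched below. The test case name and the use of error_log() are assumptions standing in for whatever logging your test suite already has; only seed() comes from Faker itself.

```php
<?php
require "vendor/autoload.php";

$faker = Faker\Factory::create();

// Pick a seed for this test case and record it.
// error_log() is only a stand-in for the test suite's real logger.
$seed = random_int(1, 1000000);
error_log("test case 'signup' (hypothetical name) used seed " . $seed);

$faker->seed($seed);
$first = array($faker->name(), $faker->email());

// Later, after a fix: replay the same test case with the logged seed.
$faker->seed($seed);
$replay = array($faker->name(), $faker->email());

echo ($first === $replay ? "same data" : "different data") . PHP_EOL;
```

The formatters have to be called in the same order on the replay, since the seed fixes the sequence of random values rather than any individual field.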

Conclusion

Generating test data is something you should automate to prevent wasting time unnecessarily. Faker is a simple and powerful solution for generating random test data. The real power of Faker comes with its ability to extend default functionalities to suit more complex implementations.

So what is your test data generation strategy? Do you like to use Faker to automate test data generation? Let me know through the comments section.

  • http://chrisoconnell.info Chris O’Connell

    Where is the Faker library located? I don’t see a link in the article. And why did they not call it Phaker? lol.
    Thanks :-)

  • http://makdiose.com Mak Diose

    Wow, that is so useful. Before, I always asked the client for sample data to start with, which sometimes took time to get. But with this I can now just replicate the desired data structure. Time saver. Perfect for projects. Thanks for the links guys.

    • http://www.innovativephp.com Rakhitha Nimesh

      Hello Mark

      I am glad that you liked the tutorial. Did you use any library or framework previously or just did the data generation manually?

      Thanks