By Christopher Pitt

Writing Async Libraries – Let’s Convert HTML to PDF

This article was peer reviewed by Thomas Punt. Thanks to all of SitePoint’s peer reviewers for making SitePoint content the best it can be!


I can barely remember a conference where the topic of asynchronous PHP wasn’t discussed. I am pleased that it’s so frequently spoken about these days. There’s a secret these speakers aren’t telling, though…

Making asynchronous servers, resolving domain names, interacting with file systems: these are the easy things. Making your own asynchronous libraries is hard. And it’s where you spend most of your time!

Vector image of parallel racing arrows, indicating multi-process execution

Those easy things are easy because they were the proof of concept – to make async PHP competitive with NodeJS. You can see this in how similar their early interfaces were:

var http = require("http");
var server = http.createServer();

server.on("request", function(request, response) {
    response.writeHead(200, {
        "Content-Type": "text/plain"
    });

    response.end("Hello World");
});

server.listen(3000, "127.0.0.1");

This code was tested with Node 7.3.0

require "vendor/autoload.php";

$loop = React\EventLoop\Factory::create();
$socket = new React\Socket\Server($loop);
$server = new React\Http\Server($socket);

$server->on("request", function($request, $response) {
    $response->writeHead(200, [
        "Content-Type" => "text/plain"
    ]);

    $response->end("Hello world");
});

$socket->listen(3000, "127.0.0.1");
$loop->run();

This code was tested with PHP 7.1 and react/http:0.4.2

Today, we’re going to look at a few ways to make your application code work well in an asynchronous architecture. Fret not – your code can still work in a synchronous architecture, so you don’t have to give anything up to learn this new skill. Apart from a bit of time…

You can find the code for this tutorial on GitHub. I’ve tested it with PHP 7.1 and the most recent versions of ReactPHP and Amp.

Promising Theory

There are a few abstractions common to asynchronous code. We’ve already seen one of them: callbacks. Callbacks, by their very name, describe how they treat slow or blocking operations. Synchronous code is fraught with waiting. Ask for something, wait for that thing to happen.

So, instead, asynchronous frameworks and libraries can employ callbacks. Ask for something, and when it happens: the framework or library will call your code back.

In the case of HTTP servers, we don’t preemptively handle all requests. We don’t wait around for requests to happen, either. We simply describe the code that should be called, should a request happen. The event loop takes care of the rest.
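To see that idea without any framework in the way, here is a minimal sketch of callback registration. The `TinyEmitter` class and its `on` and `emit` methods are hypothetical names made up for this example; they only mimic the shape of the `$server->on("request", ...)` call above and are not part of ReactPHP or Amp.

```php
<?php

// Hypothetical, minimal event emitter; not part of ReactPHP or Amp.
final class TinyEmitter
{
    /** @var array<string, callable[]> */
    private $listeners = [];

    // Describe the code that should run if the event ever happens.
    public function on(string $event, callable $listener): void
    {
        $this->listeners[$event][] = $listener;
    }

    // Called by the "framework" when the event actually occurs.
    public function emit(string $event, ...$arguments): void
    {
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $listener(...$arguments);
        }
    }
}

$log = [];
$server = new TinyEmitter();

$server->on("request", function (string $path) use (&$log) {
    $log[] = "handled {$path}";
});

// Nothing has run yet; we have only registered interest.
// Later, the event source calls our code back:
$server->emit("request", "/");
$server->emit("request", "/convert");
```

In a real server, the event loop would be the thing calling `emit`, whenever a socket has a request ready.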

A second common abstraction is promises. Where callbacks are hooks waiting for future events, promises are references to future values. They look something like this:

readFile()
    ->then(function(string $content) {
        print "content: " . $content;
    })
    ->catch(function(Exception $e) {
        print "error: " . $e->getMessage();
    });

It’s a bit more code than callbacks alone, but it’s an interesting approach. We wait for something to happen, and then do another thing. If something goes wrong, we catch the error and respond sensibly. This may look simple, but it’s not spoken about nearly enough.

We’re still using callbacks, but we’ve wrapped them in an abstraction which helps us in other ways. One such benefit is that they allow multiple resolution callbacks…

$promise = readFile();
$promise->then(...)->catch(...);

// ...let's add logging to existing code

$promise->then(function(string $content) use ($logger) {
    $logger->info("file was read");
});
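If you’re curious how several callbacks can hang off one future value, below is a deliberately tiny sketch. `TinyPromise` is a hypothetical class written only for illustration: it handles resolution and nothing else. Real promise implementations also deal with chaining, rejection and error propagation, all of which I’ve left out.

```php
<?php

// Hypothetical, simplified promise: resolution only, no chaining or rejection.
final class TinyPromise
{
    private $resolved = false;
    private $value;
    private $callbacks = [];

    // Register a callback; any number of callbacks may be attached.
    public function then(callable $callback): self
    {
        if ($this->resolved) {
            // The value is already known, so call back immediately.
            $callback($this->value);
        } else {
            $this->callbacks[] = $callback;
        }

        return $this;
    }

    // Called once by the producer when the future value becomes available.
    public function resolve($value): void
    {
        if ($this->resolved) {
            return;
        }

        $this->resolved = true;
        $this->value = $value;

        foreach ($this->callbacks as $callback) {
            $callback($value);
        }

        $this->callbacks = [];
    }
}

$seen = [];
$promise = new TinyPromise();

$promise->then(function (string $content) use (&$seen) {
    $seen[] = "content: " . $content;
});

// ...added later, without touching the first callback
$promise->then(function (string $content) use (&$seen) {
    $seen[] = "file was read";
});

$promise->resolve("hello");
```

Both callbacks receive the same value, in the order they were attached.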

There’s something else I’d like us to focus on. It’s that promises provide a common language – a common abstraction – for thinking about how synchronous code can become asynchronous code.

Let’s take some application code and make it asynchronous, using promises…

Making PDF Files

It’s common for applications to generate some kind of summary document – be it an invoice or stock list. Imagine you have an e-commerce application which processes payments through Stripe. When customers purchase something, you’d like them to be able to download a PDF receipt of that transaction.

There are many ways you could do this, but a really simple approach would be to generate the document using HTML and CSS. You could convert that to a PDF document, and allow the customer to download it.

I needed to do something similar recently. I discovered that there aren’t many good libraries that support this kind of operation. I couldn’t find a single abstraction which would allow me to switch between different HTML → PDF engines. So I started to build my own.

I began thinking about what I needed the abstraction to do. I settled on an interface quite like:

interface Driver
{
    public function html($html = null);
    public function size($size = null);
    public function orientation($orientation = null);
    public function dpi($dpi = null);
    public function render();
}

For the sake of simplicity, I wanted all but the render method to function as both getters and setters. Given this set of expected methods, the next thing to do was to create an implementation, using one possible engine. I added domPDF to my project, and set about using it:

use Dompdf\Dompdf;
use Dompdf\Options;

class DomDriver extends BaseDriver implements Driver
{
    private $options;

    public function __construct(array $options = [])
    {
        $this->options = $options;
    }

    public function render()
    {
        $data = $this->data();
        $custom = $this->options;

        return $this->parallel(
            function() use ($data, $custom) {
                $options = new Options();

                $options->set(
                    "isJavascriptEnabled", true
                );

                $options->set(
                    "isHtml5ParserEnabled", true
                );

                $options->set("dpi", $data["dpi"]);

                foreach ($custom as $key => $value) {
                    $options->set($key, $value);
                }

                $engine = new Dompdf($options);

                $engine->setPaper(
                    $data["size"], $data["orientation"]
                );

                $engine->loadHtml($data["html"]);
                $engine->render();

                return $engine->output();
            }
        );
    }
}

I’m not going to go into the specifics of how to use domPDF. I think the docs do a good enough job of that, allowing me to focus on the async bits of this implementation.

We’ll look at the data and parallel methods in a bit. What’s important about this Driver implementation is that it gathers the data (if any have been set, otherwise defaults) and custom options together. It passes these to a callback we’d like to be run asynchronously.

domPDF isn’t an asynchronous library, and converting HTML → PDF is a notoriously slow process. So how do we make it asynchronous? Well, we could write a completely asynchronous converter, or we could use an existing synchronous converter but run it in a parallel thread or process.

That’s what I made the parallel method for:

abstract class BaseDriver implements Driver
{
    protected $html = "";
    protected $size = "A4";
    protected $orientation = "portrait";
    protected $dpi = 300;

    public function html($html = null)
    {
        return $this->access("html", $html);
    }

    private function access($key, $value = null)
    {
        if (is_null($value)) {
            return $this->$key;
        }

        $this->$key = $value;
        return $this;
    }

    public function size($size = null)
    {
        return $this->access("size", $size);
    }

    public function orientation($orientation = null)
    {
        return $this->access("orientation", $orientation);
    }

    public function dpi($dpi = null)
    {
        return $this->access("dpi", $dpi);
    }

    protected function data()
    {
        return [
            "html" => $this->html,
            "size" => $this->size,
            "orientation" => $this->orientation,
            "dpi" => $this->dpi,
        ];
    }

    protected function parallel(Closure $deferred)
    {
        // TODO
    }
}

Here I implemented the getter-setter methods, figuring that I could reuse them for the next implementation. The data method acts as a shortcut for collecting various document properties into an array, making them easier to pass to anonymous functions.
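The combined getter-setter pattern is easier to judge in isolation. The sketch below uses a hypothetical `Settings` class (not part of the library) with just two properties, so you can see what each call returns.

```php
<?php

// Hypothetical stand-in for a driver, reduced to two properties.
class Settings
{
    protected $size = "A4";
    protected $dpi = 300;

    public function size($size = null)
    {
        return $this->access("size", $size);
    }

    public function dpi($dpi = null)
    {
        return $this->access("dpi", $dpi);
    }

    // No argument: act as a getter. With an argument: set it and return $this.
    private function access($key, $value = null)
    {
        if (is_null($value)) {
            return $this->$key;
        }

        $this->$key = $value;
        return $this;
    }
}

$settings = new Settings();

$default = $settings->size();               // getter, returns the default
$chained = $settings->size("A3")->dpi(150); // setters return the object
$current = [$settings->size(), $settings->dpi()];
```

One trade-off of this style is that null can never be stored as a value, because null is what signals "act as a getter".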

The parallel method started to get interesting:

use Amp\Parallel\Forking\Fork;
use Amp\Parallel\Threading\Thread;

// ...

protected function parallel(Closure $deferred)
{
    if (Fork::supported()) {
        return Fork::spawn($deferred)->join();
    }

    if (Thread::supported()) {
        return Thread::spawn($deferred)->join();
    }

    return null;
}

I’m a huge fan of the Amp project. It’s a collection of libraries supporting asynchronous architecture, and they’re key supporters of the async-interop project.

One of their libraries is called amphp/parallel, and it supports multi-threaded and multi-process code (via Pthreads and Process Control extensions). Those spawn methods return Amp’s implementation of promises. That means the render method can be used like any other promise-returning method:

$promise = $driver
    ->html("<h1>hello world</h1>")
    ->size("A4")->orientation("portrait")->dpi(300)
    ->render();

$results = yield $promise;

This code is a bit loaded. Amp also provides an event loop implementation and all the helper code to be able to convert ordinary PHP generators to coroutines and promises. You can read about how this is even possible, and what it has to do with PHP’s generators in another post I’ve written.

The returned promises are also becoming standardized. Amp returns implementations of the Promise spec. It deviates slightly from the code I showed above, but still performs the same function.

Generators work like coroutines from languages that have them. Coroutines are interruptible functions, which means they can be used to do short bursts of work, and then pause while they wait for something. While paused, other functions can use the system resources.
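To see the pause-and-resume behaviour on its own, here is a small sketch using nothing but plain PHP generators. The `task` and `runAll` functions are hypothetical helpers for this example; `runAll` is a naive round-robin scheduler, not how Amp’s event loop actually works.

```php
<?php

$log = [];

// A generator pauses at every yield and resumes when asked to continue.
function task(string $name, int $steps, array &$log): Generator
{
    for ($i = 1; $i <= $steps; $i++) {
        $log[] = "{$name} step {$i}";
        yield; // hand control back to whoever is driving us
    }
}

// A naive round-robin scheduler: each task gets one burst of work per turn.
function runAll(array $tasks): void
{
    // Start every task: current() runs a fresh generator up to its first yield.
    foreach ($tasks as $task) {
        $task->current();
    }

    // Then keep resuming whichever tasks are still paused at a yield.
    while ($tasks) {
        foreach ($tasks as $key => $task) {
            if ($task->valid()) {
                $task->next();
            } else {
                unset($tasks[$key]);
            }
        }
    }
}

runAll([
    task("a", 2, $log),
    task("b", 3, $log),
]);
```

The two tasks take turns: `$log` ends up interleaved rather than holding all of task a followed by all of task b.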

In practice, this looks like:

use AsyncInterop\Loop;

Loop::execute(
    Amp\wrap(function() {
        $result = yield funcReturnsPromise();
    })
);

This looks way more complicated than just writing synchronous code to begin with. But what it allows for is that other things can happen while we would otherwise be waiting for funcReturnsPromise to complete.

Yielding promises is that abstraction I was talking about. It gives us the framework by which we can make functions that return promises. Code can interact with those promises in predictable and understandable ways.
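Here is a rough sketch of the general technique behind yielding promises: something drives the generator, and each time a promise is yielded it waits for the value and sends it back in. This is not Amp’s implementation. `Deferred` and `coroutine` are hypothetical names, and the sketch assumes the promise resolves after the callback is attached and never fails.

```php
<?php

// Hypothetical stand-in for a promise: holds callbacks until resolve() is called.
final class Deferred
{
    private $callbacks = [];

    public function then(callable $callback): void
    {
        $this->callbacks[] = $callback;
    }

    public function resolve($value): void
    {
        foreach ($this->callbacks as $callback) {
            $callback($value);
        }
    }
}

// Drive a generator: every time it yields a Deferred, wait for that Deferred
// to resolve, then resume the generator with the resolved value.
function coroutine(Generator $generator): void
{
    if (!$generator->valid()) {
        return; // the generator has finished
    }

    $generator->current()->then(function ($value) use ($generator) {
        $generator->send($value);
        coroutine($generator);
    });
}

$pending = new Deferred();
$output = [];

coroutine((function () use ($pending, &$output) {
    $output[] = "before";
    $result = yield $pending; // pause here until $pending resolves
    $output[] = "got {$result}";
})());

// The coroutine is paused, so other code gets to run...
$output[] = "doing other work";

// ...until the value arrives and the coroutine picks up where it left off.
$pending->resolve("pdf bytes");
```

The interesting part is the order of `$output`: the "other work" happens between the two halves of the coroutine.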

Look at what it would be like to render PDF documents using our driver:

use AsyncInterop\Loop;

Loop::execute(Amp\wrap(function() {
    $driver = new DomDriver();

    // this is an AsyncInterop\Promise...
    $promise = $driver
        ->html("<h1>hello world</h1>")
        ->size("A4")->orientation("portrait")->dpi(300)
        ->render();

    $results = yield $promise;

    // write $results to an empty PDF file
}));

This is less useful than, say, generating PDFs in an asynchronous HTTP server. There’s an Amp library called Aerys which makes these kinds of servers easier to create. Using Aerys, you could create the following HTTP server code:

$router = new Aerys\Router();

$router->get("/", function($request, $response) {
    $response->end("<h1>Hello World!</h1>");
});

$router->get("/convert", function($request, $response) {
    $driver = new DomDriver();

    // this is an AsyncInterop\Promise...
    $promise = $driver
        ->html("<h1>hello world</h1>")
        ->size("A4")->orientation("portrait")->dpi(300)
        ->render();

    $results = yield $promise;

    $response
        ->setHeader("Content-type", "application/pdf")
        ->end($results);
});

(new Aerys\Host())
    ->expose("127.0.0.1", 3000)
    ->use($router);

Again, I’m not going to go into the details of Aerys now. It’s an impressive bit of software, well deserving of its own post. You don’t need to understand how Aerys works in order to see how natural our converter’s code looks alongside it.

My Boss Says “No Async!”

Why go through all this trouble, if you’re unsure how often you’ll be able to build asynchronous applications? Writing this code gives us valuable insight into a new programming paradigm. And, just because we’re writing this code as asynchronous doesn’t mean it can’t work in synchronous environments.

To use this code in a synchronous application, all we need to do is move some of the asynchronous code inside:

use AsyncInterop\Loop;

class SyncDriver implements Driver
{
    private $decorated;

    public function __construct(Driver $decorated)
    {
        $this->decorated = $decorated;
    }

    // ...proxy getters/setters to $decorated

    public function render()
    {
        $result = null;

        Loop::execute(
            Amp\wrap(function() use (&$result) {
                $result = yield $this->decorated
                    ->render();
            })
        );

        return $result;
    }
}

Using this decorator, we can write what appears to be synchronous code:

$driver = new SyncDriver(new DomDriver());

// this is a string...
$results = $driver
    ->html("<h1>hello world</h1>")
    ->size("A4")->orientation("portrait")->dpi(300)
    ->render();

// write $results to an empty PDF file

It’s still running the code asynchronously (in the background at least), but none of that is exposed to the consumer. You could use this in a synchronous application, and never know what was going on under the hood.

Supporting Other Frameworks

Amp has a particular set of requirements that make it unsuitable for some environments. For example, the base Amp (event loop) library requires PHP 7.0. The parallel library requires the Pthreads extension or the Process Control extension.

I didn’t want to impose these restrictions on everyone, and wondered how I could support a wider range of systems. The answer was to abstract the parallel execution code into another driver system:

interface Runner
{
    public function run(Closure $deferred);
}

I could implement this for Amp as well as for the (less restrictive, albeit much older) ReactPHP:

use React\ChildProcess\Process;
use SuperClosure\Serializer;

class ReactRunner implements Runner
{
    public function run(Closure $deferred)
    {
        $autoload = $this->autoload();

        $serializer = new Serializer();

        $serialized = base64_encode(
            $serializer->serialize($deferred)
        );

        $raw = "
            require_once '{$autoload}';

            \$serializer = new SuperClosure\Serializer();
            \$serialized = base64_decode('{$serialized}');

            return call_user_func(
                \$serializer->unserialize(\$serialized)
            );
        ";

        $encoded = addslashes(base64_encode($raw));

        $code = sprintf(
            "print eval(base64_decode('%s'));",
            $encoded
        );

        return new Process(sprintf(
            "exec php -r '%s'",
            addslashes($code)
        ));
    }

    private function autoload()
    {
        $dir = __DIR__;
        $suffix = "vendor/autoload.php";

        $path1 = "{$dir}/../../{$suffix}";
        $path2 = "{$dir}/../../../../{$suffix}";

        if (file_exists($path1)) {
            return realpath($path1);
        }

        if (file_exists($path2)) {
            return realpath($path2);
        }
    }
}

I’m used to passing around closures to multi-threaded and multi-process workers, because that’s how Pthreads and Process Control work. Using ReactPHP Process objects is entirely different as they rely on exec for multi-process execution. I decided to implement the same closure functionality I was used to. This isn’t essential to asynchronous code – it’s purely an expression of taste.

The SuperClosure library serializes closures and their bound variables. Most of the code here is what you’d expect to find inside a worker script. In fact, the only way (apart from serializing closures) to use ReactPHP’s child process library is to send tasks to a worker script.
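To show the underlying idea of shipping a task to a separate PHP process, here is a minimal sketch that uses only PHP’s built-in `proc_open`. It sends a string of source code rather than a serialized closure, it reads the result synchronously, and `runInChildProcess` is a hypothetical helper; it assumes a CLI environment where `PHP_BINARY` points at a usable PHP executable.

```php
<?php

// Run a snippet of PHP source in a separate PHP process and return its output.
// This blocks while reading; an async framework would watch the pipe from its
// event loop instead.
function runInChildProcess(string $source): string
{
    $command = escapeshellarg(PHP_BINARY) . " -r " . escapeshellarg($source);

    $process = proc_open($command, [
        1 => ["pipe", "w"], // capture the child's stdout
    ], $pipes);

    if (!is_resource($process)) {
        throw new RuntimeException("could not start child process");
    }

    $output = stream_get_contents($pipes[1]);

    fclose($pipes[1]);
    proc_close($process);

    return $output;
}

// The task travels as base64, so quotes inside it cannot break the
// surrounding PHP string literal in the child's code.
$task = base64_encode('return strtoupper("hello from the worker");');

$result = runInChildProcess(
    "print eval(base64_decode('{$task}'));"
);
```

The child evaluates the decoded task and prints whatever it returns, which the parent then reads from the pipe.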

Now, instead of loading our drivers with $this->parallel and Amp-specific code, we can pass runner implementations around. As async code, this resembles:

use React\EventLoop\Factory;

$driver = new DomDriver();

$runner = new ReactRunner();

// this is a React\ChildProcess\Process...
$process = $driver
    ->html("<h1>hello world</h1>")
    ->size("A4")->orientation("portrait")->dpi(300)
    ->render($runner);

$loop = Factory::create();

$process->on("exit", function() use ($loop) {
    $loop->stop();
});

$loop->addTimer(0.001, function($timer) use ($process) {
    $process->start($timer->getLoop());

    $process->stdout->on("data", function($results) {
        // write $results to an empty PDF file
    });
});

$loop->run();

Don’t be alarmed by how different this ReactPHP code looks from the Amp code. ReactPHP doesn’t implement the same coroutine foundation as Amp does. Instead, ReactPHP favors callbacks for most things. This code is still just running the PDF conversion in parallel, and returning the resulting PDF data.

With runners abstracted, we can use any asynchronous framework we’d like, and we can expect the abstractions of that framework to be returned by the driver we’re using.
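One way to picture this separation is with the simplest possible runner. In the sketch below, `InlineRunner` and `FakeDriver` are hypothetical classes invented for illustration: the runner just calls the closure in the current process, and the "rendering" is faked so the example has no dependencies.

```php
<?php

interface Runner
{
    public function run(Closure $deferred);
}

// Hypothetical runner for tests or plain synchronous scripts:
// it simply calls the closure in the current process.
final class InlineRunner implements Runner
{
    public function run(Closure $deferred)
    {
        return $deferred();
    }
}

// Hypothetical, stripped-down driver with a fake render step.
final class FakeDriver
{
    private $html = "";

    public function html(string $html): self
    {
        $this->html = $html;
        return $this;
    }

    public function render(Runner $runner)
    {
        $html = $this->html;

        // The driver only describes the work. The runner decides how to run
        // it and what comes back (a string, a promise, a process...).
        return $runner->run(function () use ($html) {
            return "%PDF-fake " . strip_tags($html);
        });
    }
}

$result = (new FakeDriver())
    ->html("<h1>hello world</h1>")
    ->render(new InlineRunner());
```

Swapping `InlineRunner` for an Amp or ReactPHP runner would change the type of `$result`, but the driver itself wouldn’t need to change.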

Can I Use This?

What started out as an experiment became a multi-driver, multi-runner HTML → PDF library called Paper. It’s like the HTML → PDF equivalent of Flysystem, but it’s also a good example of how to write asynchronous libraries.

As you try to make async PHP applications, you’re going to find gaps in the library ecosystem. Don’t be discouraged by these! Instead, take the opportunity to think about how you’d make your own asynchronous libraries, using the abstractions ReactPHP and Amp provide.

Have you built an interesting async PHP application or library recently? Let us know in the comments.

  • Xu Ding

    Hi , thanks for the article.

    But from user’s point of view, they will still have to wait until the request is finished?

    • Chris

      That’s mostly true. The kind of higher-concurrency that async and/or parallel execution cause isn’t about making things faster for the individual user, but rather allowing more users at the same time. Although, if the system has fewer resources tied up in synchronous waiting, they could be used to speed up the processing of other tasks. That’s much harder to measure though…

      • Xu Ding

        Thanks for answering. Guess for my case, should just a worker queue and notify user when the job is done.

  • Thank you Christopher, for such a nice article and motivating developers to think out of the box, and start working on making their async application with PHP.

    • Chris

      You’re welcome! :)
