PHP - - By Christopher Pitt

ReactJS in PHP: Writing Compilers Is Easy and Fun!

I used to use an extension called XHP. It enables HTML-in-PHP syntax for generating front-end markup. I reached for it recently, and was surprised to find that it was no longer officially supported for modern PHP versions.

So, I decided to implement a user-land version of it, using a basic state-machine compiler. It seemed like it would be a fun project to do with you!

The code for this tutorial can be found on Github.

Abstract image of blocks coming together

Creating Compilers

Many developers avoid writing their own compilers or interpreters, thinking that the topic is too complex or difficult to explore properly. I used to feel like that too. Compilers can be difficult to make well, and the topic can be incredibly complex and difficult. But, that doesn’t mean you can’t make a compiler.

Making a compiler is like making a sandwich. Anyone can get the ingredients and put it together. You can make a sandwich. You can also go to chef school and learn how to make the best damn sandwich the world has ever seen. You can study the art of sandwich making for years, and people can talk about your sandwiches in other lands. You’re not going to let the breadth and complexity of sandwich-making prevent you from making your first sandwich, are you?

Compilers (and interpreters) begin with humble string manipulation and temporary variables. When they’re sufficiently popular (or sufficiently slow) then the experts can step in; to replace the string manipulation and temporary variables with unicorn tears and cynicism.

At a fundamental level, compilers take a string of code and run it through a couple of steps:

  1. The code is split into tokens – meaningful characters and sub-strings – which the compiler will use to derive meaning. The statement if (isEmergency) alert("there is an emergency") could be considered to contain tokens like if, isEmergency, alert, and "there is an emergency"; and these all mean something to the compiler.

    The first step is to split the entire source code up into these meaningful bits, so that the compiler can start to organize them in a logical hierarchy, so it knows what to do with the code.

  2. The tokens are arranged into the logical hierarchy (sometimes called an Abstract Syntax Tree) which represents what needs to be done in the program. The previous statement could be understood as “Work out if the condition (isEmergency) evaluates to true. If it does, run the function (alert) with the parameter ("there is an emergency")”.

Using this hierarchy, the code can be immediately executed (in the case of an interpreter or virtual machine) or translated into other languages (in the case of languages like CoffeeScript and TypeScript, which are both compile-to-Javascript languages).

In our case, we want to maintain most of the PHP syntax, but we also want to add our own little bit of syntax on top. We could create a whole new interpreter…or we could preprocess the new syntax, compiling it to syntactically valid PHP code.

I’ve written about preprocessing PHP before, and it’s my favorite approach to adding new syntax. In this case, we need to write a more complex script; so we’re going to deviate from how we’ve previously added new syntax.

Generating Tokens

Let’s create a function to split code into tokens. It begins like this:

function tokens($code) {
    $tokens = [];

    $length = strlen($code);
    $cursor = 0;

    while ($cursor < $length) {
        if ($code[$cursor] === "{") {
            print "ATTRIBUTE STARTED ({$cursor})" . PHP_EOL;
        }

        if ($code[$cursor] === "}") {
            print "ATTRIBUTE ENDED ({$cursor})" . PHP_EOL;
        }

        if ($code[$cursor] === "<") {
            print "ELEMENT STARTED ({$cursor})" . PHP_EOL;
        }

        if ($code[$cursor] === ">") {
            print "ELEMENT ENDED ({$cursor})" . PHP_EOL;
        }

        $cursor++;
    }
}

$code = '
    <?php

    $classNames = "foo bar";
    $message = "hello world";

    $thing = (
        <div
            className={() => { return "outer-div"; }}
            nested={<span className={"nested-span"}>with text</span>}
        >
            a bit of text before
            <span>
                {$message} with a bit of extra text
            </span>
            a bit of text after
        </div>
    );
';

tokens($code);

// ELEMENT STARTED (5)
// ELEMENT STARTED (95)
// ATTRIBUTE STARTED (122)
// ELEMENT ENDED (127)
// ATTRIBUTE STARTED (129)
// ATTRIBUTE ENDED (151)
// ATTRIBUTE ENDED (152)
// ATTRIBUTE STARTED (173)
// ELEMENT STARTED (174)
// ATTRIBUTE STARTED (190)
// ATTRIBUTE ENDED (204)
// ELEMENT ENDED (205)
// ELEMENT STARTED (215)
// ELEMENT ENDED (221)
// ATTRIBUTE ENDED (222)
// ELEMENT ENDED (232)
// ELEMENT STARTED (279)
// ELEMENT ENDED (284)
// ATTRIBUTE STARTED (302)
// ATTRIBUTE ENDED (311)
// ELEMENT STARTED (350)
// ELEMENT ENDED (356)
// ELEMENT STARTED (398)
// ELEMENT ENDED (403)

This is from tokens-1.php

We’re off to a good start. By stepping through the code, we can check to see what each character is (and identify the ones that matter to us). We’re seeing, for instance, that the first element opens when we encounter a < character, at index 5. The first element closes at index 210.

Unfortunately, that first opening is being incorrectly matched to <?php. That’s not an element in our new syntax, so we have to stop the code from picking it out:

preg_match("#^</?[a-zA-Z]#", substr($code, $cursor, 3), $matchesStart);

if (count($matchesStart)) {
    print "ELEMENT STARTED ({$cursor})" . PHP_EOL;
}

// ...

// ELEMENT STARTED (95)
// ATTRIBUTE STARTED (122)
// ELEMENT ENDED (127)
// ATTRIBUTE STARTED (129)
// ATTRIBUTE ENDED (151)
// ATTRIBUTE ENDED (152)
// ATTRIBUTE STARTED (173)
// ELEMENT STARTED (174)
// ...

This is from tokens-2.php

Instead of checking only the current character, our new code checks three characters: if they match the pattern <div or </div, but not <?php or $num1 < $num2.

There’s another problem: our example uses arrow function syntax, so => is being matched as an element closing sequence. Let’s refine how we match element closing sequences:

preg_match("#^=>#", substr($code, $cursor - 1, 2), $matchesEqualBefore);
preg_match("#^>=#", substr($code, $cursor, 2), $matchesEqualAfter);

if ($code[$cursor] === ">" && !$matchesEqualBefore && !$matchesEqualAfter) {
    print "ELEMENT ENDED ({$cursor})" . PHP_EOL;
}

// ...

// ELEMENT STARTED (95)
// ATTRIBUTE STARTED (122)
// ATTRIBUTE STARTED (129)
// ATTRIBUTE ENDED (151)
// ATTRIBUTE ENDED (152)
// ATTRIBUTE STARTED (173)
// ELEMENT STARTED (174)
// ...

This is from tokens-3.php

As with JSX, it would be good for attributes to allow dynamic values (even if those values are nested JSX elements). There are a few ways we could do this, but the one I prefer is to treat all attributes as text, and tokenize them recursively. To do this, we need to have a kind of state machine which tracks how many levels deep we are in an element and attribute. If we’re inside an element tag, we should trap the top level {…} as a string attribute value, and ignore subsequent braces. Similarly, if we’re inside an attribute, we should ignore nested element opening and closing sequences:

function tokens($code) {
    $tokens = [];

    $length = strlen($code);
    $cursor = 0;

    $elementLevel = 0;
    $elementStarted = null;
    $elementEnded = null;

    $attributes = [];
    $attributeLevel = 0;
    $attributeStarted = null;
    $attributeEnded = null;

    while ($cursor < $length) {
        $extract = trim(substr($code, $cursor, 5)) . "...";

        if ($code[$cursor] === "{" && $elementStarted !== null) {
            if ($attributeLevel === 0) {
                print "ATTRIBUTE STARTED ({$cursor}, {$extract})" . PHP_EOL;
                $attributeStarted = $cursor;
            }

            $attributeLevel++;
        }

        if ($code[$cursor] === "}" && $elementStarted !== null) {
            $attributeLevel--;

            if ($attributeLevel === 0) {
                print "ATTRIBUTE ENDED ({$cursor})" . PHP_EOL;
                $attributeEnded = $cursor;
            }
        }

        preg_match("#^</?[a-zA-Z]#", substr($code, $cursor, 3), $matchesStart);

        if (count($matchesStart) && $attributeLevel < 1) {
            print "ELEMENT STARTED ({$cursor}, {$extract})" . PHP_EOL;

            $elementLevel++;
            $elementStarted = $cursor;
        }

        preg_match("#^=>#", substr($code, $cursor - 1, 2), $matchesEqualBefore);
        preg_match("#^>=#", substr($code, $cursor, 2), $matchesEqualAfter);

        if (
            $code[$cursor] === ">"
            && !$matchesEqualBefore && !$matchesEqualAfter
            && $attributeLevel < 1
        ) {
            print "ELEMENT ENDED ({$cursor})" . PHP_EOL;

            $elementLevel--;
            $elementEnded = $cursor;
        }

        if ($elementStarted && $elementEnded) {
            // TODO

            $elementStarted = null;
            $elementEnded = null;
        }

        $cursor++;
    }
}

// ...

// ELEMENT STARTED (95, <div...)
// ATTRIBUTE STARTED (122, {() =...)
// ATTRIBUTE ENDED (152)
// ATTRIBUTE STARTED (173, {<spa...)
// ATTRIBUTE ENDED (222)
// ELEMENT ENDED (232)
// ELEMENT STARTED (279, <span...)
// ELEMENT ENDED (284)
// ELEMENT STARTED (350, </spa...)
// ELEMENT ENDED (356)
// ELEMENT STARTED (398, </div...)
// ELEMENT ENDED (403)

This is from tokens-4.php

We’ve added new $attributeLevel, $attributeStarted, and $attributeEnded variables; to track how deep we are in the nesting of attributes, and where the top-level starts and ends. Specifically, if we’re at the top level when an attribute’s value starts or ends, we capture the current cursor position. Later, we’ll use this to extract the string attribute value and replace it with a placeholder.

We’re also starting to capture $elementStarted and $elementEnded (with $elementLevel fulfilling a similar role to $attributeLevel) so that we can capture a full element opening or closing tag. In this case, $elementEnded doesn’t refer to the closing tag but rather the closing sequence of characters of the opening tag. Closing tags are treated as entirely separate tokens…

After extracting a small substring after the current cursor position, we can see elements and attributes starting and ending exactly where we expect. The nested control structures and elements are captured as strings, leaving only the top-level elements, non-attribute nested elements, and attribute values.

Let’s package these tokens up, associating attributes with the tags in which they are defined:

function tokens($code) {
    $tokens = [];

    $length = strlen($code);
    $cursor = 0;

    $elementLevel = 0;
    $elementStarted = null;
    $elementEnded = null;

    $attributes = [];
    $attributeLevel = 0;
    $attributeStarted = null;
    $attributeEnded = null;

    $carry = 0;

    while ($cursor < $length) {
        if ($code[$cursor] === "{" && $elementStarted !== null) {
            if ($attributeLevel === 0) {
                $attributeStarted = $cursor;
            }

            $attributeLevel++;
        }

        if ($code[$cursor] === "}" && $elementStarted !== null) {
            $attributeLevel--;

            if ($attributeLevel === 0) {
                $attributeEnded = $cursor;
            }
        }

        if ($attributeStarted && $attributeEnded) {
            $position = (string) count($attributes);
            $positionLength = strlen($position);

            $attribute = substr(
                $code, $attributeStarted + 1, $attributeEnded - $attributeStarted - 1
            );

            $attributes[$position] = $attribute;

            $before = substr($code, 0, $attributeStarted + 1);
            $after = substr($code, $attributeEnded);

            $code = $before . $position . $after;

            $cursor = $attributeStarted + $positionLength + 2 /* curlies */;
            $length = strlen($code);

            $attributeStarted = null;
            $attributeEnded = null;

            continue;
        }

        preg_match("#^</?[a-zA-Z]#", substr($code, $cursor, 3), $matchesStart);

        if (count($matchesStart) && $attributeLevel < 1) {
            $elementLevel++;
            $elementStarted = $cursor;
        }

        preg_match("#^=>#", substr($code, $cursor - 1, 2), $matchesEqualBefore);
        preg_match("#^>=#", substr($code, $cursor, 2), $matchesEqualAfter);

        if (
            $code[$cursor] === ">"
            && !$matchesEqualBefore && !$matchesEqualAfter
            && $attributeLevel < 1
        ) {
            $elementLevel--;
            $elementEnded = $cursor;
        }

        if ($elementStarted !== null && $elementEnded !== null) {
            $distance = $elementEnded - $elementStarted;

            $carry += $cursor;

            $before = trim(substr($code, 0, $elementStarted));
            $tag = trim(substr($code, $elementStarted, $distance + 1));
            $after = trim(substr($code, $elementEnded + 1));

            $token = ["tag" => $tag, "started" => $carry];

            if (count($attributes)) {
                $token["attributes"] = $attributes;
            }

            $tokens[] = $before;
            $tokens[] = $token;

            $attributes = [];

            $code = $after;
            $length = strlen($code);
            $cursor = 0;

            $elementStarted = null;
            $elementEnded = null;

            continue;
        }

        $cursor++;
    }

    return $tokens;
}

$code = '
    <?php

    $classNames = "foo bar";
    $message = "hello world";

    $thing = (
        <div
            className={() => { return "outer-div"; }}
            nested={<span className={"nested-span"}>with text</span>}
        >
            a bit of text before
            <span>
                {$message} with a bit of extra text
            </span>
            a bit of text after
        </div>
    );
';

tokens($code);

// Array
// (
//     [0] => <?php
//
//     $classNames = "foo bar";
//     $message = "hello world";
//
//     $thing = (
//     [1] => Array
//         (
//             [tag] => <div className={0} nested={1}>
//             [started] => 157
//             [attributes] => Array
//                 (
//                     [0] => () => { return "outer-div"; }
//                     [1] => <span className={"nested-span"}>with text</span>
//                 )
//
//         )
//
//     [2] => a bit of text before
//     [3] => Array
//         (
//             [tag] => <span>
//             [started] => 195
//         )
//
//     [4] => {$message} with a bit of extra text
//     [5] => Array
//         (
//             [tag] => </span>
//             [started] => 249
//         )
//
//     [6] => a bit of text after
//     [7] => Array
//         (
//             [tag] => </div>
//             [started] => 282
//         )
//
// )

This is from tokens-5.php

There’s a lot going on here, but it’s all just a natural progression from the previous version. We use the captured attribute start and end positions to extract the entire attribute value as one big string. We then replace each captured attribute with a numeric placeholder and reset the code string and cursor positions.

As each element closes, we associate all the attributes since the element was opened, and create a separate array token from the tag (with its placeholders), attributes and starting position. The result may be a little harder to read, but it is spot on in terms of capturing the intent of the code.

So, what do we do about those nested element attributes?

function tokens($code) {
    // ...

    while ($cursor < $length) {
        // ...

        if ($elementStarted !== null && $elementEnded !== null) {
            // ...

            foreach ($attributes as $key => $value) {
                $attributes[$key] = tokens($value);
            }

            if (count($attributes)) {
                $token["attributes"] = $attributes;
            }

            // ...
        }

        $cursor++;
    }

    $tokens[] = trim($code);

    return $tokens;
}

// ...

// Array
// (
//     [0] => <?php
//
//     $classNames = "foo bar";
//     $message = "hello world";
//
//     $thing = (
//     [1] => Array
//         (
//             [tag] => <div className={0} nested={1}>
//             [started] => 157
//             [attributes] => Array
//                 (
//                     [0] => Array
//                         (
//                             [0] => () => { return "outer-div"; }
//                         )
//
//                     [1] => Array
//                         (
//                             [1] => Array
//                                 (
//                                     [tag] => <span className={0}>
//                                     [started] => 19
//                                     [attributes] => Array
//                                         (
//                                             [0] => Array
//                                                 (
//                                                     [0] => "nested-span"
//                                                 )
//
//                                         )
//
//                                 )
//
//                            [2] => with text
//                            [3] => Array
//                                (
//                                    [tag] => </span>
//                                    [started] => 34
//                                )
//                         )
//
//                 )
//
//         )
//
// ...

This is from tokens-5.php (modified)

Before we associate the attributes, we loop through them and tokenize their values with a recursive function call. We also need to append any remaining text (not inside an attribute or element tag) to the tokens array or it will be ignored.

The result is a list of tokens which can have nested lists of tokens. It’s almost an AST already.

Organizing Tokens

Let’s transform this list of tokens into something more like an AST. The first step is to exclude closing tags that match opening tags. We need to identify which tokens are tags:

function nodes($tokens) {
    $cursor = 0;
    $length = count($tokens);

    while ($cursor < $length) {
        $token = $tokens[$cursor];

        if (is_array($token)) {
            print $token["tag"] . PHP_EOL;
        }

        $cursor++;
    }
}

$tokens = [
    0 => '<?php

    $classNames = "foo bar";
    $message = "hello world";

    $thing = (',
    1 => [
        'tag' => '<div className={0} nested={1}>',
        'started' => 157,
        'attributes' => [
            0 => [
                0 => '() => { return "outer-div"; }',
            ],
            1 => [
                1 => [
                    'tag' => '<span className={0}>',
                    'started' => 19,
                    'attributes' => [
                        0 => [
                            0 => '"nested-span"',
                        ],
                    ],
                ],
                2 => 'with text</span>',
            ],
        ],
    ],
    2 => 'a bit of text before',
    3 => [
        'tag' => '<span>',
        'started' => 195,
    ],
    4 => '{$message} with a bit of extra text',
    5 => [
        'tag' => '</span>',
        'started' => 249,
    ],
    6 => 'a bit of text after',
    7 => [
        'tag' => '</div>',
        'started' => 282,
    ],
    8 => ');',
];

nodes($tokens);

// <div className={0} nested={1}>
// <span>
// </span>
// </div>

This is from nodes-1.php

I’ve extracted a list of tokens from the last token script, so that I don’t have to run and debug that function anymore. Inside a loop, similar to the one we used during tokenization, we print just the non-attribute element tags. Let’s figure out if they’re opening or closing tags, and also whether the closing tags match the opening ones:

function nodes($tokens) {
    $cursor = 0;
    $length = count($tokens);

    while ($cursor < $length) {
        $token = $tokens[$cursor];

        if (is_array($token) && $token["tag"][1] !== "/") {
            preg_match("#^<([a-zA-Z]+)#", $token["tag"], $matches);

            print "OPENING {$matches[1]}" . PHP_EOL;
        }

        if (is_array($token) && $token["tag"][1] === "/") {
            preg_match("#^</([a-zA-Z]+)#", $token["tag"], $matches);

            print "CLOSING {$matches[1]}" . PHP_EOL;
        }

        $cursor++;
    }

    return $tokens;
}

// ...

// OPENING div
// OPENING span
// CLOSING span
// CLOSING div

This is from nodes-1.php (modified)

Now that we know which tags are opening tags and which ones are closing ones; we can use reference variables to construct a tree:

function nodes($tokens) {
    $nodes = [];
    $current = null;

    $cursor = 0;
    $length = count($tokens);

    while ($cursor < $length) {
        $token =& $tokens[$cursor];

        if (is_array($token) && $token["tag"][1] !== "/") {
            preg_match("#^<([a-zA-Z]+)#", $token["tag"], $matches);

            if ($current !== null) {
                $token["parent"] =& $current;
                $current["children"][] =& $token;
            } else {
                $token["parent"] = null;
                $nodes[] =& $token;
            }

            $current =& $token;
            $current["name"] = $matches[1];
            $current["children"] = [];

            if (isset($current["attributes"])) {
                foreach ($current["attributes"] as $key => $value) {
                    $current["attributes"][$key] = nodes($value);
                }

                $current["attributes"] = array_map(function($item) {

                    foreach ($item as $value) {
                        if (isset($value["tag"])) {
                            return $value;
                        }
                    }

                    foreach ($item as $value) {
                        if (!empty($value["token"])) {
                            return $value;
                        }
                    }

                    return null;

                }, $current["attributes"]);
            }
        }

        else if (is_array($token) && $token["tag"][1] === "/") {
            preg_match("#^</([a-zA-Z]+)#", $token["tag"], $matches);

            if ($current === null) {
                throw new Exception("no open tag");
            }

            if ($matches[1] !== $current["name"]) {
                throw new Exception("no matching open tag");
            }

            if ($current !== null) {
                $current =& $current["parent"];
            }
        }

        else if ($current !== null) {
            array_push($current["children"], [
                "parent" => &$current,
                "token" => &$token,
            ]);
        }

        else {
            array_push($nodes, [
                "token" => $token,
            ]);
        }

        $cursor++;
    }

    return $nodes;
}

// ...

// Array
// (
//     [0] => Array
//         (
//             [token] => <?php
//
//     $classNames = "foo bar";
//     $message = "hello world";
//
//     $thing = (
//         )
//
//     [1] => Array
//         (
//             [tag] => <div className={0} nested={1}>
//             [started] => 157
//             [attributes] => Array
//                 (
//                     [0] => Array
//                         (
//                             [token] => () => { return "outer-div"; }
//                         )
//
//                     [1] => Array
//                         (
//                             [tag] => <span className={0}>
//                             [started] => 19
//                             [attributes] => Array
//                                 (
//                                     [0] => Array
//                                         (
//                                             [token] => "nested-span"
//                                         )
//
//                                 )
//
//                             [parent] =>
//                             [name] => span
//                             [children] => Array
//                                 (
//                                     [0] => Array
//                                         (
//                                             [parent] => *RECURSION*
//                                             [token] => with text
//                                         )
//
//                                 )
//
//                         )
//
//                 )
//
//             [parent] =>
//             [name] => div
//             [children] => Array
//                 (
//                     [0] => Array
//                         (
//                             [parent] => *RECURSION*
//                             [token] => a bit of text before
//                         )
//
//                     [1] => Array
//                         (
//                             [tag] => <span>
//                             [started] => 195
//                             [parent] => *RECURSION*
//                             [name] => span
//                             [children] => Array
//                                 (
//                                     [0] => Array
//                                         (
//                                             [parent] => *RECURSION*
//                                             [token] => {$message} with ...
//                                         )
//
//                                 )
//
//                         )
//
//                     [2] => Array
//                         (
//                             [parent] => *RECURSION*
//                             [token] => a bit of text after
//                         )
//
//                 )
//
//         )
//
//     [2] => Array
//         (
//             [token] => );
//         )
//
// )

This is from nodes-2.php

Take some time to study what’s going on here. We create a $nodes array, in which to store the new, organized node structures. We also have a $current variable, to which we assign each opening tag node by reference. This way, we can step down into each element (opening tag, closing tag, and the tokens in between); as well as stepping back up when we encounter a closing tag.

The references are the most tricky part about this, but they’re essential to keeping the code relatively simple. I mean, it’s not that simple; but it is much simpler than a non-reference version.

We don’t have the cleanest function in terms of how it works recursively. So, when we pass the attributes through the nodes function, we sometimes get empty “token” attributes alongside nested tag attributes. Because of this, we need to filter the attributes to first try and return a nested tag before returning a non-empty token attribute value. This could be cleaned up quite a bit…

Rewriting Code

Now that the code is neatly arranged in a hierarchy or AST, we can rewrite it into valid PHP code. Let’s begin by writing just the string tokens (which aren’t nested inside elements), and formatting the resulting code:

function parse($nodes) {
    $code = "";

    foreach ($nodes as $node) {
        if (isset($node["token"])) {
            $code .= $node["token"] . PHP_EOL;
        }
    }

    return $code;
}

$nodes = [
    0 => [
        'token' => '<?php

        $classNames = "foo bar";
        $message = "hello world";

        $thing = (',
    ],
    1 => [
        'tag' => '<div className={0} nested={1}>',
        'started' => 157,
        'attributes' => [
            0 => [
                'token' => '() => { return "outer-div"; }',
            ],
            1 => [
                'tag' => '<span className={0}>',
                'started' => 19,
                'attributes' => [
                    0 => [
                        'token' => '"nested-span"',
                    ],
                ],
                'name' => 'span',
                'children' => [
                    0 => [
                        'token' => 'with text',
                    ],
                ],
            ],
        ],
        'name' => 'div',
        'children' => [
            0 => [
                'token' => 'a bit of text before',
            ],
            1 => [
                'tag' => '<span>',
                'started' => 195,
                'name' => 'span',
                'children' => [
                    0 => [
                        'token' => '{$message} with a bit of extra text',
                    ],
                ],
            ],
            2 => [
                'token' => 'a bit of text after',
            ],
        ],
    ],
    2 => [
        'token' => ');',
    ],
];

parse($nodes);

// <?php
//
// $classNames = "foo bar";
// $message = "hello world";
//
// $thing = (
// );

This is from parser-1.php

I’ve copied the nodes extracted from the previous script, so we don’t have to debug or reuse that function again. Let’s deal with the elements as well:

require __DIR__ . "/vendor/autoload.php";

function parse($nodes) {
    $code = "";

    foreach ($nodes as $node) {
        if (isset($node["token"])) {
            $code .= $node["token"] . PHP_EOL;
        }

        if (isset($node["tag"])) {
            $props = [];
            $attributes = [];
            $elements = [];

            if (isset($node["attributes"])) {
                foreach ($node["attributes"] as $key => $value) {
                    if (isset($value["token"])) {
                        $attributes["attr_{$key}"] = $value["token"];
                    }

                    if (isset($value["tag"])) {
                        $elements[$key] = true;
                        $attributes["attr_{$key}"] = parse([$value]);
                    }
                }
            }

            preg_match_all("#([a-zA-Z]+)={([^}]+)}#", $node["tag"], $dynamic);
            preg_match_all("#([a-zA-Z]+)=[']([^']+)[']#", $node["tag"], $static);

            if (count($dynamic[0])) {
                foreach($dynamic[1] as $key => $value) {
                    $props["{$value}"] = $attributes["attr_{$key}"];
                }
            }

            if (count($static[1])) {
                foreach($static[1] as $key => $value) {
                    $props["{$value}"] = $static[2][$key];
                }
            }

            $code .= "pre_" . $node["name"] . "([" . PHP_EOL;

            foreach ($props as $key => $value) {
                $code .= "'{$key}' => {$value}," . PHP_EOL;
            }

            $code .= "])" . PHP_EOL;
        }
    }

    $code = Pre\Plugin\expand($code);
    $code = Pre\Plugin\formatCode($code);

    return $code;
}

// ...

// <?php
//
// $classNames = "foo bar";
// $message = "hello world";
//
// $thing = (
//     pre_div([
//         'className' => function () {
//             return "outer-div";
//         },
//         'nested' => pre_span([
//             'className' => "nested-span",
//         ]),
//     ])
// );

This is from parser-2.php

When we find a tag node, we loop through the attributes and build a new attributes array that is either just text from token nodes or parsed tags from tag nodes. This bit of recursion deals with the possibility of attributes that are nested elements. Our regular expression only handles attributes quoted with single quotes (for the sake of simplicity). Feel free to make a more comprehensive expression, to handle more complex attribute syntax and values.

I went ahead and installed pre/short-closures, so that the arrow function would be expanded to a regular function:

composer require pre/short-closures

There’s also a handle PSR-2 formatting function in there, so our code is formatted according to the standard.

Finally, we need to deal with children:

require __DIR__ . "/vendor/autoload.php";

function parse($nodes) {
    $code = "";

    foreach ($nodes as $node) {
        if (isset($node["token"])) {
            $code .= $node["token"] . PHP_EOL;
        }

        if (isset($node["tag"])) {
            // ...

            $children = [];

            foreach ($node["children"] as $child) {
                if (isset($child["tag"])) {
                    $children[] = parse([$child]);
                }

                else {
                    $children[] = "\"" . addslashes($child["token"]) . "\"";
                }
            }

            $props["children"] = $children;

            $code .= "pre_" . $node["name"] . "([" . PHP_EOL;

            foreach ($props as $key => $value) {
                if ($key === "children") {
                    $code .= "\"children\" => [" . PHP_EOL;

                    foreach ($children as $child) {
                        $code .= "{$child}," . PHP_EOL;
                    }

                    $code .= "]," . PHP_EOL;
                }

                else {
                    $code .= "\"{$key}\" => {$value}," . PHP_EOL;
                }
            }

            $code .= "])" . PHP_EOL;
        }
    }

    $code = Pre\Plugin\expand($code);
    $code = Pre\Plugin\formatCode($code);

    return $code;
}

// ...

// <?php
//
// $classNames = "foo bar";
// $message = "hello world";
//
// $thing = (
//     pre_div([
//         "className" => function () {
//             return "outer-div";
//         },
//         "nested" => pre_span([
//             "className" => "nested-span",
//             "children" => [
//                 "with text",
//             ],
//         ]),
//         "children" => [
//             "a bit of text before",
//             pre_span([
//                 "children" => [
//                     "{$message} with a bit of extra text",
//                 ],
//             ]),
//             "a bit of text after",
//         ],
//     ])
// );

This is from parser-3.php

We parse each tag child, and directly quote each token child (adding slashes to account for nested quotes). Then, when we’re building the parameter array; we loop over the children and add each to the string of code our parse function ultimately returns.

Each tag is converted to an equivalent pre_div or pre_span function. This is a placeholder mechanism for a larger, underlying primitive element system. We can demonstrate this by stubbing those functions:

require __DIR__ . "/vendor/autoload.php";

function pre_div($props) {
    $code = "<div";

    if (isset($props["className"])) {
        if (is_callable($props["className"])) {
            $class = $props["className"]();
        }

        else {
            $class = $props["className"];
        }

        $code .= " class='{$class}'";
    }

    $code .= ">";

    foreach ($props["children"] as $child) {
        $code .= $child;
    }

    $code .= "</div>";

    return trim($code);
}

function pre_span($props) {
    $code = pre_div($props);
    $code = preg_replace("#^<div#", "<span", $code);
    $code = preg_replace("#div>$#", "span>", $code);

    return $code;
}

function parse($nodes) {
    // ...
}

$nodes = [
    0 => [
        'token' => '<?php

        $classNames = "foo bar";
        $message = "hello world";

        $thing = (',
    ],
    1 => [
        'tag' => '<div className={0} nested={1}>',
        'started' => 157,
        'attributes' => [
            0 => [
                'token' => '() => { return $classNames; }',
            ],
            1 => [
                'tag' => '<span className={0}>',
                'started' => 19,
                'attributes' => [
                    0 => [
                        'token' => '"nested-span"',
                    ],
                ],
                'name' => 'span',
                'children' => [
                    0 => [
                        'token' => 'with text',
                    ],
                ],
            ],
        ],
        'name' => 'div',
        'children' => [
            0 => [
                'token' => 'a bit of text before',
            ],
            1 => [
                'tag' => '<span>',
                'started' => 195,
                'name' => 'span',
                'children' => [
                    0 => [
                        'token' => '{$message} with a bit of extra text',
                    ],
                ],
            ],
            2 => [
                'token' => 'a bit of text after',
            ],
        ],
    ],
    2 => [
        'token' => ');',
    ],
    3 => [
        'token' => 'print $thing;',
    ],
];

eval(substr(parse($nodes), 5));

// <div class='foo bar'>
//     a bit of text before
//     <span>
//         hello world with a bit of extra text
//     </span>
//     a bit of text after
// </div>

This is from parser-4.php

I’ve modified the input nodes, so that $thing will be printed. If we implement a naive version of pre_div and pre_span then this code executes successfully. It’s actually hard to believe, given how little code we’ve actually written…

Integrating with Pre

The question is: what do we with with this?

It’s an interesting experiment, but it’s not very usable. What would be better is to have a way to drop this into an existing project, and experiment with component-based design in the real world. To this end, I extended Pre to allow for custom compilers (along with the custom macro definitions it already allows).

Then, I packaged the tokens, nodes, and parse functions into a re-usable library. It took quite a while to do this and, between the time I first created the functions and built an example application using them, I improved them quite a bit. Some improvements were small (like creating a set of HTML component primitives), and some were big (like refactoring expressions and allowing custom component classes).

I’m not going to go over all these changes, but I’d like to show you what that example application looks like. It begins with a server script:

use Silex\Application;
use Silex\Provider\SessionServiceProvider;
use Symfony\Component\HttpFoundation\Request;

use App\Component\AddTask;
use App\Component\Page;
use App\Component\TaskList;

$app = new Application();
$app->register(new SessionServiceProvider());

$app->get("/", (Request $request) => {
    $session = $request->getSession();

    $tasks = $session->get("tasks", []);

    return (
        <Page>
            <TaskList>{$tasks}</TaskList>
            <AddTask></AddTask>
        </Page>
    );
});

$app->post("/add", (Request $request) => {
    $session = $request->getSession();

    $id = $session->get("id", 0);
    $tasks = $session->get("tasks", []);

    $tasks[] = [
        "id" => $id++,
        "text" => $request->get("text"),
    ];

    $session->set("id", $id);
    $session->set("tasks", $tasks);

    return $app->redirect("/");
});

$app->get("/remove/{id}", (Request $request, $id) => {
    $session = $request->getSession();

    $tasks = $session->get("tasks", []);

    $tasks = array_filter($tasks, ($task) => {
        return $task["id"] !== (int) $id;
    });

    $session->set("tasks", $tasks);

    return $app->redirect("/");
});

$app->run();

This is from server.pre

The application is built on top of Silex, which is a neat micro-framework. In order to load this server script, I have an index file:

require __DIR__ . "/../vendor/autoload.php";

Pre\Plugin\process(__DIR__ . "/../server.pre");

This is from public/index.php

…And I serve this with:

php -S localhost:8080 -t public public/index.php

I haven’t yet tried running this through a web server, like Apache or Nginx. I believe it would run in much the same way.

The server scripts begins with me setting up the Silex server. I define a few routes, the first of which fetches an array of tasks from the current session. If that array hasn’t been defined, I default it to an empty array.

I pass these directly, as children of the TaskList component. I’ve wrapped this, and the AddTask component, inside a Page component. The Page component looks like this:

namespace App\Component;

use InvalidArgumentException;

class Page
{
    public function render($props)
    {
        assert($this->hasValid($props));

        { $children } = $props;

        return (
            "<!doctype html>".
            <html lang="en">
                <body>
                    {$children}
                </body>
            </html>
        );
    }

    private function hasValid($props)
    {
        if (empty($props["children"])) {
            throw new InvalidArgumentException("page needs content (children)");
        }

        return true;
    }
}

This is from app/Component/Page.pre

This component isn’t strictly necessary, but I want to declare the doctype and make space for future header things (like stylesheets and meta tags). I destructure the $props associative array (using some pre/collections syntax) and pass this into the <body> element.

Then there’s the TaskList component:

namespace App\Component;

class TaskList
{
    public function render($props)
    {
        { $children } = $props;

        return (
            <ul className={"task-list"}>
                {$this->children($children)}
            </ul>
        );
    }

    private function children($children)
    {
        if (count($children)) {
            return {$children}->map(($task) => {
                return (
                    <Task id={$task["id"]}>{$task["text"]}</Task>
                );
            });
        }

        return (
            <span>No tasks</span>
        );
    }
}

This is from app/Component/TaskList.pre

Elements can have dynamic attributes. In fact, this library doesn’t support them having literal (quoted) attribute values. They’re complicated to support, in addition to these dynamic attribute values. I’m defining the className attribute; which supports a few different formats:

  1. A literal value expression, like "task-list"
  2. An array (or key-less pre/collection object), like ["first", "second"]
  3. An associative array (or keyed pre/collection object), like ["first" => true, "second" => false]

This is similar to the className attribute in ReactJS. The keyed or object form uses the truthiness of values to determine whether the keys are appended to the element’s class attribute.

All the default elements support non-deprecated and non-experimental attributes defined in the Mozilla Developer Network documentation. All elements support an associative array for their style attribute, which uses the kebab-case form of CSS style keys.

Finally, all elements support data- and aria- attributes, and all attribute values may be functions which return their true values (as a form of lazy loading).

Let’s look at the Task component:

namespace App\Component;

use InvalidArgumentException;

class Task
{
    public function render($props)
    {
        assert($this->hasValid($props));

        { $children, $id } = $props;

        return (
            <li className={"task"}>
                {$children}
                <a href={"/remove/{$id}"}>remove</a>
            </li>
        );
    }

    private function hasValid($props)
    {
        if (!isset($props["id"])) {
            throw new InvalidArgumentException("task needs id (attribute)");
        }

        if (empty($props["children"])) {
            throw new InvalidArgumentException("task needs text (children)");
        }

        return true;
    }
}

This is from app/Component/Task.pre

Each task expects an id defined for each task (which server.pre defines), and some children. The children are used for the textual representation of a task, and are defined where the tasks are created, in the TaskList component.

Finally, let’s look at the AddTask component:

namespace App\Component;

class AddTask
{
    public function render($props)
    {
        return (
            <form method={"post"} action={"/add"} className={"add-task"}>
                <input name={"text"} type={"text"} />
                <button type={"submit"}>add</button>
            </form>
        );
    }
}

This is from app/Component/AddTask.pre

This component demonstrates a self-closing input component, and little else. Of course, the add and remove functionality needs to be defined (in the server script):

$app->post("/add", (Request $request) => {
    $session = $request->getSession();

    $id = $session->get("id", 1);
    $tasks = $session->get("tasks", []);

    $tasks[] = [
        "id" => $id++,
        "text" => $request->get("text"),
    ];

    $session->set("id", $id);
    $session->set("tasks", $tasks);

    return $app->redirect("/");
});

$app->get("/remove/{id}", (Request $request, $id) => {
    $session = $request->getSession();

    $tasks = $session->get("tasks", []);

    $tasks = array_filter($tasks, ($task) => {
        return $task["id"] !== (int) $id;
    });

    $session->set("tasks", $tasks);

    return $app->redirect("/");
});

This is from server.pre

We’re not storing anything in a database, but we could. These components and scripts are all that there is to the example application. It’s not a huge example, but it does demonstrate various important things, like component nesting and iterative component rendering.

It’s also a good example of how some of the different Pre macros work well together; particularly short closures, collections, and in certain cases async/await.

Here’s a gif of it in action.

Gif of this in action

Watch Learn Database and Security Techniques with PHP
Learning more about PHP, one video at a time

Phack

While I was working on this project, I rediscovered a project called Phack, by Sara Golemon. It’s a similar sort of project to Pre, which seeks to transpile a PHP superset language (in this case, Hack) into regular PHP.

The readme lists the Hack features that Phack aims to support, and their status. One of those features is XHP. If you’ve always wanted to write Hack code, but still use standard PHP tools; I recommend checking it out. I’m a huge fan of Sara and her work, so I’ll definitely be keeping an eye on Phack.

Summary

This has been a whirlwind tour of simple compiler creation. We learned how to build a basic state-machine compiler, and how to get it to support HTML-like syntax inside regular PHP syntax. We also looked at how that might work in an example application.

I’d like to encourage you to try this out. Perhaps you’d like to add your own syntax to PHP – which you could do with Pre. Perhaps you’d like to change PHP radically. I hope this tutorial has demonstrated one way to do that, well enough that you feel up to the challenge. Remember: creating compilers doesn’t take a huge amount of knowledge or training. Just simple string manipulation, and some trial and error.

Let us know what you come up with!

Sponsors