Splitting a string of JavaScript code into tokens

To start with, my PHP is pretty poor.

My goal is to take a JavaScript code block (as a string) and wrap strings, comments, keywords, built-ins, etc. in span tags with the appropriate class name. I want to do some colouring in :grinning_face_with_smiling_eyes:

I need to be able to isolate those parts, so for instance I don’t want keywords matching ‘this’ in a string, or ‘for’ in a comment. Order seems to be important here.

I’m looking at preg_split, which is actually doing quite a nice job. The downside, though, is that I need a little extra data with each piece in the form of a class name; a tuple, I think, is what I am after.

So instead of getting this

  [1]=>
  string(5) "const"
  [2]=>
  string(6) " x =  "
  [3]=>
  string(2) "10"

I would end up with something like this

  [1]=>
  array(2) ["const", "js_keyword"]
  [2]=>
  array(1) [" x =  "]
  [3]=>
  array(2) ["10", "js_number"]

I’m thinking preg_split isn’t going to cut it. A preg_split_callback would have been nice if it existed, but it illustrates where I am going with this.

In the end, I want to re-assemble with something like array_reduce, wrapping the returned strings in spans if index 1 exists.
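The [text, class name] pairs and the array_reduce re-assembly described above could be sketched roughly as follows. This is only one possible approach, not necessarily where the thread ends up: it swaps preg_split for preg_match_all with named groups and PREG_OFFSET_CAPTURE. The function names tokenize and render, the shortened $patterns list, and the htmlspecialchars step are all assumptions made for illustration.

```php
<?php

// Hypothetical, trimmed-down pattern set: class name => regex fragment.
// Order matters: strings and comments are tried before numbers and keywords.
$patterns = [
    'js_string'  => '"[^"]*"|\'[^\']*\'',
    'js_comment' => '\/\/[^\n]*|\/\*[\s\S]*?\*\/',
    'js_number'  => '\b\d+\b',
    'js_keyword' => '\b(?:const|let|for|if|this)\b',
];

// Split $code into [text] or [text, className] pairs.
function tokenize(string $code, array $patterns): array
{
    // Wrap each fragment in a named group so we can tell which one matched.
    $alternatives = [];
    foreach ($patterns as $class => $fragment) {
        $alternatives[] = "(?<$class>$fragment)";
    }
    $regex = '/' . implode('|', $alternatives) . '/';

    preg_match_all($regex, $code, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $tokens = [];
    $cursor = 0;
    foreach ($matches as $match) {
        [$text, $offset] = $match[0];
        if ($offset > $cursor) {
            // Plain text between two matches carries no class name.
            $tokens[] = [substr($code, $cursor, $offset - $cursor)];
        }
        foreach (array_keys($patterns) as $class) {
            // Groups that did not take part are either absent or have offset -1.
            if (isset($match[$class]) && $match[$class][1] !== -1) {
                $tokens[] = [$text, $class];
                break;
            }
        }
        $cursor = $offset + strlen($text);
    }
    if ($cursor < strlen($code)) {
        $tokens[] = [substr($code, $cursor)];
    }
    return $tokens;
}

// Re-assemble: wrap a token in a span only if index 1 (the class name) exists.
function render(array $tokens): string
{
    return array_reduce(
        $tokens,
        function (string $html, array $token): string {
            $text = htmlspecialchars($token[0]);
            return $html . (isset($token[1])
                ? "<span class=\"{$token[1]}\">$text</span>"
                : $text);
        },
        ''
    );
}

echo render(tokenize('if (i < 10) // for later', $patterns));
```

Because the comment alternative is tried before the keyword one, the "for" inside the comment is never seen as a keyword, which is the ordering concern raised earlier.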

This is a sample of what I am playing with

<?php

$codeTypes = [
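    // Inside a character class \2 is not a backreference, so [^\2] only excluded
    // the control character \x02; [\s\S] says "any character" explicitly.
    'js_string' => '((["\'`])[\s\S]+?\2)',
    
    // Lazy quantifier so one block comment cannot swallow everything up to a later */
    'js_comment' => '((?<!:)\/\/.*|\/\*[\s\S]*?\*\/)',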

    'js_keyword' => '(\babstract\b|\barguments\b|\bawait\b|\bboolean\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b|\bchar\b|\bclass(?!=)\b|\bconst\b|\bcontinue\b|\bdebugger\b|\bdefault\b|\bdelete\b|\bdo\b|\bdouble\b|\belse\b|\benum\b|\beval\b|\bexport\b|\bextends\b|\bfalse\b|\bfinal\b|\bfinally\b|\bfloat\b|\bfor\b|\bfunction\b|\bgoto\b|\bif\b|\bimplements\b|\bimport\b|\bin\b|\binstanceof\b|\bint\b|\binterface\b|\blet\b|\blong\b|\bnative\b|\bnew\b|\bnull\b|\bpackage\b|\bprivate\b|\bprotected\b|\bpublic\b|\breturn\b|\bshort\b|\bstatic\b|\bsuper\b|\bswitch\b|\bsynchronized\b|\bthis\b|\bthrow\b|\bthrows\b|\btransient\b|\btrue\b|\btry\b|\btypeof\b|\bvar\b|\bvoid\b|\bvolatile\b|\bwhile\b|\bwith\b|\byield\b)'
];

$sampleHtml = <<<END
        const x = 10 // the number 10
        const entries = Object.entries({x: 2, y: 6})
        
        for(let i = 0; i < x; i++) {
            if (i % 2 == 0) console.log('this i is even')
        }
        /*
            this is a 
            comment block
            not an Object
        */
        const elements = document.querySelectorAll('.my-elements') // a 'string'

        class MyClass {

            constructor(x, y) {
                this.x = x;
                this.y = y
            }
        }
END;

var_dump(preg_split('/' . implode('|', $codeTypes) . '/', $sampleHtml, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE));

Output:

array(31) {
  [0]=>
  string(8) "        "
  [1]=>
  string(5) "const"
  [2]=>
  string(8) " x = 10 "
  [3]=>
  string(17) "// the number 10
"
  [4]=>
  string(9) "
        "
  [5]=>
  string(5) "const"
  [6]=>
  string(59) " entries = Object.entries({x: 2, y: 6})
        
        "
  [7]=>
  string(3) "for"
  [8]=>
  string(1) "("
  [9]=>
  string(3) "let"
  [10]=>
  string(35) " i = 0; i < x; i++) {
            "
  [11]=>
  string(2) "if"
  [12]=>
  string(26) " (i % 2 == 0) console.log("
  [13]=>
  string(16) "'this i is even'"
  [14]=>
  string(1) "'"
  [15]=>
  string(22) ")
        }
        "
  [16]=>
  string(92) "/*
            this is a 
            comment block
            not an Object
        */"
  [17]=>
  string(10) "
        "
  [18]=>
  string(5) "const"
  [19]=>
  string(38) " elements = document.querySelectorAll("
  [20]=>
  string(14) "'.my-elements'"
  [21]=>
  string(1) "'"
  [22]=>
  string(2) ") "
  [23]=>
  string(14) "// a 'string'
"
  [24]=>
  string(11) "

        "
  [25]=>
  string(5) "class"
  [26]=>
  string(63) " MyClass {

            constructor(x, y) {
                "
  [27]=>
  string(4) "this"
  [28]=>
  string(25) ".x = x;
                "
  [29]=>
  string(4) "this"
  [30]=>
  string(32) ".y = y
            }
        }"
}

I am aware of highlight.js, which is some very clever coding, but I have got my teeth into this now, and rolling my own saves that extra dependency.

Advice would be appreciated.

I think I might have done it?! I reuse the same regex from the split inside preg_replace_callback and check which capture group matched.

$parts = preg_split('/' . implode('|', $codeTypes) . '/', $sampleHtml, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

$result = preg_replace_callback(
    '/' . implode('|', $codeTypes) . '/', 
    function ($matches) {
        
        $classNames = ['', 'js_string', '', 'js_comment', 'js_keyword'];
        
        foreach($matches as $index => $match) {
            // compare with '' so that a match of "0" is not skipped as falsy
            if ($match !== '' && $classNames[$index] !== '') {
                return "<span class='$classNames[$index]'>$match</span>";
            }
        }
        // if no matches to type just return string as is
        return $matches[0];
    }, 
    $parts
);

echo implode('', $result);

Output:

        <span class='js_keyword'>const</span> x = 10 <span class='js_comment'>// the number 10
</span>
        <span class='js_keyword'>const</span> entries = Object.entries({x: 2, y: 6})
        
        <span class='js_keyword'>for</span>(<span class='js_keyword'>let</span> i = 0; i < x; i++) {
            <span class='js_keyword'>if</span> (i % 2 == 0) console.log(<span class='js_string'>'this i is even'</span>')
        }
        <span class='js_comment'>/*
            this is a 
            comment block
            not an Object
        */</span>
        <span class='js_keyword'>const</span> elements = document.querySelectorAll(<span class='js_string'>'.my-elements'</span>') <span class='js_comment'>// a 'string'
</span>

        <span class='js_keyword'>class</span> MyClass {

            constructor(x, y) {
                <span class='js_keyword'>this</span>.x = x;
                <span class='js_keyword'>this</span>.y = y
            }
        }

If there is a more elegant solution, or some refactoring I would be interested. Thanks.
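One thing visible in both outputs above is a stray quote after each string (pieces [14] and [21] in the dump, and the extra ' after each js_string span). A likely cause is that PREG_SPLIT_DELIM_CAPTURE returns every capture group in the delimiter, including the inner quote group that the \2 backreference relies on. The sketch below compares the two shapes of pattern on a single line; the variable names are made up for the example, and neither pattern tries to handle escaped quotes.

```php
<?php

// Single-quoted PHP strings keep \2 intact, so PCRE sees a real backreference.
$withInnerGroup    = '/((["\'`]).+?\2)/';
// One alternative per quote type: no inner capture group, no backreference.
$withoutInnerGroup = '/("[^"]*"|\'[^\']*\'|`[^`]*`)/';

$code  = "console.log('this i is even')";
$flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;

// Every capture group of the delimiter is returned, so the inner quote
// group shows up as a separate piece right after the string itself.
$partsWith = preg_split($withInnerGroup, $code, -1, $flags);
// ["console.log(", "'this i is even'", "'", ")"]

$partsWithout = preg_split($withoutInnerGroup, $code, -1, $flags);
// ["console.log(", "'this i is even'", ")"]

var_dump($partsWith, $partsWithout);
```

If the string pattern were changed along these lines, the group numbering would shift, so the class-name list in the callback would presumably become ['', 'js_string', 'js_comment', 'js_keyword'].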

Horrific

The lack of lexical scoping, in particular for accessing functions, is driving me nuts. Is there a better way to avoid this sort of thing?

function ($matches) use ($splitCodeToParts, $addSpansToCode, $regex_codetypes_combined)

For instance, I’m having to shunt $addSpansToCode through to the preg_replace_callback inside $formatCodeWithin:

$regex_codetypes_combined = '/' . implode('|', $regex_codetypes) . '/';

$addSpansToCode = function(array $code_parts, string $regex_codetypes_combined) {

    $spanned_code_parts = preg_replace_callback(
        $regex_codetypes_combined,
        
        function ($matches) {

            $classnames = ['', 'js_string', '', 'js_comment', 'js_number', 'js_keyword', 'js_builtin'];

            foreach($matches as $index => $match) {
                // compare with '' so that a js_number match of "0" is not skipped as falsy
                if ($match !== '' && $classnames[$index] !== '') {
                    return "<span class='$classnames[$index]'>$match</span>";
                }
            }

            return $matches[0];
        },
        $code_parts
    );
    
    // combine and return spanned code parts
    return implode('', $spanned_code_parts);
};

$formatCodeWithin = function(string $html) use ($regex_codetypes_combined, $addSpansToCode) {

    // isolate code blocks from rest of html
    $regex_codeblock = "/(<code[^>]*>)([\s\S]+?)(<\/code>)/m";

    $splitCodeToParts = fn($code) =>
        preg_split($regex_codetypes_combined, $code, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

    return preg_replace_callback(
        $regex_codeblock,

        function ($matches) use ($splitCodeToParts, $addSpansToCode, $regex_codetypes_combined){

            [$match, $openTag, $code, $closedTag] = $matches;

            return $openTag . $addSpansToCode($splitCodeToParts($code), $regex_codetypes_combined) . $closedTag;
        },
        $html
    );
};

It makes it very difficult to keep functions short and sweet.
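For what it is worth, the use (...) lists only apply to variables, and closures stored in variables are variables. Functions declared with a plain function statement can be called from inside any other function without being imported, and arrow functions (PHP 7.4+) capture outer variables by value on their own. A minimal sketch, with wrapInSpan and highlightKeywords as made-up names and a shortened keyword list:

```php
<?php

// Hypothetical helper declared as a plain named function: it is callable
// from any scope, so no use (...) clause is needed to reach it.
function wrapInSpan(string $text, string $class): string
{
    return "<span class='$class'>$text</span>";
}

function highlightKeywords(string $code): string
{
    $class = 'js_keyword';

    return preg_replace_callback(
        '/\b(?:const|let|this)\b/',
        // The arrow function sees $class without a use (...) clause.
        fn(array $m): string => wrapInSpan($m[0], $class),
        $code
    );
}

echo highlightKeywords('const x = this.y');
// <span class='js_keyword'>const</span> x = <span class='js_keyword'>this</span>.y
```

The trade-off is that arrow functions are limited to a single expression, so longer callbacks still need function () use (...) or a method on a class.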

Thanks

One way to get rid of the use (...) clauses is to wrap it all in a class, I’d say:

final class JsHighlighter
{
    /** @var array<string, string> class name => regex fragment (the $regex_codetypes array from above) */
    private array $regex_codetypes;

    public function __construct(array $regex_codetypes)
    {
        $this->regex_codetypes = $regex_codetypes;
    }

    public function highlight(string $code)
    {
        // isolate code blocks from rest of html
        $regex_codeblock = "/(<code[^>]*>)([\s\S]+?)(<\/code>)/m";

        // arrow functions defined inside a method can use $this directly
        $splitCodeToParts = fn($code) =>
            preg_split($this->getCombinedCodeTypes(), $code, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

        return preg_replace_callback(
            $regex_codeblock,

            function ($matches) use ($splitCodeToParts) {
                [$match, $openTag, $code, $closedTag] = $matches;

                return $openTag . $this->addSpansToCode($splitCodeToParts($code)) . $closedTag;
            },
            $code
        );
    }

    private function addSpansToCode(array $code_parts): string
    {
        $spanned_code_parts = preg_replace_callback(
            $this->getCombinedCodeTypes(),
            function ($matches) {
                $classnames = ['', 'js_string', '', 'js_comment', 'js_number', 'js_keyword', 'js_builtin'];

                foreach ($matches as $index => $match) {
                    // compare with '' so that a js_number match of "0" is not skipped as falsy
                    if ($match !== '' && $classnames[$index] !== '') {
                        return "<span class='$classnames[$index]'>$match</span>";
                    }
                }

                return $matches[0];
            },
            $code_parts
        );

        // combine and return spanned code parts
        return implode('', $spanned_code_parts);
    }

    private function getCombinedCodeTypes(): string
    {
        return '/' . implode('|', $this->regex_codetypes) . '/';
    }
}

You could even make all these methods static, with the patterns stored in a class constant, so you don’t have to instantiate the class first.
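A rough sketch of what that static variant might look like. The class name StaticJsHighlighter and the shortened CODE_TYPES list are placeholders rather than the thread's actual patterns, and named groups are used here instead of the index-based class-name list.

```php
<?php

final class StaticJsHighlighter
{
    // Assumed, shortened pattern set (class name => regex fragment);
    // the thread's full $regex_codetypes would go here instead.
    private const CODE_TYPES = [
        'js_comment' => '\/\/.*',
        'js_number'  => '\b\d+\b',
        'js_keyword' => '\b(?:const|let|this)\b',
    ];

    public static function highlight(string $code): string
    {
        return preg_replace_callback(
            self::combinedRegex(),
            // A static closure has no $this but still sees self:: members.
            static function (array $m): string {
                foreach (array_keys(self::CODE_TYPES) as $class) {
                    if (isset($m[$class]) && $m[$class] !== '') {
                        return "<span class='$class'>{$m[0]}</span>";
                    }
                }
                return $m[0];
            },
            $code
        );
    }

    private static function combinedRegex(): string
    {
        // One named group per code type, joined in declaration order.
        $named = [];
        foreach (self::CODE_TYPES as $class => $fragment) {
            $named[] = "(?<$class>$fragment)";
        }
        return '/' . implode('|', $named) . '/';
    }
}

echo StaticJsHighlighter::highlight('const x = 10 // ten');
```

Calling it needs no instance, only StaticJsHighlighter::highlight($code). The cost is that the patterns are fixed at class level instead of being passed in.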


Thanks for the code, @rpkamp, nice one :+1:

I had a feeling that might be the answer. I will have a good look at that.

edit: Good name changes as well!!
