Matching template expression

Before a bout of covid I had a personal project on the go, which took JavaScript code as a text file and syntax highlighted it.

Coming out of covid with brain fog, I abandoned it.

One of the issues I didn’t figure out was how to syntax highlight template string expressions.

I have revisited it with the following code:

function matchExpression (source) {
  const matches = [];
  // create chunks from string, splitting on braces e.g.
  // source: `${fruit.map((fruit, i) => { return `${i}:...`
  // result: ['${', 'fruit.map((fruit, i) => ', '{', 'return `', '${', 'i', '}', ':'....]
  const chars = source.split(/(\${|[}{])/).filter(x => x);
  let exprStart = 0;
  let depth = 0;

  for (const [i, char] of chars.entries()) {
    // test if char matches opening braces for expression
    if (char === '${') {
      depth += 1;
      // if depth is 1 expression starts here
      if (depth === 1) {
        exprStart = i + 1;
      }
    }
    // if inside of that expression
    if (depth >= 1) {
      if (char === '{') {
        depth += 1;
      } else if (char === '}') {
        depth -= 1;

        // have we balanced the opening braces for expression
        if (depth === 0) {
          matches.push({ expr: chars.slice(exprStart, i).join('') });
        }
      }
    } else {
      // push the non expression chars
      matches.push({ text: char });
    }
  }

  return matches;
}

I split the text on braces, balance those braces using a depth variable, and join the chunks of each expression back together.

For example given an input of:

some t{ext here ${items.map((item, i) => { return `${i}:${item}` }).join('\n')} something here ${varname} text here

My output is

Array(7) [
  { text: 'some t' },
  { text: '{' },
  { text: 'ext here ' },
  {
    expr: "items.map((item, i) => { return `${i}:${item}` }).join('\n')"
  },
  { text: ' something here  ' },
  { expr: 'varname' },
  { text: ' text here' }
]

My thinking is the expressions could then be passed recursively into the highlighter.

I know it’s flawed: a rogue brace inside an inner template string will throw it off, e.g.
...{ return `} ${i}:${item}` }...

I tried asking ChatGPT to do a refactor, which was an experience.

Just wondered if anyone had any input. I’m sure what I have done is over-engineered.

codepen here

Is it… legal to have a } inside of a template literal’s variable declaration? Would depth ever be greater than 1? (if not… you just match ${.*?})
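For what it’s worth, depth can go above 1: an arrow function body or an object literal inside the expression brings its own braces. Here is a quick sketch of where a lazy ${.*?} match stops. The two sample strings are made up for illustration.

```javascript
// Lazy match from "${" to the first "}" that follows.
const lazy = /\$\{(.*?)\}/g;

// Made-up inputs: one with flat expressions, one with a block body inside the expression.
const simple = 'Hi ${name}, bye ${other}';
const nested = 'Total: ${items.map((item) => { return item.price; }).length}';

const simpleMatches = [...simple.matchAll(lazy)].map((m) => m[1]);
const nestedMatches = [...nested.matchAll(lazy)].map((m) => m[1]);

console.log(simpleMatches); // [ 'name', 'other' ]
console.log(nestedMatches); // [ 'items.map((item) => { return item.price; ' ]
```

The second capture is cut off at the first }, so a plain lazy match only holds up when the expression has no braces of its own.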

If you split on ${, you find all of the template variables you need to handle. (A template variable cannot contain a template variable, or another template literal).

If the result has a length of 1, there were no variables to handle.
Otherwise, from chunk 2 onward, for each chunk:
Split on }.
The last element of the result is outside of the template variable, and should be handled as such.
The rest of the result should be re-joined on } and parsed as an expression. The result of that parse then… becomes a node in the output in a tree-like fashion(?)
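
Those steps could be sketched roughly as below. The function name splitTemplate and the { text } / { expr } node shape (borrowed from the first post) are assumptions for illustration, and the sketch only holds up when the text after an expression contains no } of its own and the expression contains no nested ${.

```javascript
// Hypothetical helper: split on "${", then split each later chunk on "}".
function splitTemplate(source) {
  const chunks = source.split('${');
  // Whatever precedes the first "${" is plain text.
  const nodes = [{ text: chunks[0] }];

  for (const chunk of chunks.slice(1)) {
    const pieces = chunk.split('}');
    // The last piece sits outside the template variable.
    const trailing = pieces.pop();
    // Everything before it, re-joined on "}", is the expression.
    nodes.push({ expr: pieces.join('}') });
    nodes.push({ text: trailing });
  }
  return nodes;
}

console.log(splitTemplate('Hello ${user.name}, you have ${count} items'));
```

Each expr node is where a recursive parse of the expression would hang its result.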

Your original example appears to be an illegal template literal, which might be causing you problems?
(Or I’m reading the spec wrong? But it reads to me as though a template literal cannot contain a template literal, nor can a template variable contain another template variable.)

I don’t know if this helps, but the following templateStrg

const items = ['mobile', 'tablet', 'laptop']
const variableName = "'my variable'"

const templateStrg = `A list of items: ${items.map((item, i) => { return `{${i+1}:${item}}`; }).join(', ')}, and a variable ${variableName}, the end.`;

Would output

A list of items: {1:mobile}, {2:tablet}, {3:laptop}, and a variable 'my variable', the end.

You can see we have a bit of nesting going on in that string.

BTW, I have been exploring an alternative in the form of nested regular expression. Post to come.

I have got my Mastering Regular Expressions out and have been exploring nested constructs. I’m sure this is trivial to some here.

To start with, just a test on nested braces:

const stringToMatch = '{a, b, {c, {d, e}, f} g, h}'

const level1 = new RegExp('\\{(?:[^{}])*\\}')
console.log(level1.source) // \{(?:[^{}])*\}

const level2 = new RegExp(`\\{(?:[^{}]|${level1.source})*\\}`)
console.log(level2.source) // \{(?:[^{}]|\{(?:[^{}])*\})*\}

const level3 = new RegExp(`\\{(?:[^{}]|${level2.source})*\\}`)
console.log(level3.source) // \{(?:[^{}]|\{(?:[^{}]|\{(?:[^{}])*\})*\})*\}

console.log(stringToMatch.match(level1)[0]) // {d, e}
console.log(stringToMatch.match(level2)[0]) // {c, {d, e}, f}
console.log(stringToMatch.match(level3)[0]) // {a, b, {c, {d, e}, f} g, h}

Essentially the above is a bit of manual recursion: each level uses the pipe (alternation) operator to embed the previous level’s regex source as an alternative.

Constructing that by hand is a bit laborious, so I wrote a function to do this.

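// part1 opens a level up to the alternation point, part2 closes it
function buildNestedRegex([part1, part2], depth = 1) {
  // innermost level: no nested alternative
  let regex = new RegExp(`${part1}${part2}`);
  // wrap the previous level inside a new one for each extra level of depth
  while (--depth) {
    regex = new RegExp(`${part1}|${regex.source}${part2}`)
  }
  return regex.source
}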

This can then be used as follows:

const level3 = buildNestedRegex(['\\{(?:[^{}]', ')*\\}'], 3)

console.log(stringToMatch.match(level3)[0]) // {a, b, {c, {d, e}, f} g, h}

Putting this into practice for a template expression match

const [templateOpen, templateClose] = ['\\$\\{(?<expr>(?:[^{}]|', ')*)\\}']
const braces = ['\\{(?:[^{}]', ')*\\}']
const templateExpr = new RegExp(`${templateOpen}${buildNestedRegex(braces, 3)}${templateClose}`, 'g')

const testStrg = "An expression to match ${items.map((item, i) => { return `${i}:${item}`; }).join('\\n')} and a second ${variableName} the end."

console.log([...testStrg.matchAll(templateExpr)]
    .map((match, i) => `expression${i+1}: ${match?.groups?.expr}`)
    .join('\n'))

Outputs:

expression1: items.map((item, i) => { return `${i}:${item}`; }).join('\n')
expression2: variableName

Codepen here

And regex101 test here with the generated regex
https://regex101.com/r/5lMHeW/1

Still very much a work in progress. The tricky bit will be matching the rest of the template string e.g. outer backticks and everything inside (including inner backticks).

I’m not going to pretend that regex makes ANY sense to me whatsoever, it has devolved into a mess of symbols.

Wouldn’t it have been easier to just walk the string to find the index of the matching closing brace?

const output = []; // Assumes `input` holds the contents of the template string.
const parts = input.split("${");
output.push(parts.shift()); // Catch whatever comes before the first ${.
for (const part of parts) {
  let depth = 1;
  let index = 0;
  // Walk until the brace opened by ${ is balanced, or the part runs out.
  while (depth && index < part.length) {
    switch (part[index]) {
      case '{': depth++; break;
      case '}': depth--; break;
    }
    index++;
  }
  output.push(part.substring(0, index - 1)); // The expression. Maybe recurse?
  output.push(part.substring(index)); // Between-variable bits.
}

EDIT: But that would catch templates-inside-templates, which is a stupid design decision IMO… sigh. You’d have to not split, and walk the original string in its entirety, because you can put literals inside literals and make parsing infinitely more complex. (Because where does the literal end? You don’t know by walking the string, because you can’t just look for a backtick: the literal begins AND ends with the same marker.)

‘mess’ is a bit harsh (it was thought out :slight_smile: ), but I get your point. It was an itch that needed scratching. Prism.js uses a similar regex for template strings

`(?:\\[\s\S]|\$\{(?:[^{}]|\{(?:[^{}]|\{[^}]*\})*\})+\}|(?!\$\{)[^\\`])*`

It was making my head hurt, trying to break it down, which is why I decided to do my own research into nested expressions.

But that would catch templates-inside-templates, which is a stupid design decision IMO

Not sure I agree with that. I find them useful for generating HTML e.g.

const listOfPeople = (people) => (`
    <h2>List of people</h2>
    <ul>
        ${people.map((person, i) => `<li>${i+1}: ${person}</li>`).join('\n')}
    </ul>
`)

const peopleHTML = listOfPeople(['Bob', 'Sue', 'Rita'])

Your code is more concise and I like the use of switch, whereas mine had all sorts of nested if/elses. I will give it a go.

Well I know it’s structured, but I meant to try and read it :stuck_out_tongue:

I’m going to be honest, I don’t like putting that much logic in the middle of an output string.

It is a common pattern, but I do agree to some extent. If it got any more complex than that, I would probably refactor that inner code into its own function.

${people.map(getListItem)} or something similar

return `
<h2>A List.</h2>
<ul>
${ people.map((person) => { `<li> ${
    if (person.name.contains("ohn") {
        doThisStuff(person.name)
    } else {
       for (i = 0; i < person.name.length; i += 2) {
           person.name[i] = person.name[i].toUpperCase();
        }
     };
     //Do You
     //Remember That
     // You're inside two template literals now?

My head is hurting now @m_hutley,

Not sure how well it would solve the problem but, as I say, I would move that if/else code into its own function.

`${ people.map(getPerson) }`

edit: I am possibly missing the point, in that I would have to be able to match your version successfully

peoplestring = getPeople()
`Some stuff ${peoplestring}`

Personally, I find that infinitely more readable, especially if you’ve got more than one variable.

But I’m arguing semantics now and have derailed the conversation. Apologies.

No worries :smiley:

It’s food for thought.

It comes back to the question of how you actually parse a string that contains template literals. An FSA (finite-state automaton), presumably.

(Pseudocoding at this point)

function tokenizeCode(input) {
  const finalout = [];
  let output;
  while (input.length) {
    switch (input[0]) {
      case '`':
        [input, output] = parseLiteral(input.substring(1));
        finalout.push(output);
        break;
      // ...
    }
  }
}
function parseLiteral(input) {
  const finaloutput = [];
  let currentliteral = "";
  let output;
  while (input.length) {
    switch (input[0]) {
      case '$':
        if (input[1] === '{') { // We found a template variable.
          finaloutput.push(currentliteral); // Terminate the current literal string.
          currentliteral = ""; // Reset it for what comes after.
          // Go figure out the token structure of the thing inside the expression.
          // parseExpression (not written yet) has to hand back the rest of the input starting
          // at the closing }, which the substring(1) at the bottom of the loop then skips.
          [input, output] = parseExpression(input.substring(2));
          finaloutput.push(output); // Add that token structure to our tree.
        } else {
          currentliteral += input[0]; // It was just a $. Throw it in as a literal character.
        }
        break;
      case '`': // We are at the end of our template.
        finaloutput.push(currentliteral); // Write the end of our template to the output.
        // Return the output to tokenizeCode, along with the part of the string we didn't consume.
        return [input.substring(1), finaloutput];
      default:
        currentliteral += input[0]; // We're just puttering along, adding literal characters to the token.
        break;
    }
    input = input.substring(1); // Move on to the next character.
  } // EndWhile
  throw new Error("Unexpected End Of File"); // Because we got to the end of the input while in the middle of a literal.
} // EndFunc
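
To make the idea concrete, here is a minimal runnable sketch of the same mutual recursion. It is only a guess at how the missing parseExpression could look: both functions take the source plus a start index (rather than slicing the string), the node shapes { template }, { expr }, { text } and { code } are made up, and quoted strings, comments and regex literals inside expressions are ignored entirely.

```javascript
// Sketch only. Both functions return [node, indexAfterTheClosingDelimiter].

function parseLiteral(src, start) {
  // `start` points just past the opening backtick.
  const parts = [];
  let text = '';
  let i = start;
  while (i < src.length) {
    const ch = src[i];
    if (ch === '\\') {
      // Escaped character: keep the backslash and the next char as text.
      text += src.slice(i, i + 2);
      i += 2;
    } else if (ch === '$' && src[i + 1] === '{') {
      if (text) parts.push({ text });
      text = '';
      const [expr, next] = parseExpression(src, i + 2);
      parts.push(expr);
      i = next;
    } else if (ch === '`') {
      // Closing backtick: this literal is done.
      if (text) parts.push({ text });
      return [{ template: parts }, i + 1];
    } else {
      text += ch;
      i += 1;
    }
  }
  throw new Error('Unexpected end of input inside template literal');
}

function parseExpression(src, start) {
  // `start` points just past "${"; scan to the "}" that balances it.
  const parts = [];
  let code = '';
  let depth = 1;
  let i = start;
  while (i < src.length) {
    const ch = src[i];
    if (ch === '`') {
      // A nested template literal: hand it to parseLiteral and carry on after it.
      if (code) parts.push({ code });
      code = '';
      const [literal, next] = parseLiteral(src, i + 1);
      parts.push(literal);
      i = next;
      continue;
    }
    if (ch === '{') depth += 1;
    if (ch === '}') {
      depth -= 1;
      if (depth === 0) {
        if (code) parts.push({ code });
        return [{ expr: parts }, i + 1];
      }
    }
    code += ch;
    i += 1;
  }
  throw new Error('Unexpected end of input inside expression');
}

// A literal with a rogue "}" inside an inner template string.
const source = '`a ${xs.map(x => `} ${x}`)} b`';
const [tree] = parseLiteral(source, 1);
console.log(JSON.stringify(tree, null, 2));
```

Because the inner literal is consumed by parseLiteral, the } inside it never reaches the depth counter in parseExpression.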

I have got something working. I’m not convinced it is particularly robust. It is very much dependent on the order in which matches and parses are done, strings before templates etc.

It could certainly do with some refactoring, but I’m frazzled right now.

I match the complete template string with this (took some trial and error)

/(?<templateString>`[\s\S]*?`(?=[\n\r])|`[\s\S]*?`)/

regex101 example here

I then strip the backticks off each end and pass it back into the highlighter parser, where it is then picked up by the expression match

/(?<expressionOpen>\$\{)(?<expression>(?:[^{}]|\{(?:[^{}]|\{(?:[^{}]|\{(?:[^{}])*\})*\})*\})*)(?<expressionClosed>\})/

This in turn passes the expressions into the highlight parser, and so on.

Full horror show here :biggrin:
https://assets.codepen.io/3351103/js-regexes.js

My approach of relying on regexes is possibly a bit lightweight. For instance, my template literal regex wrongly matches the following

return addSpans(type, '`' + highlightJS(escapeTags(value.slice(1, -1))) + '`');
// matches
... '  →`' + highlightJS(escapeTags(value.slice(1, -1))) + '`← ' ...

I only get away with it because the preceding strings regex grabs these '`' first.
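
A small sketch of that ordering effect, for illustration. The template regex is the one quoted above; the single-quoted string pattern is a simplified stand-in I am assuming here, not the one in js-regexes.js.

```javascript
// The template-literal regex quoted above.
const templateString = /(?<templateString>`[\s\S]*?`(?=[\n\r])|`[\s\S]*?`)/;

const line = "return addSpans(type, '`' + highlightJS(escapeTags(value.slice(1, -1))) + '`');";

// On its own, the template regex pairs up the two quoted backticks.
const falseMatch = line.match(templateString).groups.templateString;
console.log(falseMatch);

// Assumed, simplified string pattern placed first in the alternation,
// so it consumes each '`' before the template branch gets a look at it.
const stringsFirst = /(?<string>'(?:\\.|[^'\\])*')|(?<templateString>`[\s\S]*?`)/g;
const kinds = [...line.matchAll(stringsFirst)].map(
  (m) => (m.groups.string !== undefined ? 'string' : 'templateString')
);
console.log(kinds); // [ 'string', 'string' ]
```

With the string branch first there is no template match at all on that line, which is why the order of the patterns matters so much.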

Looking at the public domain versions out there, from what I can tell they are verging on being linters and analyse the structure of the code: is it a for … loop, an if/else block, etc.

Maybe an approach would be to have a series of functions, like your parseLiteral, which are run in sequence rather than solely relying on alternated regexes.

Basically what my code is leaning towards is… well, frankly, recreating a JavaScript engine :stuck_out_tongue: or at least a tokenizer for JavaScript code, which would normally be part of a compiler. The trick is you never know how deep a function chain like the one I created is going to go… if you’ve got a literal inside an expression inside a literal inside an expression inside… ad infinitum…

Maybe we can avoid some of the chain problem by… literally reverting to an FSA.
(Now I’m gonna go even more pseudocode/theory. Note that I am NOT saying that this is a viable strategy. It’s a nightmare.)…

let transitions = {
  "EXPR": { "`": "LITRL" /* .... */ }, // Every state transition for an expression. Good luck.
  "LITRL": { "`": "ENDSTATE", "$": (input) => (input[1] == "{") ? "EXPR" : "NOTRANS" /* .... */ }
};
function tokenizeCode(input) { // Input: String
  let finalout = [];
  let statestack = [];
  // Assume expression.
  let currentstate = "EXPR";
  // Load state transitions for the current state.
  let currenttransitions = transitions[currentstate];
  let currentcontent = "";
  while (input.length) { // Our main loop.
    if (currenttransitions.hasOwnProperty(input[0])) {
      let transition = currenttransitions[input[0]];
      let nextstate = (typeof transition == "function") ? transition(input) : transition;
      if (nextstate == "NOTRANS") { currentcontent += input[0]; }
      else if (nextstate == "ENDSTATE") { // We finished that. Move back.
        finalout.push({"state": currentstate, "content": currentcontent});
        currentcontent = "";
        currentstate = statestack.pop();
        currenttransitions = transitions[currentstate];
      } else {
        finalout.push({"state": currentstate, "content": currentcontent});
        currentcontent = "";
        statestack.push(currentstate); // Remember where to come back to.
        currentstate = nextstate;
        currenttransitions = transitions[currentstate];
      }
    } else { currentcontent += input[0]; }
    input = input.substring(1);
  } // endwhile
...
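
For comparison, a cut-down runnable sketch of the state-stack idea with just the two states. The state names follow the pseudocode above; everything else is an assumption added to make it run: the brace-depth counter, the { state, content } token shape, dropping the delimiters from the output, and ignoring quoted strings, comments and regex literals.

```javascript
// Sketch only: a two-state scanner with an explicit stack instead of recursion.
function tokenizeCode(input) {
  const tokens = [];
  const stack = [];   // suspended { state, depth } frames, innermost last
  let state = 'EXPR'; // assume we start in ordinary code
  let depth = 0;      // unbalanced "{" seen so far in the current EXPR
  let content = '';

  const flush = () => {
    if (content) tokens.push({ state, content });
    content = '';
  };
  const enter = (next) => {
    flush();
    stack.push({ state, depth });
    state = next;
    depth = 0;
  };
  const leave = () => {
    flush();
    ({ state, depth } = stack.pop());
  };

  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (state === 'EXPR') {
      if (ch === '`') {
        enter('LITRL');
      } else if (ch === '}' && depth === 0 && stack.length) {
        leave(); // this "}" closes the "${" that opened the expression
      } else {
        if (ch === '{') depth += 1;
        if (ch === '}' && depth > 0) depth -= 1;
        content += ch;
      }
    } else if (ch === '\\') {
      // LITRL: keep an escaped character as literal text.
      content += ch + (input[i + 1] ?? '');
      i += 1;
    } else if (ch === '$' && input[i + 1] === '{') {
      enter('EXPR');
      i += 1; // skip the "{"
    } else if (ch === '`') {
      leave(); // closing backtick
    } else {
      content += ch;
    }
  }
  flush();
  if (stack.length) throw new Error('Unexpected end of input');
  return tokens;
}

const tokens = tokenizeCode('const s = `a ${xs.map(x => `<${x}>`).join("")} b`;');
console.log(tokens.map((t) => `${t.state}: ${JSON.stringify(t.content)}`).join('\n'));
```

The stack depth grows with the nesting of the input, so literal-inside-expression-inside-literal chains cost a stack frame each rather than a new function.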

PS: Google tells me this exists: https://github.com/lydell/js-tokens. May be usable…