Highlighting JS

I am looking to import and parse markdown from a MongoDB database. I have already achieved that using a package called ‘marked’.

I previously wrote a parser in PHP to style the javascript codeblocks, and I am now re-writing that in JS.

Basically it wraps span tags around keywords, comments etc. e.g.

<span class='keyword'>const</span> x = <span class='string'>'A string'</span>

The following seems to work, but I wondered whether there is a slicker approach.

Sample text file as source:

const rx = /([0-9]+)|([a-z]+)/gmi

const highlightJS = (markup) => {
  return markup.replaceAll(tokensRx, (...args) => {
    // last index is named matches e.g. 
    // { string: undefined, comment: '// comment here', ...}
    const [matches] = args.slice(-1)

    for (const type in matches) {
      if (matches[type] !== undefined) {
        return `<span class='${type}'>${matches[type]}</span>`
      }
    }
  })
}

Javascript Parser

// Using named capture groups
const tokens = [
  '(?<strings>([\\u0027"`])[\\s\\S]*?(?<!\\u005C)\\2)',
  '(?<comments>(?<!:)\\u002F{2}.*|\\u002F\\u002A[\\s\\S]*?\\u002A\\u002F)',
  '(?<regex>(?<!\\/)\\/[^\\/]+\\/[a-zA-Z]{0,3})',
  '(?<spread>(\\.{3}))',
  '(?<props>(?<=\\w\\.)\\w+)',
  '(?<numbers>\\b\\d+(?:\\.\\d+)?\\b)',
  '\\b(?<keywords>abstract|arguments|await|boolean|break|byte|case|catch|char|class(?!\\s*=)|const|continue|debugger|default|delete|do|double|else|enum|eval|export|extends|false|final|finally|float|for|function|goto|if|implements|import|in|instanceof|int|interface|isNaN|let|long|native|new|null|package|private|protected|public|return|short|static|super|switch|synchronized|this|throw|throws|transient|true|try|typeof|undefined|var|void|volatile|while|with|yield)\\b',
  '\\b(?<builtIn>Object|Array|Function|String|Number|null|undefined|Symbol|BigInt)\\b',
  '(?<brackets>[\\{\\(\\[\\]\\)\\}])'
]

// join all regexes
const tokensRx = new RegExp(tokens.join('|'), 'gm')

const highlightJS = (markup) => {
  return markup.replaceAll(tokensRx, (...args) => {
    // the last argument holds the named groups, e.g.
    // { strings: undefined, comments: '// comment here', regex: undefined ...}
    const [matches] = args.slice(-1)
    
    for (const tokenType in matches) {
      if (matches[tokenType] !== undefined) {
        const span = document.createElement('span')
        
        span.className = tokenType
        span.textContent = matches[tokenType]
        return span.outerHTML
      }
    }
  })
}

Here is a codepen showing the output

I am aware of pre-existing packages like highlight.js.

After a quick play with highlight.js last night, it seems that everything needs to be rendered to the DOM before it styles the code blocks.

With my approach I will be styling the HTML prior to rendering to the page with EJS.
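Since the markup is built before the page is rendered, there is no `document` to call `createElement` on if this runs under Node. A minimal sketch of a string-only variant follows, assuming a hypothetical `escapeHtml` helper and a deliberately cut-down token pattern (`demoRx`), neither of which is from the original code. Unlike the version above, it also escapes the text between tokens.

```javascript
// Hypothetical helper: escape characters that are special in HTML.
const escapeHtml = (text) =>
  text.replace(/[&<>"']/g, (ch) => ({
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'
  }[ch]))

// Cut-down token pattern for illustration only (three token types).
const demoRx = /(?<comments>\/\/.*)|(?<keywords>\b(?:const|let|return)\b)|(?<numbers>\b\d+(?:\.\d+)?\b)/g

const highlightToString = (source) => {
  let html = ''
  let lastIndex = 0

  for (const match of source.matchAll(demoRx)) {
    // Escape the plain text between the previous token and this one.
    html += escapeHtml(source.slice(lastIndex, match.index))

    // Exactly one named group is defined for each match.
    const [type, text] = Object.entries(match.groups)
      .find(([, value]) => value !== undefined)

    html += `<span class="${type}">${escapeHtml(text)}</span>`
    lastIndex = match.index + match[0].length
  }

  // Escape whatever follows the last token.
  return html + escapeHtml(source.slice(lastIndex))
}

console.log(highlightToString('const a = 1 < 2 // done'))
```

The returned string can then be handed to the template as pre-rendered HTML.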

A bit of an update, now with line numbers

To be honest, your first CodePen gives me an error:

SyntaxError: Invalid regular expression: invalid group specifier name
at https://cdpn.io/cpe/boomboom/pen.js?key=pen.js-d70510e7-fb82-827a-a5c7-5193bda821cb:14

and the second one gives me an empty page and an empty console.

Thanks for taking a look, @Thallius. That’s no good!!

I have checked in Chrome, Opera, Firefox, Brave and Edge, and it works fine for me.

What browser are you using? (I’m guessing Safari)

The link you have provided just comes up with ‘Not Found’.

edit: This thread might shine some light on the issue

Safari doesn’t support lookbehinds yet. They have only had four years to implement this.

To be honest, I never understood why people use such extensive regexes.
I do not understand any of your regex, and I don’t think it is much faster than writing a few lines of code that do the same thing but can be understood by JavaScript programmers, without needing regex professionals.

Of course, if you have spent so much time understanding regex, it might be good for you. But I think you make it much harder for other people to understand what you are doing.

Because what he’s trying to do is write a Javascript syntactical parser. Inside the javascript syntactical parser. :wink:

Now I’m confused lol. edit (Took a minute. yes)

I am taking out a couple of the look behinds.

To give an example, for matching comments I have used a negative lookbehind

/(?<comment>(?<!:)\u002F{2}.*| .../

\u002F is a forward slash and {2} repeats it twice, i.e. //, followed by .* which matches any characters up to the end of that line.

I don’t want it to match https://address though

so I have added (?<!:) which will only match the // if it is not preceded by a colon :
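As a small illustration of that lookbehind, here is a sketch comparing the comment pattern with and without `(?<!:)`. The sample text and variable names are made up for the example, and it needs an engine that supports lookbehind assertions.

```javascript
// Made-up sample text: one URL, one trailing comment, one leading comment.
const sample = [
  'see https://example.com // trailing comment',
  '// leading comment'
].join('\n')

// With the negative lookbehind: // is skipped when preceded by a colon.
const withLookbehind = sample.match(/(?<!:)\/\/.*/g)

// Without it: the // inside the URL starts a "comment".
const withoutLookbehind = sample.match(/\/\/.*/g)

console.log(withLookbehind)
// [ '// trailing comment', '// leading comment' ]
console.log(withoutLookbehind)
// [ '//example.com // trailing comment', '// leading comment' ]
```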

I would be interested in how you would go about this without regexes.
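For comparison, a rough sketch of what a regex-free version of just this one rule might look like: a character-by-character scan that collects line comments and skips over quoted strings, so a // inside a string is never treated as a comment. `findLineComments` is a hypothetical name, and block comments, template literals and regex literals are deliberately not handled.

```javascript
// Hypothetical scanner: returns the line comments found in `source`.
const findLineComments = (source) => {
  const comments = []
  let i = 0

  while (i < source.length) {
    const ch = source[i]

    if (ch === '"' || ch === "'") {
      // Skip a quoted string, stepping over backslash-escaped characters.
      i++
      while (i < source.length && source[i] !== ch) {
        i += source[i] === '\\' ? 2 : 1
      }
      i++ // step past the closing quote
    } else if (ch === '/' && source[i + 1] === '/') {
      // A line comment runs to the end of the line (or of the input).
      let end = source.indexOf('\n', i)
      if (end === -1) end = source.length
      comments.push(source.slice(i, end))
      i = end
    } else {
      i++
    }
  }

  return comments
}

console.log(findLineComments('const u = "https://example.com" // real\n// second'))
// [ '// real', '// second' ]
```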

This is something I am using for my site, and is using NodeJS, so I would guess browsers are not going to be an issue?

I get your point. When I first learnt them I wanted to use them for everything. Not so these days, however I think they are appropriate for this task.

I wouldn’t :stuck_out_tongue:

What do you do with nested tags, though?

@m_hutley

There has been an element of winging it. I may well get caught out, but the order in which the regexes are processed does play a part: strings before comments, etc.

It is kind of why I asked the question about a slicker solution. I have a feeling something recursive might be an option, but I am feeling a bit dense of late.
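One place where order visibly matters in a single combined pattern: alternatives are tried left to right at each position, so when two token patterns can both match from the same character, the one listed first wins. A small sketch with two simplified patterns (not the ones from the token list above) and a hypothetical `firstTokenType` helper:

```javascript
// Simplified patterns for illustration; both can match text starting with "/".
const commentsFirst = /(?<comments>\/\*[\s\S]*?\*\/)|(?<regex>\/[^\/\n]+\/[a-z]*)/
const regexFirst = /(?<regex>\/[^\/\n]+\/[a-z]*)|(?<comments>\/\*[\s\S]*?\*\/)/

// Hypothetical helper: name of the group that produced the first match.
const firstTokenType = (rx, text) => {
  const match = rx.exec(text)
  if (!match) return null
  const [name] = Object.entries(match.groups)
    .find(([, value]) => value !== undefined)
  return name
}

const text = '/* note */ const x = 1'

console.log(firstTokenType(commentsFirst, text)) // comments
console.log(firstTokenType(regexFirst, text))    // regex
```

With the regex-literal alternative first, the block comment is consumed as if it were `/.../` with `* note *` as its body.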

I’m not banking on it, but does this version work for you, @Thallius?

I have removed all lookbehinds.

Testing Safari on a PC is a bit of an issue.

No it is not working. Looks like it is the strings const. When I replace it by /a/ it shows me some output.

Ok @Thallius, I have removed the look behind, missed that one. It was there for ignoring escaped quotes inside strings. Will have to look for another approach.
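One commonly used way to allow escaped quotes without a lookbehind is to consume a backslash together with the character after it, and otherwise accept any character that is neither a backslash nor the opening quote. This is only a sketch of that idea, not necessarily what ended up in the CodePen; `stringRx` and the sample are made up, and template literals containing `${...}` are treated as one plain string.

```javascript
// Opening quote, then (an escaped character | anything except a backslash
// or the opening quote) repeated, then the matching closing quote.
const stringRx = /(["'`])(?:\\[\s\S]|(?!\1)[^\\])*\1/g

// String.raw keeps the backslashes in the sample exactly as typed.
const sample = String.raw`const s = 'it\'s' + "a \"b\""`

const strings = sample.match(stringRx)

console.log(strings)
// Two matches, printed raw: 'it\'s' and "a \"b\""
```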

Dare I ask, does it work now?

I will have to look at a way of setting up Safari. I’m guessing it is a bit involved: VMs etc.

Edit: Just tested on LambdaTest with Safari, and it seems to be working for me?

It seems fine for me as well.