How to create a list of values with tree walker?

On this webpage there are names of Wikipedia special pages, i.e. pages in the “Special:” namespace.
The names are scattered throughout this long webpage.

I can match the names, download them, and sort them in a list via the shell, this way:

curl https://en.wikipedia.org/wiki/Help:Special_page -s | grep -oP 'Special:\K[a-zA-Z0-9]*' | sort -u > special_page_names

JavaScript

I ask primarily for the sake of learning and experimentation.
Is there a way to save the names to the clipboard, filtered similarly (as with grep and sort), via a JavaScript tree walker?

const regex = /Special:\K[a-zA-Z0-9]*/
const walker = document.createTreeWalker(
  document.body, 
  NodeFilter.SHOW_TEXT
)
let node;
while ((node = walker.nextNode())) {
    // CODE FOR COPYING SPECIAL PAGE NAMES TO CLIPBOARD COMES HERE
}

Well, firstly I would adjust that while loop so that it doesn’t do variable assignment inside the loop condition.

        let node = walker.nextNode();
        while (node) {
            // CODE FOR COPYING SPECIAL PAGE NAMES TO CLIPBOARD COMES HERE
            ...
            node = walker.nextNode();
        }

With the regex, JavaScript doesn’t support using \K. Instead we use capture groups.

        const regex = /Special:([a-zA-Z0-9]*)/
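As a side note on replacing \K: a capture group is the most widely supported option, but engines that implement ES2018 lookbehind can also match only the name itself. Here is a small sketch comparing the two. The sample string is made up for illustration and is not taken from the page.

```javascript
// Sample text is an assumption, made up for illustration.
const sample = "See Special:RecentChanges for details.";

// Capture group: the full match is in [0], the name is in [1].
const withGroup = sample.match(/Special:([a-zA-Z0-9]*)/);
console.log(withGroup[0]);
console.log(withGroup[1]);

// Lookbehind (ES2018): the match itself is just the name, much like \K in PCRE.
const withLookbehind = sample.match(/(?<=Special:)[a-zA-Z0-9]*/);
console.log(withLookbehind[0]);
```

The capture group version is the one used in the rest of this answer.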

Then I would use that loop for populating an array of special lines.

        const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
        const specialLines = [];
        const regex = /Special:([a-zA-Z0-9]*)/
        let node = walker.nextNode();
        while (node) {
            if (regex.test(node.textContent)) {
                specialLines.push(node.textContent);
            }
            node = walker.nextNode();
        }

From there, we use map to get the capture group of the regex.

        const specialTerms = specialLines.map(function getTerm(line) {
            return line.match(regex)[1];
        });

I could have done that in the while loop, but by doing it outside of the loop we reduce the complexity of the code and make it easier to understand.
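One difference from the shell pipeline is worth noting: the map above keeps only the first match in each text node, and nothing is sorted or deduplicated yet. Here is a minimal sketch of one way to cover that, assuming matchAll (ES2020) is available. The helper name extractSpecialNames and the sample strings are hypothetical.

```javascript
// Hypothetical helper: collects every "Special:" name from a list of strings,
// then removes duplicates and sorts, roughly like `grep -o ... | sort -u`.
function extractSpecialNames(texts) {
    // + instead of * so that a bare "Special:" does not yield an empty name.
    const regex = /Special:([a-zA-Z0-9]+)/g;
    const names = new Set();
    for (const text of texts) {
        // matchAll needs the g flag and yields one match object per occurrence.
        for (const match of text.matchAll(regex)) {
            names.add(match[1]);
        }
    }
    // Default sort compares UTF-16 code units, which may differ from the
    // locale-dependent order that the shell's sort produces.
    return [...names].sort();
}

// Made-up sample input for illustration.
const sampleTexts = ["Special:Search and Special:Log", "Special:Log again", "no match here"];
console.log(extractSpecialNames(sampleTexts));
```

With the array built by the walker loop, this would be called as extractSpecialNames(specialLines).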

Then it’s just a matter of copying that specialTerms array to the clipboard, with a similar output to console.log in case writing to the clipboard doesn’t work.

        if (navigator && navigator.clipboard && navigator.clipboard.writeText) {
            navigator.clipboard.writeText(specialTerms.join(", "));
        }
        console.log(specialTerms.join(" "));
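One thing the check above does not cover: navigator.clipboard.writeText() returns a Promise, and that Promise can reject, for example when the page is not allowed to write to the clipboard. Here is a minimal sketch of handling that. copyOrLog is a hypothetical helper name, and its clipboard parameter exists only so the function can be tried outside a browser.

```javascript
// Hypothetical helper: tries the async Clipboard API and falls back to logging.
// The second parameter defaults to the browser clipboard when there is one.
async function copyOrLog(text, clipboard = globalThis.navigator?.clipboard) {
    if (clipboard && typeof clipboard.writeText === "function") {
        try {
            await clipboard.writeText(text);
            return true;
        } catch (error) {
            // The write was refused, so fall through to the console output.
            console.warn("Clipboard write failed:", error);
        }
    }
    console.log(text);
    return false;
}
```

In the code above it would be called as copyOrLog(specialTerms.join(" ")), and the returned Promise resolves to true only when the clipboard write succeeded.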

Here’s the full code.

        const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
        const specialLines = [];
        const regex = /Special:([a-zA-Z0-9]*)/
        let node = walker.nextNode();
        while (node) {
            if (regex.test(node.textContent)) {
                specialLines.push(node.textContent);
            }
            node = walker.nextNode();
        }
        const specialTerms = specialLines.map(function getTerm(line) {
            return line.match(regex)[1];
        });
        if (navigator && navigator.clipboard && navigator.clipboard.writeText) {
            navigator.clipboard.writeText(specialTerms.join(", "));
        }
        console.log(specialTerms.join(" "));

All of that could be done inside of the while loop, but doing it the way I’ve done above helps to reduce complexity of what’s going on.


Instead of filtering inside the while loop, we can use an acceptNode filter, as the createTreeWalker documentation page shows.

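const regex = /Special:([a-zA-Z0-9]*)/;
const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT, {
    acceptNode: function regexFilter(node) {
        if (regex.test(node.textContent)) {
            return NodeFilter.FILTER_ACCEPT;
        }
        // Return an explicit value for text nodes that don't match.
        return NodeFilter.FILTER_SKIP;
    }
});
const specialLines = [];
// Start from nextNode(): walker.currentNode begins at the root (document.body),
// which the filter never checks.
let node = walker.nextNode();
while (node) {
    specialLines.push(node.textContent);
    node = walker.nextNode();
}

I also made the starting value of node explicit. The example on the documentation page starts from walker.currentNode, but that is the root node (document.body here), which the filter never checks, so this loop starts from walker.nextNode() instead.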

Then we can move the bulk of the code out to a separate nodesToFilteredText() function.

function nodesToFilteredText(nodesParent, regex) {
    const walker = document.createTreeWalker(nodesParent, NodeFilter.SHOW_TEXT, {
        acceptNode: function regexFilter(node) {
            if (regex.test(node.textContent)) {
                return NodeFilter.FILTER_ACCEPT;
            }
            // Return an explicit value for text nodes that don't match.
            return NodeFilter.FILTER_SKIP;
        }
    });
    const specialLines = [];
    // Start from nextNode(): walker.currentNode begins at nodesParent itself,
    // which the filter never checks.
    let node = walker.nextNode();
    while (node) {
        specialLines.push(node.textContent);
        node = walker.nextNode();
    }
    return specialLines.map(function (line) {
        return line.match(regex)[1];
    });
}
const specialTerms = nodesToFilteredText(document.body, /Special:([a-zA-Z0-9]*)/);
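Because nodesToFilteredText() accepts any regex, one caveat seems worth noting. A regex with the g or y flag makes test() stateful through lastIndex, and with the g flag match() returns all full matches instead of the capture groups. Here is a sketch of a guard. withoutGlobalFlag is a hypothetical helper, not part of the code above.

```javascript
// Hypothetical helper: returns a copy of the regex without the g and y flags,
// so that test() is stateless and match() exposes capture groups.
function withoutGlobalFlag(regex) {
    return new RegExp(regex.source, regex.flags.replace(/[gy]/g, ""));
}

const globalRegex = /Special:([a-zA-Z0-9]*)/g;
console.log(globalRegex.test("Special:Search")); // true
console.log(globalRegex.test("Special:Search")); // false, lastIndex was left at the end

const safeRegex = withoutGlobalFlag(globalRegex);
console.log(safeRegex.test("Special:Search")); // true
console.log(safeRegex.test("Special:Search")); // true
console.log("Special:Search".match(safeRegex)[1]); // "Search"
```

One option would be to call this on the regex parameter at the top of nodesToFilteredText(), so that callers can pass either kind of regex.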

We can also use a separate function that calls nodesToFilteredText() and copies the result to the clipboard.

function copyFilteredNodesToClipboard(nodesParent, regexFilter) {
    const specialTerms = nodesToFilteredText(nodesParent, regexFilter);
    if (navigator && navigator.clipboard && navigator.clipboard.writeText) {
        navigator.clipboard.writeText(specialTerms.join(", "));
    }
    console.log(specialTerms.join(" "));
}
copyFilteredNodesToClipboard(document.body, /Special:([a-zA-Z0-9]*)/);

That way those functions can be preloaded into the browser console, and you can run things just with the following line:

copyFilteredNodesToClipboard(document.body, /Special:([a-zA-Z0-9]*)/);