Regex help if you please, shouldn't be a hard one!

I have a situation where I need to remove a HTML link from a string. This is simple enough, just capture the <a href up to the </a> right? I can do a lazy search on this pretty easy, works fine.

But in most cases I won’t have the entire HTML to search, because the app truncates the text. For example it might look like this:
some text and <a href="part of a link" target="_bl... some more different text

If the link is truncated, I still want to remove it so it reads “some text and … some more different text” or whatever. I just don’t want the html in there.

I need the regex to work with JavaScript’s string.replace() method.

What I’m finding is that it’s hard to capture the two alternate conditions with an OR while also using character classes and even a caret ^ “not” rather than .* everything. I try to capture “up to” the ellipses but for some reason the capture includes one of the periods which I don’t want.

So capture EITHER <a href up to (and not including) the three dots, OR from <a href to </a>.

Struggling a little on this one.

You might appreciate the following post on the dangers of parsing HTML using regex

I recommend that you instead use some conditional statements to figure out which type of situation that you’re dealing with there, and then use separate functions to handle each type separately.
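
As a rough sketch of that conditional approach: the function names below are made up for illustration, and it assumes at most one link per string and a literal three-dot ellipsis where the text was cut off.

```javascript
// Hypothetical helpers; none of these names come from the thread.

// Handles a link that has both its opening tag and its </a>.
function removeCompleteLink(text, start, end) {
    return text.slice(0, start) + text.slice(end + "</a>".length);
}

// Handles a link that was cut off before its </a>.
function removeTruncatedLink(text, start) {
    var dots = text.indexOf("...", start);
    // Assumption: with no ellipsis to stop at, drop everything from the link onward.
    if (dots === -1) {
        return text.slice(0, start);
    }
    return text.slice(0, start) + text.slice(dots);
}

// Works out which situation applies, then hands off to the matching helper.
function removeLink(text) {
    var start = text.indexOf("<a ");
    if (start === -1) {
        return text;
    }
    var end = text.indexOf("</a>", start);
    if (end === -1) {
        return removeTruncatedLink(text, start);
    }
    return removeCompleteLink(text, start, end);
}

console.log(removeLink('some text and <a href="part of a link" target="_bl... some more different text'));
// some text and ... some more different text
```

Keeping each case in its own function means a new situation can be added without touching the others.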


It’s not a mission critical application, just annoying to see raw HTML so I want to blank it out.

Are you saying it would be too difficult to try and use some “OR” logic in a single regex?

You could try
[some-parent-element].getElementsByTagName('a')
then set each to its textContent in a loop.

Hi there!

You can parse HTML using document.createElement and setting its innerHTML.
This way, you have all DOM selecting and manipulation tools available without touching a string manually.

This is how I have come up with the solution:

Here is the function:

function parseHTML(html) {
    let div = document.createElement('div')
    div.innerHTML = html
    return div
}

var html = parseHTML('<a href="#">test!</a> string outside')
html.querySelectorAll('a').forEach(item=>item.remove())
console.log(html.innerHTML)

There’s a problem Martin - in that your code doesn’t achieve what the OP is requiring. Here’s what he’s asking for.

So removing the partial HTML up to the three periods is what’s required in this situation.

Oh! I forgot that he was only able to get the truncated HTML. Is there a better solution than regex in this case then? Sorry for the misleading answer.

With his current situation, he isn’t even able to make it a HTML element and then just get the textContent. This sucks. :neutral_face:

Yes :slight_smile:

And yet, conceptually it doesn’t seem like a hard query if I can just find <a href up until the ellipsis. For some reason this is not very easy, especially when I want to also catch the alternative for when the complete html tag actually is intact.

Note that this will only ever be an <a> link, not any other kind of HTML. And if the string is truncated before any links show up, there is nothing to find. So it will always be an <a> and it will always be three dots when truncated.
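
Under those constraints, one possible single-regex sketch uses an alternation: try a complete link first, and fall back to a cut-off link that stops just before the three dots. The name stripLink is hypothetical, and this is not the regex that was posted in the thread. The lookahead (?=\.\.\.) checks for the dots without consuming them, which is one way to avoid the match swallowing a period.

```javascript
// Hypothetical helper name; a sketch, not the thread's actual solution.
function stripLink(text) {
    // First alternative: a complete <a ...>...</a> element.
    //   [^<>]* keeps the opening tag from running into a later tag.
    // Second alternative: an <a that was cut off, matched lazily
    //   up to (but not including) the first "...".
    var linkPattern = /<a\b[^<>]*>[\s\S]*?<\/a>|<a\b[\s\S]*?(?=\.\.\.)/g;
    return text.replace(linkPattern, "");
}

console.log(stripLink('some text and <a href="part of a link" target="_bl... some more different text'));
// some text and ... some more different text
console.log(stripLink('some text and <a href="#">a link</a> some more different text'));
// some text and  some more different text
```

The order of the alternatives matters: the complete-link pattern has to come first, otherwise a whole link whose text happens to contain three dots would be treated as truncated. Removing a complete link this way leaves a double space behind, which may or may not matter for display.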

I’ve come up with this one @zack1:

Let’s do this more appropriately, using tests to help drive the code.

We can use Jasmine for testing, along with its info on starting with a simple project.

The first test is quite easy, where we start with the most basic test possible, to help ensure that the plumbing works correctly.

describe("Remove link from text", function () {
    it("contains spec with an expectation", function() {
        expect(true).toBe(true);
    });
});

With that as your spec, and SpecRunner.html containing a reference to your spec, things should go well.

  <!-- include spec files here... -->
  <script src="spec/removeLinkFromText.spec.js"></script>

When running SpecRunner with only that one spec, you should see 1 spec, 0 failures

The first main test can now be done in regard to the link text. There seem to be several types of possibilities, that involve the ellipses and link.

The ellipses have two possibilities, being either there or not. The link can either be there, incomplete, or missing, as well as before or after the ellipses.

That gives several different possible configurations:

  • some text and some more different text
  • some text and... some more different text
  • some text and <a href="">also some more different text
  • some <a href="#">text
  • some text and <a href="part of a link" target="_blank">also</a>... some more different text
  • some text and <a href="part of a link" target="_bl... some more different text
  • some text and... <a href="part of a link" target="_blank">also</a> some more different text
  • some text and... target="_blank">also</a> some more different text
  • some text and... /a> some more different text
  • some text and <a href="part of a link" target="_blank">... also</a> some more different text

The first type of text can be given a test that looks like:

    describe("text has no link", function () {
        it("and no ellipses", function () {
            var text = "this text contains no link or ellipses";
            expect(removeLinkFromText(text)).toBe(text);
        });
    });

Running SpecRunner.html gives us an error, because the removeLinkFromText() function wasn’t found. We need to create the `removeLinkFromText.js` file and put the expected function in it:

function removeLinkFromText() {
}

and use that file in the spec runner as our source file.

  <!-- include source files here... -->
  <script src="src/removeLinkFromText.js"></script>

You should now see a different SpecRunner error, saying Expected undefined to be 'this text contains no link or ellipses'

Update the script to return what is given to the function, and our first simple test should pass.

function removeLinkFromText(text) {
    return text;
}

The SpecRunner is now all green, so we can refactor the code if we want to, before adding another test.

    describe("text has no link", function () {
        ...
        it("but has ellipses", function () {
            var text = "this text contains ... no link";
            expect(removeLinkFromText(text)).toBe(text);
        });
    });

Run the test runner and the tests should still pass. These are basic tests, but they will be useful when we add code to split up the text, to help ensure that what goes in will still correctly come out.

Next we can work on removing a link from the text, starting with a test.

describe("text has link", function () {
    it("removes a html link from the text", function () {
        var text = "some text and <a href=\"\">also</a> some more different text";
        expect(removeLinkFromText(text)).toBe("some text and also some more different text");
    });
});

When we run the test we see that it’s red. To start making progress on this, we’ll return early with the existing working code. Then we can separate the different parts of the link and return the combined text.

function removeLinkFromText(text) {
    var linkStart = text.indexOf("<a ");
    var linkEnd = text.indexOf("</a>");
    if (linkStart === -1 && linkEnd === -1) {
        return text;
    }
    var beforeText = text.substring(0, linkStart);
    var insideText = text.substring(text.indexOf(">", linkStart) + 1, linkEnd);
    var afterText = text.substring(linkEnd + 4, text.length);
    return beforeText + insideText + afterText;
}

The test now turns green, and we can refactor while attempting to keep the test in the green.

function removeLinkFromText(text) {
    var filteredText = "";
    var linkStart = text.indexOf("<a ");
    var linkEnd = text.indexOf("</a>");
    var beforeText = "";
    var insideText = "";
    var afterText = "";
    if (linkStart === -1 && linkEnd === -1) {
        filteredText = text;
    }
    if (linkStart > -1 && linkEnd > -1) {
        beforeText = text.substring(0, linkStart);
        insideText = text.substring(text.indexOf(">", linkStart) + 1, linkEnd);
        afterText = text.substring(linkEnd + 4, text.length);
        filteredText = beforeText + insideText + afterText;
    }
    return filteredText;
}

The code is improved, and we can move on to the next test.

    it("removes a link if it has no end", function () {
        var text = "some text and <a href=\"\">also";
        expect(removeLinkFromText(text)).toBe("some text and also");
    });

After getting a red test, the code to make it a passing test is fairly easy.

    if (linkStart > -1 && linkEnd === -1) {
        beforeText = text.substring(0, linkStart);
        insideText = text.substring(text.indexOf(">", linkStart) + 1, text.length);
        filteredText = beforeText + insideText;
    }

Refactoring this, though, is the next challenge. We can move some of the code out to separate functions, insideLinkText() and AfterLinkText().

function insideLinkText(text, linkStart, linkEnd) {
    var contentEnd = linkEnd;
    if (contentEnd === -1) {
        contentEnd = text.length;
    }
    return text.substring(text.indexOf(">", linkStart) + 1, contentEnd);
}
function AfterLinkText(text, linkEnd) {
    var textStart = linkEnd + 4;
    if (linkEnd === -1) {
        textStart = text.length;
    }
    return text.substring(textStart, text.length);
}

Those functions now allow us to pass the same information for the insideText and afterText variables, which lets us condense both of the linkStart > -1 if statements into one statement:

    if (linkStart > -1) {
        beforeText = text.substring(0, linkStart);
        insideText = insideLinkText(text, linkStart, linkEnd);
        afterText = AfterLinkText(text, linkEnd);
        filteredText = beforeText + insideText + afterText;
    }

It also makes sense to move the no-link if statement down near the bottom of the function.

    if (linkStart > -1) {
        ...
    }
    if (linkStart === -1 && linkEnd === -1) {
        filteredText = text;
    }
    return filteredText;

The tests still pass, so we can move on to the next test, where we start working on the ellipses.

    it("removes a link from before ellipses", function () {
        var text = "some text and <a href=\"part of a link\" target=\"_blank\">also</a>... some more different text";
        expect(removeLinkFromText(text)).toBe("some text and also some more different text");
    });

The rest can be left as an exercise, where the remainder of the tests are worked on, but this makes for a good start.

The test code is currently:

/*jslint browser */
/*global describe, it, expect, removeLinkFromText */
describe("Remove link from text", function () {
    it("contains spec with an expectation", function() {
        expect(true).toBe(true);
    });
    describe("text has no link", function () {
        it("and no ellipses", function () {
            var text = "this text contains no link or ellipses";
            expect(removeLinkFromText(text)).toBe(text);
        });
        it("but has ellipses", function () {
            var text = "this text contains ... no link";
            expect(removeLinkFromText(text)).toBe(text);
        });
    });
});
describe("text has link", function () {
    it("removes a html link from the text", function () {
        var text = "some text and <a href=\"\">also</a> some more different text";
        expect(removeLinkFromText(text)).toBe("some text and also some more different text");
    });
    it("removes a link if it has no end", function () {
        var text = "some text and <a href=\"\">also";
        expect(removeLinkFromText(text)).toBe("some text and also");
    });
    it("removes a link from before ellipses", function () {
        var text = "some text and <a href=\"part of a link\" target=\"_blank\">also</a>... some more different text";
        expect(removeLinkFromText(text)).toBe("some text and also some more different text");
    });
// some text and <a href="part of a link" target="_blank">also</a>... some more different text
// some text and <a href="part of a link" target="_bl... some more different text
// some text and... <a href="part of a link" target="_blank">also</a> some more different text
// some text and... target="_blank">also</a> some more different text
// some text and... /a> some more different text
// some text and <a href="part of a link" target="_blank">... also</a> some more different text
});

And the current code for those tests is:

/*jslint browser */
function insideLinkText(text, linkStart, linkEnd) {
    var contentEnd = linkEnd;
    if (contentEnd === -1) {
        contentEnd = text.length;
    }
    return text.substring(text.indexOf(">", linkStart) + 1, contentEnd);
}
function AfterLinkText(text, linkEnd) {
    var textStart = linkEnd + 4;
    if (linkEnd === -1) {
        textStart = text.length;
    }
    return text.substring(textStart, text.length);
}
function removeLinkFromText(text) {
    var filteredText = "";
    var linkStart = text.indexOf("<a ");
    var linkEnd = text.indexOf("</a>");
    var beforeText = "";
    var insideText = "";
    var afterText = "";
    if (linkStart > -1) {
        beforeText = text.substring(0, linkStart);
        insideText = insideLinkText(text, linkStart, linkEnd);
        afterText = AfterLinkText(text, linkEnd);
        filteredText = beforeText + insideText + afterText;
    }
    if (linkStart === -1 && linkEnd === -1) {
        filteredText = text;
    }
    return filteredText;
}

The suite has been left with a non-passing test, so that when we come back to the code we have immediate feedback that helps us to quickly figure out where to start working on it again.
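
For anyone picking it up from that red test, here is one possible way to get it passing. This is only a sketch: afterLinkTextSkippingDots and removeLinkAndDots are hypothetical names, and it assumes the ellipsis should be dropped only when it sits directly after the closing </a>. It also returns the text unchanged whenever there is no opening <a tag, which differs slightly from the code above.

```javascript
// Sketch only: afterLinkTextSkippingDots and removeLinkAndDots are
// hypothetical names, not part of the code in this thread.
function insideLinkText(text, linkStart, linkEnd) {
    var contentEnd = linkEnd;
    if (contentEnd === -1) {
        contentEnd = text.length;
    }
    return text.substring(text.indexOf(">", linkStart) + 1, contentEnd);
}

function afterLinkTextSkippingDots(text, linkEnd) {
    if (linkEnd === -1) {
        // No closing tag, so nothing comes after the link.
        return "";
    }
    var textStart = linkEnd + "</a>".length;
    // Assumption: drop an ellipsis only when it directly follows </a>.
    if (text.substring(textStart, textStart + 3) === "...") {
        textStart += 3;
    }
    return text.substring(textStart);
}

function removeLinkAndDots(text) {
    var linkStart = text.indexOf("<a ");
    var linkEnd = text.indexOf("</a>");
    if (linkStart === -1) {
        // No opening tag: return the text untouched.
        return text;
    }
    return text.substring(0, linkStart)
        + insideLinkText(text, linkStart, linkEnd)
        + afterLinkTextSkippingDots(text, linkEnd);
}

console.log(removeLinkAndDots("some text and <a href=\"part of a link\" target=\"_blank\">also</a>... some more different text"));
// some text and also some more different text
```

The earlier expectations (no link, link with no end, complete link) still hold with this version, so it could be swapped in for removeLinkFromText() before moving on to the remaining cases.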


This got the job done for my use case. I tried to avoid using .* but it works.

Now you’re just showing off :smiley:
Very impressive
