Ignore a Part of a String Using REGEX

I’m trying to get only the domain name and the top-level domain from URLs and ignore the rest using REGEX. I’m not sure my pattern is correct, although it does match what I’m looking for. In the URLs below I only want the domain name followed by dot net, dot com, dot biz, dot org, etc., and whatever comes after the top-level domain:
http://example.com/maps
https://www.example.biz
http://www.example.org/test
http://www.example.net

I have used the REGEX pattern below to match what I need and it seems to work, but something tells me it’s a fluke.
[^http:\/\/w*\.|https:\/\/w*\.]\w*\.\w*(\/\w*)?

The reason I have doubts about it working correctly is that when I remove |https:\/\/w*\. from inside the square brackets, it still continues to work.

Please do not use regex - it’s really not all that suitable and results in complex things that aren’t easily understood.

Use the location object, where the hostname property gives you exactly what you want.
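As a rough sketch of that idea: the location object describes the current page, so for URL strings held in a variable I'm assuming the standard URL constructor here, which exposes the same hostname property. The helper name getDomain is made up for illustration, and stripping a leading "www." is my assumption about what the question wants.

```javascript
// Hypothetical helper: returns the host of a URL string without a leading "www."
function getDomain(urlString) {
  // The URL constructor throws a TypeError if the string is not a valid absolute URL.
  const { hostname } = new URL(urlString);
  // hostname keeps any "www." prefix, so remove it when present.
  return hostname.replace(/^www\./, "");
}

const urls = [
  "http://example.com/maps",
  "https://www.example.biz",
  "http://www.example.org/test",
  "http://www.example.net",
];

for (const url of urls) {
  console.log(getDomain(url));
}
// Logs: example.com, example.biz, example.org, example.net
```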

Wasn’t aware of hostname, that is obviously preferable. I see it isn’t supported in Opera though.

To Liagapi555: the negated square bracket notation isn’t a group match pattern.

So [^abc] means do not match those characters e.g. a ‘c’, an ‘a’ or a ‘b’. It doesn’t mean do not match ‘abc’
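A small sketch of that behaviour, followed by the pattern from the question. The URL http://www.test.com is a made-up input of mine, chosen because the domain starts with a letter that sits inside the square brackets.

```javascript
// [^abc] matches ONE character that is not a, b or c.
const notAbc = /[^abc]/;
console.log(notAbc.test("abc"));       // false: every character is excluded
console.log(notAbc.test("cab"));       // false: the order does not matter
console.log("abcd".match(notAbc)[0]);  // "d"

// The pattern from the question, and the same pattern with the second half removed.
const original = /[^http:\/\/w*\.|https:\/\/w*\.]\w*\.\w*(\/\w*)?/;
const trimmed = /[^http:\/\/w*\.]\w*\.\w*(\/\w*)?/;

console.log("https://www.example.biz".match(original)[0]); // "example.biz"
console.log("https://www.example.biz".match(trimmed)[0]);  // "example.biz"

// Hypothetical URL: "t" is one of the excluded characters, so it gets skipped.
console.log("http://www.test.com".match(original)[0]);     // "est.com"
```

The last line is why the original pattern only appears to work: the match simply starts at the first character that isn't h, t, p, s, w or one of the listed symbols.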

The http/https pattern isn’t needed twice with an or ‘|’ between them

https? would do the trick for that bit, with the question mark making the ‘s’ optional, 0 or 1 times

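The reason deleting that chunk from your regex still works is that the square brackets only hold a set of single characters, and the characters in that chunk are nearly all in the set already. You also haven’t anchored the pattern to the start of the string with ^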

e.g. ^https?:\/{2}
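To see what the anchor and the optional ‘s’ do, here is a quick sketch. The ftp and "see ..." strings are inputs I made up for contrast.

```javascript
// Anchored: the string must START with http:// or https://
const scheme = /^https?:\/{2}/;

console.log(scheme.test("http://example.com/maps"));  // true
console.log(scheme.test("https://www.example.biz"));  // true
console.log(scheme.test("ftp://example.com"));        // false: different scheme
console.log(scheme.test("see http://example.com"));   // false: not at the start

// Without ^ the same pattern is allowed to match anywhere in the string.
const unanchored = /https?:\/{2}/;
console.log(unanchored.test("see http://example.com")); // true
```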

If you want to experiment with regular expressions, regex101.com is very handy. It will even tell you the number of steps taken in your matches

This is a good site for learning about regexes
https://www.regular-expressions.info/tutorial.html

Just playing around in regex, this was my attempt. Note I may well have not considered all edge cases, so it could possibly fail.

^https?:\/{2}(?:w{3}\.)?([^\/]+)\/?([^\/]+)?

Breakdown:

Start with http or https followed by 2 forward slashes. {2} indicates how many times the preceding character must repeat
^https?:\/{2}

Then an optional group matching www. The (?: ) is a non-capturing group, and the question mark after it again makes it optional
(?:w{3}\.)?

Then match everything up to an optional forward slash. This time in a capturing group using a negated character set [^\/] (anything that is not a forward slash)
([^\/]+)\/?

Lastly an optional negated character set inside a second capturing group, that will match anything that isn’t a forward slash (i.e. up to a possible next forward slash)
([^\/]+)?
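Putting the pieces of the breakdown together, a minimal sketch of how the two capturing groups could be read out. The helper name splitUrl and the shape of the returned object are my own choices for illustration, not anything from the thread.

```javascript
const pattern = /^https?:\/{2}(?:w{3}\.)?([^\/]+)\/?([^\/]+)?/;

// Hypothetical helper: returns the two captured groups, or null when nothing matches.
function splitUrl(urlString) {
  const match = urlString.match(pattern);
  if (match === null) return null;
  // match[1] is the first capturing group, match[2] the optional second one.
  // An unmatched optional group is undefined, which is converted to null here.
  return { domain: match[1], firstSegment: match[2] || null };
}

console.log(splitUrl("http://example.com/maps"));
// { domain: 'example.com', firstSegment: 'maps' }
console.log(splitUrl("https://www.example.biz"));
// { domain: 'example.biz', firstSegment: null }
console.log(splitUrl("http://www.example.org/test"));
// { domain: 'example.org', firstSegment: 'test' }
console.log(splitUrl("example.net"));
// null: no http:// or https:// at the start
```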

As I say, this may well fail. What if, instead of /test, the path is /test.php?firstname=name?

We could change the last bit to the following to exclude dots as well
([^\/.]+)?
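A quick comparison of the two versions of the last group on that kind of URL. The full URL below is one I assembled from the examples in this thread, so treat it as an assumed input.

```javascript
// Last group stops only at a forward slash.
const slashOnly = /^https?:\/{2}(?:w{3}\.)?([^\/]+)\/?([^\/]+)?/;
// Last group stops at a forward slash OR a dot.
const slashOrDot = /^https?:\/{2}(?:w{3}\.)?([^\/]+)\/?([^\/.]+)?/;

// Assumed input combining the earlier example with the query string.
const url = "http://www.example.org/test.php?firstname=name";

console.log(url.match(slashOnly)[2]);  // "test.php?firstname=name"
console.log(url.match(slashOrDot)[2]); // "test"

// The first group is unchanged in both versions.
console.log(url.match(slashOnly)[1]);  // "example.org"
console.log(url.match(slashOrDot)[1]); // "example.org"
```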

I think regexes are great, but the last bit does illustrate what Paul says: if there is a native tool like hostname, that is probably the better route.

The Opera issue is out of date. According to canIuse it is supported, and older versions of Opera are just unknown.

Just to add to this, there is also pathname, which could be useful. It returns the path, starting from the first ‘/’ after the host, e.g.

'http://somewhere.com/images/image-01.jpg'

// pathname: '/images/image-01.jpg'

MDN Location: pathname
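A short sketch of reading both properties from a string, again assuming the standard URL constructor rather than the page's own location object. The second URL is one I put together from the earlier examples.

```javascript
const imageUrl = new URL("http://somewhere.com/images/image-01.jpg");
console.log(imageUrl.hostname); // "somewhere.com"
console.log(imageUrl.pathname); // "/images/image-01.jpg"

// The query string is kept separately in search, so it never ends up in pathname.
const queryUrl = new URL("http://www.example.org/test.php?firstname=name");
console.log(queryUrl.pathname); // "/test.php"
console.log(queryUrl.search);   // "?firstname=name"

// With no path at all, pathname is a single slash for http(s) URLs.
console.log(new URL("http://www.example.net").pathname); // "/"
```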

Thank you all for your help. I will look into using hostname and pathname in my work and see what I can do with them.
