Regex to get domain from email address

Hello

I am looking for a regex or a way to get the domain name (without the subdomain) from an email address.

For example, in case of the email address phantom@subdomain.domain.co.in the result should be domain.co.in

or hello@example.com should return example.com

I am currently using the following code, but it fails in some cases:

$email = 'hi@sub.domain.gov.net';
preg_match('/@((([^.]+)\.)+)([a-zA-Z]{2,}|[a-zA-Z.]{5,})/', $email, $emailMatches);

$emailDomain     = isset($emailMatches[3]) && isset($emailMatches[4]) ? $emailMatches[3] . '.' . $emailMatches[4] : $email; // Gets just the domain (without a sub-domain)

But it fails for cases like hi@sub.domain.gov.net

Does anyone know an elegant way to get a domain name (without a subdomain) from an email address using PHP?

Thanks

If you explode the email address on ‘.’ then the domain will be in the last entry of the array: in, com and net in your three examples.

The problem is knowing how many elements of the array make up the actual domain name, and where the subdomains (if any) are. I don’t know much about regex so I can’t think of how to do even simple things with that, but I also can’t think of a way of doing it reliably at all, except for progressively adding an extra element backwards from the end and doing a domain look-up to see if it exists.
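To make the "adding an extra element backwards from the end" idea concrete, here is a minimal sketch that only builds the list of candidates, shortest first. The function name domainCandidates is made up for illustration, and the look-up step itself is left out (each candidate could be passed to something like PHP's checkdnsrr(), but whether a look-up reliably identifies the "real" domain is exactly the open question here).

```php
<?php
// Hypothetical helper, not from the original posts: returns every
// dot-separated suffix of the host part of an address, shortest first.
function domainCandidates(string $email): array
{
    // Everything up to and including the last '@' is ignored.
    $at = strrpos($email, '@');
    $host = $at === false ? $email : substr($email, $at + 1);

    $labels = explode('.', strtolower($host));

    $candidates = [];
    // Walk backwards from the last label, adding one more label each time.
    for ($i = count($labels) - 1; $i >= 0; $i--) {
        $candidates[] = implode('.', array_slice($labels, $i));
    }

    return $candidates;
}
```

For hi@sub.domain.gov.net this gives net, gov.net, domain.gov.net and sub.domain.gov.net, which shows why the split alone cannot say which candidate is the one wanted.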

I must be missing something. How does that help the OP? My understanding is he wants domain.gov.net in his example.

I was asking for clarification. domain.gov.net is a subdomain of gov.net, so by the criteria provided gov.net is a possible answer. Likewise, gov.net is a subdomain of net, so by the same criteria net would also be a possible answer.

The OP needs to define what they mean by the difference between a domain and a subdomain, as anything in front of a dot is a subdomain of what comes after the dot.

My understanding is that in terms of getting domains from email addresses:

Anything before and including the @ can be disregarded.

That leaves (going from right to left) Top Level Domain preceded by Second Level Domain …
Then optionally preceded by n Level Domains up to effectively no limit

That is, something like this would be considered valid.
name@ab.cd.ef.gh.ij.kl.mn.no.pq.rs.tu.vw.xy.com

So it should be easy enough to not capture the TLD, but I don’t see any easy way to parse out anything more that would apply to all possible addresses, other than just capturing what’s left over after that.


Unfortunately the OP hasn’t been back to clarify his requirement yet…

Thanks everyone for your responses.

I realised that getting a domain name (without a subdomain) is not possible because the domain extension can be 1, 2 or even 3 words. So even if we use some kind of technique, it will be a patch and there will be no guarantee that the result is what I wanted.

Thanks


It is possible, but not automatically with a generic regex. From what you wrote it seems like there are only two possibilities for the main domain: 2 or 3 segments, like example.com, domain.co.in, domain.gov.net, etc. I would simply make a list of all possible second-level domains (co.in, gov.net and so on) and check if the last 2 segments can be found in the list. If yes, then the last 3 segments are the main domain you are interested in instead of the standard 2, and it’s easy to extract the domain you want. Even simple explode() functions would do it, no need for a regex.
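A rough sketch of that list-based approach, using explode() as suggested. The constant TWO_PART_SUFFIXES and the function mainDomain are names invented here, and the list only contains the examples from this thread plus co.uk, so it is an assumption that would need to be replaced with a complete list for real use.

```php
<?php
// Hypothetical and deliberately incomplete list of two-segment endings.
// co.in and gov.net come from the examples in this thread.
const TWO_PART_SUFFIXES = ['co.in', 'gov.net', 'co.uk'];

// Hypothetical helper: returns the main domain of an email address,
// or null if the address has no usable host part.
function mainDomain(string $email, array $suffixes = TWO_PART_SUFFIXES): ?string
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return null;
    }

    $labels = explode('.', strtolower(substr($email, $at + 1)));
    $count = count($labels);
    if ($count < 2) {
        return null;
    }

    // If the last two segments are a known ending, keep three segments;
    // otherwise keep the standard two.
    $lastTwo = implode('.', array_slice($labels, -2));
    $keep = in_array($lastTwo, $suffixes, true) ? 3 : 2;

    // The host is nothing but the ending itself, e.g. someone@co.in.
    if ($count < $keep) {
        return null;
    }

    return implode('.', array_slice($labels, -$keep));
}
```

With this list, phantom@subdomain.domain.co.in gives domain.co.in, hello@example.com gives example.com, and hi@sub.domain.gov.net gives domain.gov.net. Any ending missing from the list falls back to the last two segments, so the result is only as good as the list.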
