GROUP domain from url in MySql

hello,
I have a large database that contains many urls, there are many domains repeating and i’m trying to get only the domain.
eg:

http://example.com/someurl.html
http://example.com/someurl_on_the_same_domain.html
http://example.net/myurl.php
http://example.org/anotherurl.php

and i want to get only domains, eg:

http://example.com
http://example.net
http://example.org

My query is:

SELECT id, SUBSTR(site, 1 , LOCATE('/', site, 8)-1) as domain
FROM table
GROUP BY domain 
ORDER BY id DESC 
LIMIT 50

this query is fine but it doesn’t retrieve domains only, eg. http://example.com

I think i need to use conditional functions here but i’m not mysql guru

yes, it is ~a~ solution, but it’s not a very good one

here, allow me to demonstrate

first, some test data –

CREATE TABLE test_urls
( id INTEGER NOT NULL PRIMARY KEY 
, site VARCHAR(99)
);
INSERT INTO test_urls VALUES
 (1,'http://example.com/test.html')
,(2,'http://www.example.com/test.html')
,(3,'example.com/test.html')
,(4,'www.example.com/test.html')
,(5,'https://example.com/test.html')
,(6,'https://www.example.com/test.html')
,(7,'http://maps.example.com/test.html')
;

now, let’s try buddy’s query –

SELECT
    SUBSTR(site, 1 , IF(LOCATE('/', site, 8), LOCATE('/', site, 8)-1, LENGTH(site)))
        as OnlyDomain
    FROM test_urls
    GROUP BY OnlyDomain
    ORDER BY id DESC LIMIT 50
;
[COLOR="Red"][B]OnlyDomain[/B]
http://maps.example.com
https:/
www.example.com
example.com
http://www.example.com
http://example.com[/COLOR]

and now, let’s try my query –

SELECT SUBSTRING_INDEX(REPLACE(REPLACE(site,'http://',''),'https://',''),'/',1) as domain 
  FROM test_urls          
GROUP 
    BY domain 
;
[COLOR="red"][B]domain[/B]
example.com
maps.example.com
www.example.com[/COLOR]

which results do you prefer?

:cool:

this is a solution i found somewhere else:

SELECT
    SUBSTR(site, 1 , IF(LOCATE('/', site, 8), LOCATE('/', site, 8)-1, LENGTH(site)))
        as OnlyDomain
    FROM tablename
    GROUP BY OnlyDomain
    ORDER BY id DESC LIMIT 50

credit: shamittomar

well thanks a lot for all the help,
i’ve leaned some new things.
I think i will read your book sometime

would you please at least try the query i gave you?

well, it worked! :eek:
but it’s really necessary to replace http/s ?

thx again!

thx for the replies i guess i’m not clear enough,
i have those url’s in my db fo example:

http://www.example.com
http://www.example.com/nyurl.net

with the current query i posted i’m only getting
http://www.example.com/nyurl.net

and not
http://www.example.com

@r937 thanks for the code but i didn’t need to replace http, i guess i didn’t explain myself right.

try this –

SELECT MAX(id) AS latest_id
     , SUBSTRING_INDEX(REPLACE(REPLACE(site,'http://',''),'https://',''),'/',1) as domain 
  FROM daTable          
GROUP 
    BY domain 
ORDER 
    BY latest_id DESC LIMIT 50 

i’m trying to explain that you ~do~ need to remove the http:// :slight_smile:

SELECT site
  FROM daTable
 WHERE site [COLOR="Blue"]NOT LIKE [/COLOR]'http://%'

I’m not exactly understand what are you trying to know?

so there are 79 rows without http:// on the front

what does that tell you?

:slight_smile:

rows 	necessary
2926 	2847

could you run the COUNT query in my previous post please

:slight_smile:

I have large database contains many referrals and i want to pull the last 50 domains and to avoid duplicates, so i need to remove the unwanted string and group the wanted results.

i can answer this if you would please tell me what this query produces,

SELECT COUNT(*) AS rows
     , COUNT(CASE WHEN site LIKE 'http://%'
                  THEN 'yes it is' 
                  ELSE NULL END) AS necessary
  FROM daTable

Isn’t http://example.com a domain? :shifty: