Hi,
My last question was not properly formatted and lacking sample data, so I am asking anew instead of editing it and creating confusion, if this is ok.
I’m working on a database table, hosted on MariaDB 10.6.2. It contains poorly formatted phone numbers and my goal is to extract those phone numbers. The extracted phone numbers should be 10-digit numbers and there should be no leading 1’s.
Table structure:
CREATE TABLE sample_testx (
id int(11) NOT NULL,
office_phone text DEFAULT NULL,
phone_number text DEFAULT NULL,
a_phone_number text DEFAULT NULL,
office text DEFAULT NULL,
good_phones varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Sample data:
INSERT INTO sample_testx (id, office_phone, phone_number, a_phone_number, office, good_phones) VALUES
(1, '301', '7725', '1(301)775', '1(301) 258-7757 EXT127\"', NULL),
(2, ' \"1(301) 254-8721\"', NULL, 'Direct:5152051266', '(205) 661-0811|(187) 737-19353', NULL),
(3, '650.3585100Ext.128', '1497005', '813-832-3985', '(888) 822-7434', NULL),
(4, '772-778-2131 x07/06', '248.557-1200 Ext', '407¿5901433', '<strong>Office:<-strong>', NULL),
(5, '954%560%4373', '727=-303-0889', ' ', '$4$-415-3575', NULL),
(6, '100090', NULL, ' \"1(301) 254-8721', NULL, NULL),
(7, '', NULL, 'error-404', '330-2225544a', NULL),
(8, '+1,324, 439,9114', '22290044-33', ' ', '...', NULL),
(9, '2,563,310,074', '20,575,566', '(334)794-0328 E', '205-655-8877 x2', NULL),
(10, '251-981-7470 x7487', '205-326-3000 x414-09', '[]', NULL, NULL),
(11, '(877) 890-0050|(256) 373-3450', '(256) 882-2223|(866) 218-0196|(256) 990-7653', NULL, '333 ext', NULL),
(12, '(256) 302-7611, (205) 661-0811, (256) 302-7611, (877) 371-9353, (877) 371-9353', '1(907) 252-2743\", \"1(907) 252-2743\", \"1(907) 252-2743', NULL, '555-444(3322)', NULL),
(13, 'Cellphone:515250', '1(650) 291-0470\", \"1(650) 291-0470\", \"1(DRE) :01-7482', '813-832-3985', '813-839-3463 x#', NULL),
(14, '989-686-3400 x#133', '248.557-1200 Ext', '305*335-0950', '561-859+5692', NULL),
(15, '573000000000', '239-472-5147`', '205.391-9903', '555-400-2030-6009008020-3338887744', NULL);
I want to extract phone numbers from the four columns office_phone, phone_number, a_phone_number, office and combine them comma separated into good_phones column. Some entries have non-numeric chars and leading 1’s. Also, some columns have multiple phone numbers to be extracted.
Here is my query so far:
UPDATE sample_testx AS s
JOIN (
WITH expanded_columns AS ( -- extract phones from multiple columns, preserve column order
SELECT id, office_phone AS phone_data, 1 AS col_num FROM sample_testx WHERE office_phone IS NOT NULL
UNION ALL
SELECT id, phone_number, 2 FROM sample_testx WHERE phone_number IS NOT NULL
UNION ALL
SELECT id, a_phone_number, 3 FROM sample_testx WHERE a_phone_number IS NOT NULL
UNION ALL
SELECT id, office, 4 FROM sample_testx WHERE office IS NOT NULL
)
, nn(part_num) AS (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) -- up to 10 entries per column
, data_parts AS (
SELECT *, SUBSTRING(dataN, (part_num - 1) * 10 + 1, 10) part
FROM (
SELECT id, col_num, phone_data
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(phone_data, '[(+.#]1', ''), '1[(+.#]', ''), '[^0-9]', '') dataN -- remove leading 1, remove all non-numeric chars
FROM expanded_columns
) a
LEFT JOIN nn ON part_num <= (length(dataN) + 9) / 10
)
SELECT id
, REPLACE(
GROUP_CONCAT(DISTINCT CASE WHEN LENGTH(part) = 10 THEN part END ORDER BY col_num, part_num SEPARATOR ',') -- combine unique 10-digit phones with ,
,',,',',') extracted_phones
FROM data_parts
GROUP BY id
ORDER BY id
) AS extracted
ON s.id = extracted.id
SET s.good_phones = extracted.extracted_phones;
This query checks the leading 1 in a manually specified number of scenarios in this regex line:
REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(phone_data, '[(+.#]1', ''), '1[(+.#]', ''), '[^0-9]', '')
which are (+.# but it needs to work for any char before or after the leading 1, which I failed to achieve. I tried a “remove all non-numeric chars regex” but it didn’t work.
Any help will be appreciated, thanks.