MariaDB query issue when extracting multiple phone numbers

Hi,

My last question was not properly formatted and lacking sample data, so I am asking anew instead of editing it and creating confusion, if this is ok.

I’m working on a database table, hosted on MariaDB 10.6.2. It contains poorly formatted phone numbers and my goal is to extract those phone numbers. The extracted phone numbers should be 10-digit numbers and there should be no leading 1’s.

Table structure:

CREATE TABLE sample_testx (
  id int(11) NOT NULL,
  office_phone text DEFAULT NULL,
  phone_number text DEFAULT NULL,
  a_phone_number text DEFAULT NULL,
  office text DEFAULT NULL,
  good_phones varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Sample data:

INSERT INTO sample_testx (id, office_phone, phone_number, a_phone_number, office, good_phones) VALUES
(1, '301', '7725', '1(301)775', '1(301) 258-7757 EXT127\"', NULL),
(2, ' \"1(301) 254-8721\"', NULL, 'Direct:5152051266', '(205) 661-0811|(187) 737-19353', NULL),
(3, '650.3585100Ext.128', '1497005', '813-832-3985', '(888) 822-7434', NULL),
(4, '772-778-2131 x07/06', '248.557-1200 Ext', '407¿5901433', '<strong>Office:<-strong>', NULL),
(5, '954%560%4373', '727=-303-0889', '&nbsp', '$4$-415-3575', NULL),
(6, '100090', NULL, ' \"1(301) 254-8721', NULL, NULL),
(7, '', NULL, 'error-404', '330-2225544a', NULL),
(8, '+1,324, 439,9114', '22290044-33', ' ', '...', NULL),
(9, '2,563,310,074', '20,575,566', '(334)794-0328 E', '205-655-8877 x2', NULL),
(10, '251-981-7470 x7487', '205-326-3000 x414-09', '[]', NULL, NULL),
(11, '(877) 890-0050|(256) 373-3450', '(256) 882-2223|(866) 218-0196|(256) 990-7653', NULL, '333 ext', NULL),
(12, '(256) 302-7611, (205) 661-0811, (256) 302-7611, (877) 371-9353, (877) 371-9353', '1(907) 252-2743\", \"1(907) 252-2743\", \"1(907) 252-2743', NULL, '555-444(3322)', NULL),
(13, 'Cellphone:515250', '1(650) 291-0470\", \"1(650) 291-0470\", \"1(DRE) :01-7482', '813-832-3985', '813-839-3463 x#', NULL),
(14, '989-686-3400 x#133', '248.557-1200 Ext', '305*335-0950', '561-859+5692', NULL),
(15, '573000000000', '239-472-5147`', '205.391-9903', '555-400-2030-6009008020-3338887744', NULL);

I want to extract phone numbers from the four columns office_phone, phone_number, a_phone_number, office and combine them comma separated into good_phones column. Some entries have non-numeric chars and leading 1’s. Also, some columns have multiple phone numbers to be extracted.

Here is my query so far:

UPDATE sample_testx AS s
JOIN (
	WITH expanded_columns AS ( -- extract phones from multiple columns, preserve column order
		SELECT id, office_phone AS phone_data, 1 AS col_num FROM sample_testx WHERE office_phone IS NOT NULL
		UNION ALL
		SELECT id, phone_number, 2 FROM sample_testx WHERE phone_number IS NOT NULL
		UNION ALL
		SELECT id, a_phone_number, 3 FROM sample_testx WHERE a_phone_number IS NOT NULL
		UNION ALL
		SELECT id, office, 4 FROM sample_testx WHERE office IS NOT NULL
	)
	, nn(part_num) AS (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) -- up to 10 entries per column
	, data_parts AS (
		SELECT *, SUBSTRING(dataN, (part_num - 1) * 10 + 1, 10) part
		FROM (
			SELECT id, col_num, phone_data
			, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(phone_data, '[(+.#]1', ''), '1[(+.#]', ''), '[^0-9]', '') dataN -- remove leading 1, remove all non-numeric chars
				FROM expanded_columns
			) a
			LEFT JOIN nn ON part_num <= (length(dataN) + 9) / 10
		)
		SELECT id
		, REPLACE(
			GROUP_CONCAT(DISTINCT CASE WHEN LENGTH(part) = 10 THEN part END ORDER BY col_num, part_num SEPARATOR ',') -- combine unique 10-digit phones with ,
		,',,',',') extracted_phones
		FROM data_parts
		GROUP BY id
		ORDER BY id
	) AS extracted
ON s.id = extracted.id
SET s.good_phones = extracted.extracted_phones;

This query checks the leading 1 in a manually specified number of scenarios in this regex line:

REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(phone_data, '[(+.#]1', ''), '1[(+.#]', ''), '[^0-9]', '')

which are (+.# but it needs to work for any char before or after the leading 1, which I failed to achieve. I tried a “remove all non-numeric chars regex” but it didn’t work.

Any help will be appreciated, thanks.

What if your 10 digits starts with a 1?

If the phone number in the database is 111 111 1111, should you get a blank?

(My anticipation is that you’re going to tell me you’re using US phone numbers under the NANP, which should be 10 digits long and do not start with a 1. My response is that 101 numbers are used for long distance carrier exchanges.)

What should you get if they haven’t given 10 digits at all? (234-5678)
What should you get if they have given an international phone number? (+52 299 999 9999 , calling someone in Mexico)

Extreme example: +680 333 2912. Reduced form 6803332912
I’m either calling a number in the Northern Mariana Islands (If dialing the number regularly inside the NANP zone), or a number in Timor-Leste (If dialing the number internationally).

They are US-only numbers yes. 10 digit numbers should start with 2-9. No number should start with 1 or 0.

If a cell has less than 10-digits, it should return NULL.

In the following table, good_phones column show the desired outputs for the given sample data.

INSERT INTO `sample_testx` (`id`, `office_phone`, `phone_number`, `a_phone_number`, `office`, `good_phones`) VALUES
(1, '301', '7725', '1(301)775', '1(301) 258-7757 EXT127"', '3012587757'),
(2, '"1(301) 254-8721"', NULL, 'Direct:5152051266', '(205) 661-0811|(187) 737-19353', '3012548721,5152051266,2056610811,8773719353'),
(3, '650.3585100Ext.128', '1497005', '813-832-3985', '(888) 822-7434', '6503585100,8138323985,8888227434'),
(4, '772-778-2131 x07/06', '248.557-1200 Ext', '407¿5901433', '<strong>Office:<-strong>', '7727782131,2485571200,4075901433'),
(5, '954%560%4373', '727=-303-0889', '&nbsp', '$4$-415-3575', '9545604373,7273030889'),
(6, '100090', NULL, '"1(301) 254-8721', NULL, '3012548721'),
(7, '', NULL, 'error-404', '330-2225544a', '3302225544'),
(8, '+1,324, 439,9114', '22290044-33', ' ', '...', '3244399114,2229004433'),
(9, '2,563,310,074', '20,575,566', '(334)794-0328 E', '205-655-8877 x2', '2563310074,3347940328,2056558877'),
(10, '251-981-7470 x7487', '205-326-3000 x414-09', '[]', NULL, '2519817470,2053263000'),
(11, '(877) 890-0050|(256) 373-3450', '(256) 882-2223|(866) 218-0196|(256) 990-7653', NULL, '333 ext', '8778900050,2563733450,2568822223,8662180196,2569907653'),
(12, '(256) 302-7611, (205) 661-0811, (256) 302-7611, (877) 371-9353, (877) 371-9353', '1(907) 252-2743","1(907) 252-2743","1(907) 252-2743', NULL, '555-444(3322)', '2563027611,2056610811,8773719353,9072522743,5554443322'),
(13, 'Cellphone:515250', '1(650) 291-0470","1(650) 291-0470","1(DRE) :01-7482', '813-832-3985', '813-839-3463 x#', '6502910470,8138323985,8138393463'),
(14, '989-686-3400 x#133', '248.557-1200 Ext', '305*335-0950', '561-859+5692', '9896863400,2485571200,3053350950,5618595692'),
(15, '573000000000', '239-472-5147`', '205.391-9903', '555-400-2030-6009008020-3338887744', '5730000000,2394725147,2053919903,5554002030,6009008020,3338887744');

This should hint at you that your regex can be refined.

Rather than finding a 10-or-11 digit number and then stripping leading 1s, find a 10 digit number that does not start with a 0 ot 1 to begin with.

([2-9][0-9]{9})