Save lowercase username or make case insensitive queries?

llnitoll · March 3, 2013, 7:55pm

Duplicate millions of username to faster search or use LOWER queries? Or in other words how slow is LOWER(username)=LOWER(‘$username’).

I guess use LOWER is the way but I’m concerned about the performance of using LOWER.

I’m using utf8_bin, so my MySQL is case sensitive, I should stop using it, at least for that table?

r937 · March 3, 2013, 9:01pm

if you have an index on that column, it will be ignored because your query applies a function to it

stop using case sensitivity for that column, is my suggestion

alternatively, go through (once) and fix all values in that column to lower case, and then never ever put anything into that column again unless you have first converted to lower case in php, and then your case sensitivie column can use the index

llnitoll · March 3, 2013, 11:55pm

Now I’m doing the “duplicated” way. I start doing it because phpbb forums work like that for usernames, so I think it must be good.

Is just I’m trying to remove all duplicated data I can, but this is a critical column, I can’t have two users with the same name, so the collation I use must be 100% accurate, when searching if a username alredy exist.
I read utf8_unicode_ci has good accuracy, but I’m afraid is not as accurate as utf8_bin is.

r937 · March 4, 2013, 12:24am

i’m sorry, i have no idea what this means, and i don’t know if i can help you any further

llnitoll · March 4, 2013, 3:35am

Perhaps I’m confused, reading about the difference of unicode and general:

Accuracy of sorting:

utf8_unicode_ci is based on the Unicode standard for sorting, and sorts accurately in a wide range of languages.

utf8_general_ci comes very close to correct Unicode sorting in many languages, but has a number of inaccuracies in some languages.

And:

The main difference is sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin which is for comparing characters in binary format.

Perhaps this has nothing to do when searching matches :-/

http://stackoverflow.com/questions/367711/what-is-the-best-collation-to-use-for-mysql-with-php
http://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci

llnitoll · March 4, 2013, 7:55pm

Here is more clear:

It’s MySQL trying to be confusing.

The difference lies in comparisions. If you did
CODE: SELECT ALL
SELECT username FROM users WHERE username < ‘Cheese’;

It might return different results based on the encoding. For example, utf8_unicode_ci changes the german “double-s” character (ß) into “ss” for comparison. But the bin variation uses the strings “as they are” with their binary value.

So a better example would be these two queries:
CODE: SELECT ALL
SELECT username FROM users WHERE username < ‘ß’;
SELECT username FROM users WHERE username < ‘ss’;

On the utf8_unicode_ci, they should give the same result. They should not give the same result in utf8_bin – assuming that your colletion of usernames is large enough.

https://www.phpbb.com/community/viewtopic.php?f=71&t=1089395

I think this does not means that utf8_unicode_ci is less accurate (well it is but it finds all matches that have to) when comparing. Then I read this:

The utf8_bin collation compares strings based purely on their Unicode codepoint values. If all of the codepoints have the same values, then the strings are equal. However, this falls apart when you have strings with different composition for combining marks (composed vs. decomposed) or characters that are canonically equivalent but don’t have the same codepoint value. In some cases, using utf8_bin will result in strings not matching when you expect them to. Theoretically, utf8_bin is the fastest because no Unicode normalisation is applied to the strings, but it may not be what you want.

If I undestand it right utf8_unicode_ci will find even more matches than utf8_bin, with utf8_bin two usernames that look the same, will be different for the bin variation so they can coexist.
I’m gonna do some tests with utf8_unicode_ci, to see if it works correctly, and how slow it is.

Topic		Replies	Views
Ignore case sensitive when searching in mysql db Databases	4	9738	September 23, 2014
MySQL case sensitive Databases	5	949	November 9, 2014
Search Case Sensitive SQL Column PHP sql	6	1188	February 19, 2020
Using 'Lower' in queries vs storing lowercase columns (MSSQL) Databases	3	1555	September 24, 2014
Dealing with the lowercase text for LIKE operator Databases	7	1052	September 14, 2018

Save lowercase username or make case insensitive queries?

Related topics