SitePoint Sponsor

User Tag List

Results 1 to 6 of 6
  1. #1
    SitePoint Member
    Join Date
    Feb 2011
    Posts
    17
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Save lowercase username or make case insensitive queries?

    Duplicate millions of username to faster search or use LOWER queries? Or in other words how slow is LOWER(username)=LOWER('$username').

    I guess use LOWER is the way but I'm concerned about the performance of using LOWER.

    I'm using utf8_bin, so my MySQL is case sensitive, I should stop using it, at least for that table?

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,262
    Mentioned
    60 Post(s)
    Tagged
    3 Thread(s)
    Quote Originally Posted by llnitoll View Post
    Or in other words how slow is LOWER(username)=LOWER('$username').
    if you have an index on that column, it will be ignored because your query applies a function to it

    stop using case sensitivity for that column, is my suggestion

    alternatively, go through (once) and fix all values in that column to lower case, and then never ever put anything into that column again unless you have first converted to lower case in php, and then your case sensitivie column can use the index
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Member
    Join Date
    Feb 2011
    Posts
    17
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Now I'm doing the "duplicated" way. I start doing it because phpbb forums work like that for usernames, so I think it must be good.

    Is just I'm trying to remove all duplicated data I can, but this is a critical column, I can't have two users with the same name, so the collation I use must be 100% accurate, when searching if a username alredy exist.
    I read utf8_unicode_ci has good accuracy, but I'm afraid is not as accurate as utf8_bin is.

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,262
    Mentioned
    60 Post(s)
    Tagged
    3 Thread(s)
    Quote Originally Posted by llnitoll View Post
    I read utf8_unicode_ci has good accuracy, but I'm afraid is not as accurate as utf8_bin is.
    i'm sorry, i have no idea what this means, and i don't know if i can help you any further
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  5. #5
    SitePoint Member
    Join Date
    Feb 2011
    Posts
    17
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Perhaps I'm confused, reading about the difference of unicode and general:

    Accuracy of sorting:

    utf8_unicode_ci is based on the Unicode standard for sorting, and sorts accurately in a wide range of languages.

    utf8_general_ci comes very close to correct Unicode sorting in many languages, but has a number of inaccuracies in some languages.
    And:

    The main difference is sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin which is for comparing characters in binary format.
    Perhaps this has nothing to do when searching matches

    http://stackoverflow.com/questions/3...mysql-with-php
    http://stackoverflow.com/questions/7...tf8-unicode-ci

  6. #6
    SitePoint Member
    Join Date
    Feb 2011
    Posts
    17
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Here is more clear:
    It's MySQL trying to be confusing.

    The difference lies in comparisions. If you did
    CODE: SELECT ALL
    SELECT username FROM users WHERE username < 'Cheese';

    It might return different results based on the encoding. For example, utf8_unicode_ci changes the german "double-s" character () into "ss" for comparison. But the bin variation uses the strings "as they are" with their binary value.

    So a better example would be these two queries:
    CODE: SELECT ALL
    SELECT username FROM users WHERE username < '';
    SELECT username FROM users WHERE username < 'ss';

    On the utf8_unicode_ci, they should give the same result. They should not give the same result in utf8_bin -- assuming that your colletion of usernames is large enough.
    https://www.phpbb.com/community/view...f=71&t=1089395

    I think this does not means that utf8_unicode_ci is less accurate (well it is but it finds all matches that have to) when comparing. Then I read this:

    The utf8_bin collation compares strings based purely on their Unicode codepoint values. If all of the codepoints have the same values, then the strings are equal. However, this falls apart when you have strings with different composition for combining marks (composed vs. decomposed) or characters that are canonically equivalent but don't have the same codepoint value. In some cases, using utf8_bin will result in strings not matching when you expect them to. Theoretically, utf8_bin is the fastest because no Unicode normalisation is applied to the strings, but it may not be what you want.
    http://stackoverflow.com/questions/1...utf-unicode-ci

    If I undestand it right utf8_unicode_ci will find even more matches than utf8_bin, with utf8_bin two usernames that look the same, will be different for the bin variation so they can coexist.
    I'm gonna do some tests with utf8_unicode_ci, to see if it works correctly, and how slow it is.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •