Results 1 to 19 of 19
  1. #1
    SitePoint Addict
    Join Date
    Sep 2000
    Location
    Ontario, Canada
    Posts
    320
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    normalization vs speed

    I have a couple of tables in one of my databases that contain the same fields, i.e. name, phone, address, etc.
    I'm sure this is far from being "normalized", but I was wondering how far one should take normalization before ending up with tons of joins and possibly increasing query time.

    I've been thinking of making one table for the fields that are consistent across all the tables, but I would have to change all my update, select, and insert queries to do so. Would I gain anything from carrying out this task? There are 16 fields in my tables that could be placed into a new table and joined.

    Any opinions are appreciated.
    Jason Dulberg
    Extreme MTB
    http://extreme.nas.net

  2. #2
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    This should probably be moved to the database forum.

    Anyway, I don't think it's normalisation OR speed; it's normalisation AND speed, or even normalisation = speed.

    Basically, apart from the nightmares you'll face in the future from having the same data appear twice in your database, normalisation helps you keep your database smaller, which will improve performance a lot.

    For web pages, it's rare that you have to join more than three tables, and I'm not sure there's a performance hit in doing so, but you could also look into indexing to help you.

    Basically if you break your data down into "objects", you'll quickly be able to normalise.

    For example if you had a users table, where you're currently storing username, password, group (i.e. permissions), street, postcode, country, you could break this into four "objects".

    Object user: user_id (primary key), username, password, street

    Object postcode: postcode_id (primary key), postcode

    Object country: country_id (primary key), country

    Object group: group_id (primary key), group

    You then have a one-to-many relationship between postcodes and users, and between countries and users, and perhaps a many-to-many relationship between groups and users (for which you'll need a lookup table on user_id and group_id).
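    The decomposition above can be sketched in runnable form. This is a minimal illustration, assuming SQLite (via Python's sqlite3 module) and invented column values; "grp" stands in for the group object because GROUP is a reserved word in SQL.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
    CREATE TABLE country  (country_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE postcode (postcode_id INTEGER PRIMARY KEY, postcode TEXT);
    CREATE TABLE grp      (group_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user     (user_id INTEGER PRIMARY KEY, username TEXT, password TEXT,
                           street TEXT,
                           postcode_id INTEGER REFERENCES postcode(postcode_id),
                           country_id  INTEGER REFERENCES country(country_id));
    -- lookup table for the many-to-many between users and groups
    CREATE TABLE user_grp (user_id INTEGER, group_id INTEGER,
                           PRIMARY KEY (user_id, group_id));
    """)
    cur.execute("INSERT INTO country VALUES (1, 'Switzerland')")
    cur.execute("INSERT INTO postcode VALUES (1, '8000')")
    cur.execute("INSERT INTO grp VALUES (1, 'admins')")
    cur.execute("INSERT INTO user VALUES (1, 'harry', 'secret', 'Main St', 1, 1)")
    cur.execute("INSERT INTO user_grp VALUES (1, 1)")

    # Reassembling a user touches four tables, but each fact is stored once.
    row = cur.execute("""
        SELECT u.username, p.postcode, co.country, g.name
          FROM user u
          JOIN postcode p  ON p.postcode_id = u.postcode_id
          JOIN country  co ON co.country_id = u.country_id
          JOIN user_grp ug ON ug.user_id = u.user_id
          JOIN grp      g  ON g.group_id = ug.group_id
    """).fetchone()
    print(row)  # ('harry', '8000', 'Switzerland', 'admins')
    ```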

  3. #3
    SitePoint Addict
    Join Date
    Sep 2000
    Location
    Ontario, Canada
    Posts
    320
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    d'oh, forgot about that database forum. If the moderators are reading this, please move it to the appropriate forum.

    Perhaps I should clear things up a bit. The 3 different tables will not hold the same information; they just contain the same fields. For instance, I have 3 tables - basketball, football, baseball. Each of those tables contains name, email, etc. fields. These fields will store info on different people... the data will never cross over from one table to the other.

    I totally understand what you mean about future relationships. So, bearing in mind what I mentioned above, would splitting these tables apart make a search faster?

    Thanks again!
    Jason Dulberg
    Extreme MTB
    http://extreme.nas.net

  4. #4
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Originally posted by jdulberg
    For instance, I have 3 tables - basketball, football, baseball. Each of those tables contains name, email etc. fields.
    Interesting - in this case it's probably about the same performance whether you normalise or not. It depends on whether the same name and email appear twice in, say, the basketball table.

    There probably is a small performance hit for joining tables (but small it will be - relational databases should be able to handle this stuff well, without any noticeable performance change). But depending on what a site visitor is searching for, you may sometimes get faster queries.

    For example, if you separate name and email out of the basketball table into a users table, and a site visitor searches for all entries in the basketball table that correspond to a certain name in the users table, your search will be faster, as you now just need to match an integer "user_id" in the basketball table.

    But if all the data in one row of the basketball table only ever appears once, there will probably be no benefit from breaking up the table. In that case, you're probably better off looking into indexes - try this article.
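    That search pattern can be sketched as follows. This is an illustration under assumed names (the tables and data are made up), again using SQLite so it runs as-is: the name is matched once in users, and the basketball table is then probed on its integer user_id, with an index so the probe isn't a table scan.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
    CREATE TABLE users      (user_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE basketball (user_id INTEGER, points INTEGER);
    CREATE INDEX idx_basketball_user ON basketball(user_id);
    """)
    cur.execute("INSERT INTO users VALUES (1, 'Jane', 'jane@example.com')")
    cur.executemany("INSERT INTO basketball VALUES (?, ?)", [(1, 10), (1, 7)])

    # Match the name once, then join on the indexed integer key.
    rows = cur.execute("""
        SELECT b.points
          FROM users u
          JOIN basketball b ON b.user_id = u.user_id
         WHERE u.name = 'Jane'
         ORDER BY b.points
    """).fetchall()
    print(rows)  # [(7,), (10,)]
    ```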

  5. #5
    Hi there! Owen's Avatar
    Join Date
    Jan 2000
    Location
    CA
    Posts
    1,165
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Breaking up the table is convenient IMHO:

    - a person can play basketball and baseball without you having to duplicate all his info.
    - you can easily search by name -- otherwise, to find someone, you've got to search 3 tables.

    So make one table for the basics (name, address, email, sportID (to link to basketball, etc.), etc.) and one per sport for the specific info, e.g. basketball (userID, sportID, baskets, gamesPlayed, etc.).
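    A hedged sketch of that layout (table and column names invented for illustration, run against SQLite): one person's info lives in a single row even when they play two sports, and a name search touches only one table.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
    CREATE TABLE player     (player_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE basketball (player_id INTEGER, baskets INTEGER, games_played INTEGER);
    CREATE TABLE baseball   (player_id INTEGER, at_bats INTEGER);
    """)
    # One row of personal info serves both sports - no duplication.
    cur.execute("INSERT INTO player VALUES (1, 'Jane', 'jane@example.com')")
    cur.execute("INSERT INTO basketball VALUES (1, 42, 10)")
    cur.execute("INSERT INTO baseball VALUES (1, 120)")

    # Finding someone by name means searching one table, not three.
    pid = cur.execute("SELECT player_id FROM player WHERE name = 'Jane'").fetchone()[0]
    sports = [t for t in ("basketball", "baseball")
              if cur.execute(f"SELECT 1 FROM {t} WHERE player_id = ?", (pid,)).fetchone()]
    print(pid, sports)  # 1 ['basketball', 'baseball']
    ```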

    Owen

    BTW> HarryF, don't make a separate table for zip codes... if you have a lot of users then (for the US anyway) you'll have ~99,999 rows, which is just silly... you'll save no space or effort at all. With normalization you've got to use common sense. There's no point in normalizing something like a zip code. Normalizing country, on the other hand, makes sense, since if a country changes its name you'll only have to change one row to update everything.

  6. #6
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Originally posted by Owen
    BTW> HarryF, don't make a separate table for zip codes... if you have a lot of users then (for the US anyway) you'll have ~99,999 rows, which is just silly... you'll save no space or effort at all. With normalization you've got to use common sense. There's no point in normalizing something like a zip code. Normalizing country, on the other hand, makes sense, since if a country changes its name you'll only have to change one row to update everything.
    I guess so, but if you have a separate table for postcodes then, rather than pre-populating it with all 99,999 possibilities, you just add postcodes as users enter their details.

    But it is a good point. For the same reason, I wasn't suggesting making the "street" column a separate table, unless there's a good chance of users living on the same street.

  7. #7
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I would agree with Harry's generalization. Typically it works well due to the smaller, more compact tables (e.g. fixed-length datatypes).

    I would never use MySQL as a good case study for complex queries, since MySQL's optimizer is piss-poor and doesn't really handle anything well.

    The place I've noticed denormalization helps a bit is with large aggregate queries, such as:
    Code:
    SELECT COUNT( someID  ),
             MAX( t2.blah )
      FROM table1      t1 
     INNER JOIN table2 t2 ON t1.ID = t2.ID
     WHERE t1.somecol BETWEEN 23 AND 340000
    Generally this is a 'brute force' operation; the optimizer can't do much except count rows (index or table) and do other operations dominated by physical I/O, which is not good.

    I think putting postcodes in another table is fine. If not, how do you let the user enter the data? By hand in a textbox and simply store the INT value in the database? Are you going to validate it at all? If so, you'll need to store the values somewhere. Plus, it would allow you to do this:
    Code:
    CREATE TABLE postcode(
      postcode INT,
      state    CHAR( 2 )
    )
    
    CREATE TABLE state(
      name      VARCHAR( 100 ),
      stateabbr CHAR( 2 ),
      flower    VARCHAR( 50 ),
      bird      VARCHAR( 50 )
    )
    So you also have your state linkage so that you don't have to ask the user for it (could save space there).

    Back to the query at hand: jdulberg, what exactly are you trying to do? If we know your requirements we can take a more informed, and educated, stab at your problem.

  8. #8
    SitePoint Wizard silver trophy Jeremy W.'s Avatar
    Join Date
    Jun 2001
    Location
    Toronto, Canada
    Posts
    9,123
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    I'm Mr Non-Database-Pro here, but it seems to me that if it were a tossup between normalization and speed, I would choose the former. It allows me to scale my application, change data, and have greater flexibility that wouldn't be possible otherwise.

    Besides, processor and DBMS speeds will only improve over time; not so for my data.
    SVP Marketing, SoCast SRM
    Personal blog: Strategerize
    Twitter: @jeremywright

  9. #9
    SitePoint Addict
    Join Date
    Sep 2000
    Location
    Ontario, Canada
    Posts
    320
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    This has certainly become a great thread to read!

    At this point in time, our system does not cater to players who play different sports, as all players are added by a team coach. I have, however, built in the capacity to switch which sport a coach is associated with, which would force the players to carry over.

    One reason why I was thinking of normalizing some of the player tables is the number of fields involved. The baseball table, for instance, has 70 fields due to the number of stats required for each player position (a pitcher alone has 30 stats). These stats will only be called upon in the detailed display, though, and I am not searching using SELECT *, which would obviously slow down the search.

    Thanks everyone for your ideas!
    Jason Dulberg
    Extreme MTB
    http://extreme.nas.net

  10. #10
    Hi there! Owen's Avatar
    Join Date
    Jan 2000
    Location
    CA
    Posts
    1,165
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Originally posted by MattR

    Code:
    CREATE TABLE postcode(
      postcode INT,
      state    CHAR( 2 )
    )
    
    CREATE TABLE state(
      name      VARCHAR( 100 ),
      stateabbr CHAR( 2 ),
      flower    VARCHAR( 50 ),
      bird      VARCHAR( 50 )
    )
    Well, you still need to ask for the state to get the data the first time, right? (Well, not really true, but asking for the state is a good "reality" check to make sure the user didn't make a typo.)

    Think about it this way. Once you have enough users, you're going to have 10,000 different keys for postalcode. (Now, the code above is better, since the key is the same as the postal code, but it's still barely worth it.) However, postal codes come in groups, so 90000-94000 (or something like that) is California. So it would actually be much more efficient to have a ~50-row table storing ranges than a 10,000-row table. Then, since it's only ~50 rows (edit: that are static and essentially never change, negating the need to store them in a database), you might as well make it into an array and store it in memory to reduce database accesses and speed up your program. Therefore a zip code table is unnecessary.
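    The range idea can be sketched like this; the ranges below are rough illustrations, not real USPS assignments, and the function name is invented:

    ```python
    # ~50 static (lo, hi, state) ranges kept in memory instead of a
    # 10,000-row table; values here are rough illustrations only.
    ZIP_RANGES = [
        (90000, 96199, "CA"),
        (10000, 14999, "NY"),
    ]

    def state_for_zip(zip_code):
        """Linear scan is fine for ~50 rows; bisect would also work."""
        for lo, hi, state in ZIP_RANGES:
            if lo <= zip_code <= hi:
                return state
        return None

    print(state_for_zip(90210))  # CA
    ```

    Since the list never changes at runtime, it costs no database round trip at all.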

    Owen
    Last edited by Owen; Jun 12, 2002 at 20:06.

  11. #11
    Hi there! Owen's Avatar
    Join Date
    Jan 2000
    Location
    CA
    Posts
    1,165
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, you want a balance... if a query takes 20 minutes to run when it could take 1 minute, that isn't any good.

    Owen

    Originally posted by Jeremy W.
    I'm Mr Non-Database-Pro here, but it seems to me that if it were a tossup between normalization and speed, I would choose the former. It allows me to scale my application, change data, and have greater flexibility that wouldn't be possible otherwise.

    Besides, processor and DBMS speeds will only improve over time, not so for my data

  12. #12
    Just Blow It bronze trophy
    DaveMaxwell's Avatar
    Join Date
    Nov 1999
    Location
    Mechanicsburg, PA
    Posts
    7,294
    Mentioned
    123 Post(s)
    Tagged
    1 Thread(s)
    In general, third normal form will increase your speed - if you know how to do it properly and also build your queries properly.

    Take the zip code. If you are accepting an address, you will ALWAYS have one and only one zip code for that address. It doesn't depend on the state, so it doesn't make sense to create a separate table of postal codes. The storage on your address table would be the same (assuming you store the zip code as a numeric), and it would not require a "lookup" table to cross-reference the address and the zip code. Splitting out the zip code would not be 3rd normal form, IMNSHO.

    If you've got a query that runs in one minute unnormalized and 20 minutes normalized, the problem is in the normalization. Either a) it was taken well past third normal form (which never made sense to me), or b) the normalization was not done correctly and should be looked at.
    Dave Maxwell - Manage Your Site Team Leader
    My favorite YouTube Video! | Star Wars, Dr Suess Style
    Learn how to be ready for The Forums' Move to Discourse

  13. #13
    SitePoint Enthusiast
    Join Date
    Apr 2002
    Posts
    58
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I don't understand why this is an issue. You should always normalize to 4NF and only consider backing off if there is a performance issue. In your case, I can't imagine why there would be one. Most databases that reduce the normalization do so because totals are being calculated all the time, not because of addresses in multiple tables.

    However, I do think you should separate general player information from the stats and the sports they play. This means ALL players from ALL sports will be in the same table, and you simply use the player_id to match them up to their stats. This will allow your system to handle players in multiple sports, and it won't hurt performance at all.

  14. #14
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I would avoid making generalizations such as "you should always design to X". Generalizations are rarely true, especially in computer science. I find 4NF to be overly redundant, and the tradeoff between simplicity and performance is much too great when moving from BCNF (where I try to keep all my tables) to 4NF.

    I think BCNF is the quintessential 'best' form to be in. You don't have the excess data storage of 4NF, nor some of the bad things about 3NF (although 3NF is pretty good in its own right).
    Last edited by MattR; Jun 17, 2002 at 14:29.

  15. #15
    SitePoint Enthusiast
    Join Date
    Apr 2002
    Posts
    58
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    All generalizations are bad, right? lol.

    Anyway, 4NF is BCNF with the added requirement that a relation may not contain more than one multivalued dependency. For example, if we look at clients, phones, and emails, we can safely say that a client can have multiple phones and multiple emails. Therefore, a table such as:

    Client-Phone-Email
    Jane-555-jane@home.com
    Jane-556-null
    Tom-557-tom@home.com
    Tom-null-tom2@home.com
    Bill-558-bill@home.com

    is not in 4NF. In order to achieve 4NF in this case, you would have 3 tables: client, client_phone, and client_email. So you would have data that looked like:

    client (client_id, name)
    1-jane
    2-tom
    3-bill

    client_phone (client_phone_id, client_id, phone)
    1-1-555
    2-1-556
    3-2-557
    4-3-558

    client_email (client_email_id, client_id, email)
    1-1-jane@home.com
    2-2-tom@home.com
    3-2-tom2@home.com
    4-3-bill@home.com

    It might seem more complicated, but I don't see how it's redundant, because the data is never repeated (I definitely do not consider xref ids duplication of data). IMO, 4NF is a better design than BCNF because it allows greater flexibility, but to each their own.
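    The three-table layout above, made runnable against SQLite (a sketch; the surrogate ids are kept exactly as in the sample data):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
    CREATE TABLE client       (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE client_phone (client_phone_id INTEGER PRIMARY KEY,
                               client_id INTEGER, phone TEXT);
    CREATE TABLE client_email (client_email_id INTEGER PRIMARY KEY,
                               client_id INTEGER, email TEXT);
    """)
    cur.executescript("""
    INSERT INTO client VALUES (1, 'jane'), (2, 'tom'), (3, 'bill');
    INSERT INTO client_phone VALUES (1, 1, '555'), (2, 1, '556'),
                                    (3, 2, '557'), (4, 3, '558');
    INSERT INTO client_email VALUES (1, 1, 'jane@home.com'),
                                    (2, 2, 'tom@home.com'),
                                    (3, 2, 'tom2@home.com'),
                                    (4, 3, 'bill@home.com');
    """)
    # Tom's two emails come back without any repeated client data.
    emails = [r[0] for r in cur.execute("""
        SELECT e.email FROM client c
          JOIN client_email e ON e.client_id = c.client_id
         WHERE c.name = 'tom' ORDER BY e.client_email_id
    """)]
    print(emails)  # ['tom@home.com', 'tom2@home.com']
    ```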

  16. #16
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, first off those table structures suffer from what I call "Auto_increment Appendment Insanity" whereby everyone appends auto_increments to the fronts of the tables. Client_ID should be a perfectly fine primary key for all of the child tables.

    The reason why I dislike 4NF (which is also 3NF, 2NF, etc. - that's the fun of the normalization levels) is that there is nothing (that I can see) gained by having this many tables if your values are NOT NULL (or very rarely NULL). Perhaps 'redundant' was a bad term; 'not disk-space friendly' is closer.

    You've got the client_id repeated several times for each record, so you will need more disk space to store the same amount of data for an average record. Not to mention the extra space required for indexes (since you'd need an index on client_id), and that could push more things out of memory.

    It does, though, create the ability to easily turn your one-to-one relationship into a one-to-many, in that you can now have many phone numbers for a particular client. If that is not a requirement, then again I think the extra table is not worth it - you also have to make sure you implement cascading deletes and such when you delete from your primary table.

  17. #17
    SitePoint Enthusiast
    Join Date
    Apr 2002
    Posts
    58
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, having a unique identifier for each table may not be the best way to save (cheap) disk space, but it's definitely consistent, and it actually comes in handy in disaster situations and merging. Technically I could use SSN/SIN to represent a unique client, since it saves me about 4kb of space ... I think it's a matter of preference. I've worked in environments that used 2 keys in a xref table as the primary key, and that's fine, but it's not my preference.

    Normalizing to 4NF isn't really a matter of removing nulls; it's a matter of removing multiple multivalued dependencies. As for being disk-space unfriendly: when you are dealing with large databases, a couple of MBs isn't going to make a difference.

    "You've got the client_id repeated several times for each record -- and so you will need more disk space to store the same amount of data for an average record."
    - The problem with this argument is that the entire client table contains about 15 fields, each of which would either be repeated or null for multiple phones/emails. Not to mention the nightmare of maintaining the primary key! Unless you simply said clients can only have 1 email and 1 phone ... which is short-sighted to say the least. As for maintaining an index on a client_id that is not a primary key, I'd say it's best left untouched unless absolutely required.

    The reason I prefer 4NF is the flexibility. Even if the requirement for multiple phones isn't present at the moment, who says it won't be in a month or a year? If you stayed with BCNF, you'd have to change your database design AND thousands of reports/stored procedures.

    I'm not saying BCNF is a horrible choice. It's probably an acceptable level for a lot of environments. But I'll take the level that offers me the fewest headaches when someone says, "Oh, by the way, we need to store multiple phone numbers for each client ......."

  18. #18
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Regarding the auto_increment, I was referring to the client_phone_id and client_email_id columns in the lookup tables. They are redundant and not necessary. However, to make use of the foreign key (client_id) you would place an index on it; otherwise you'd table-scan each time, which is NoGood(TM).

    When you are dealing with large databases, space IS the issue.

    Obviously it depends on your application. Designing more complexity into the situation (e.g. the many-to-many tables) 'just in case' may or may not make sense for your application. If I know there is credible thought that we'll need more than one, I'd probably go ahead and design it that way beforehand. For an email address, sure, it makes sense to allow multiple per user if you use them solely as storage. But in our apps we only need one, to send users emails - and no more. It's not a user area with the ability for users to add infinite email accounts and use it as some sort of contact center, so there's absolutely no business case to be made for splitting it up into several tables. It sounds like you're working on a different application - and that's fine. Again, absolutes are BadThings(TM).

    Again, I'm not saying design a table like this:
    user( email1, email2, etc. )
    But if you know that you don't need more than one, there's no need to split it up into that sort of a table.

    In order to be 'really' 4NF you'd have to break a lot more out of the table - username, for example (if you're using a user_id auto_increment or SSN), password, etc. - typically things which you know are singular. To normalize or not is the question, and it all depends on your application and what you want to accomplish. Designing all your tables to 4NF (or even 5NF) may waste a lot of developer time that could be spent elsewhere.

    NULLs do not take up (much) space - generally 1 byte for a placeholder - so you're talking maybe 15 bytes for the NULLs in your table if they were all empty, vs. something like 4 bytes * 15 (a user_id FK lookup in each child table) + sizeof(data in column), so there is no doubt you will use more space with the 4NF method. If the majority of your rows have these columns NULL, then yes, the 4NF method would save space, since you wouldn't have many child rows. If they are used the majority of the time, then the 4NF method will cost more storage space. Again, you need to know your data (I've said that time and time again) when developing your application.
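    The storage comparison in that paragraph as a back-of-envelope; all sizes are the post's own assumptions (1-byte NULL placeholder, 4-byte integer FK), not measured figures:

    ```python
    # Assumptions from the post: ~1 byte per NULL placeholder, 4-byte integer FK.
    optional_cols = 15

    # Keep the 15 optional columns in the main row, all NULL:
    in_row_null_bytes = optional_cols * 1   # 15 bytes

    # Split each column into its own child table; every populated value
    # then drags along a 4-byte FK, before counting the data itself:
    split_fk_bytes = optional_cols * 4      # 60 bytes + sizeof(data)

    print(in_row_null_bytes, split_fk_bytes)  # 15 60
    ```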

    Contrary to popular belief, server memory (where your data caches live) is NOT cheap. Most hardware which supports more than, say, 4GB of RAM is quite expensive, yet databases of 10, 20, 30GB are quite common nowadays. Obviously you cannot cache all of your indexes, data pages, etc. - but you *can* cache more with BCNF, since it removes the redundancy that exists in earlier normal forms without creating additional overhead.

  19. #19
    SitePoint Enthusiast
    Join Date
    Apr 2002
    Posts
    58
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I've enjoyed the conversation, but I don't want to drag it out forever. This will probably be my last reply.

    You mentioned an indexing problem, but the design actually promotes the use of more powerful clustered indexes. The more the data is separated into tables, the more of these indexes become available.

    The reason why I mentioned NULLs as a problem wasn't because of space, it was because NULLs can greatly reduce database performance.

    Perhaps knowing your data is important, but preparing for change is equally important. Just because someone says the user will NEVER have more than 1 email address doesn't mean it won't change in the future. Personally, I'd rather spend more time on analysis and design than on development. Changing 100+ reports because of table changes that could have been avoided isn't my idea of a fun weekend.

