  1. #1
    SitePoint Enthusiast
    Join Date
    Sep 2004
    Location
    Malmö, Sweden
    Posts
    53

    Where to store object-translation-strings? (OO-Design and I18N-question)

    Hi,

    I have spent the last couple of hours browsing the net and this forum for I18N-related threads. Most of them revolve around the translation classes, whereas my question is an OO-design one. The situation is as follows:

    I have an application which makes use of the PEAR::Translation2 package. Hence I have a table in my db:

    translation_strings(string_id, module_name, en_US, sv_SE, lang_colX)

    which holds most of the strings which I use in the application. The problem is that now I am extending the application with functions where the users will be able to post information (which will be converted to objects).

    Where do I store the translation of the object-properties?

    If the object has a property, e.g. title, do I store it in the object-table or do I make a reference to the translation_strings -table?

    How would you do it?
    I see both advantages and disadvantages with both approaches.

    Thanks in advance,
    //jan
    Jan Bolmeson, M.Sc. Engineering Physics, ZCE
    Join my network @ LinkedIn.com

  2. #2
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Obviously you have multiple options. The simplest IMHO is to extend your primary key to include a 'locale' column.
    Say you have a table :
    Code:
    CREATE TABLE `articles` (
      `uuid` int(11) NOT NULL auto_increment,
      `title` varchar(255) NOT NULL default '',
      `body` text NOT NULL,
      PRIMARY KEY  (`uuid`)
    )
    Alter that into this :
    Code:
    CREATE TABLE `articles` (
      `uuid` int(11) NOT NULL auto_increment,
      `locale` varchar(2) NOT NULL default 'en',
      `title` varchar(255) NOT NULL default '',
      `body` text NOT NULL,
      PRIMARY KEY  (`uuid`,`locale`)
    )

  3. #3
    SitePoint Enthusiast
    Join Date
    Sep 2004
    Location
    Malmö, Sweden
    Posts
    53
    Hmm...

    I didn't really think it could be solved this way, with different tuples for each locale... I guess I was too focused on solving it with a single PK (uuid) instead of extending it to also include the locale.

    I'll have to think about it for a while. For now, thanks.

    //jan

  4. #4
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    You could decide that there will always be a default version ('en' in this case); that way you wouldn't have to alter much of your client code. For example, if you have a gateway with an interface like this:
    PHP Code:
    class ArticleGateway
    {
        function &getByUUID($uuid) {
            // ...
        }
    }

    You just change that into:
    PHP Code:
    class ArticleGateway
    {
        function &getByUUID($uuid, $locale = "en") {
            // ...
        }
    }

    A minor weakness of this design is that you might have some columns that you want to localize and others that you don't. In that case you'll have to accept some redundancy, which your business logic will have to deal with. Not that much of a headache, though.
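
    A minimal sketch of the default-locale idea, assuming rows are keyed by (uuid, locale). The in-memory array stands in for the `articles` table, and falling back to the default locale when a translation is missing is an assumption made for illustration; the thread only says a default version always exists.

```php
<?php
// Hypothetical gateway: $rows is an in-memory stand-in for the `articles`
// table, indexed by uuid and then by locale.
class ArticleGateway
{
    private $rows;
    private $defaultLocale;

    public function __construct(array $rows, $defaultLocale = 'en')
    {
        $this->rows = $rows;
        $this->defaultLocale = $defaultLocale;
    }

    // Callers that pass only $uuid keep working and get the default locale.
    public function getByUUID($uuid, $locale = null)
    {
        if ($locale === null) {
            $locale = $this->defaultLocale;
        }
        if (!isset($this->rows[$uuid])) {
            return null;
        }
        $versions = $this->rows[$uuid];
        if (isset($versions[$locale])) {
            return $versions[$locale];
        }
        // Assumed behaviour: fall back to the default version if the
        // requested translation does not exist.
        return isset($versions[$this->defaultLocale])
            ? $versions[$this->defaultLocale]
            : null;
    }
}

$gateway = new ArticleGateway(array(
    1 => array(
        'en' => array('title' => 'Hello'),
        'sv' => array('title' => 'Hej'),
    ),
));

$en = $gateway->getByUUID(1);        // old call style, default locale
$sv = $gateway->getByUUID(1, 'sv');  // explicit locale
$da = $gateway->getByUUID(1, 'da');  // no Danish row, falls back to 'en'
```

    Existing client code that calls getByUUID() with a single argument is unaffected, which is the point of choosing a default.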

  5. #5
    SitePoint Zealot johno's Avatar
    Join Date
    Sep 2003
    Location
    Bratislava, Slovakia
    Posts
    184
    Quote Originally Posted by kyberfabrikken
    A minor weakness of this design is that you might have some columns that you want to localize and others that you don't. In that case you'll have to accept some redundancy, which your business logic will have to deal with. Not that much of a headache, though.
    I would try it with normalized tables first.

    Here's an example:

    Code:
    articles table - non-locale specific data:
    	id         - primary key
    	author_id  - foreign key to some authors table
    	created    - creation date
    
    article_translations table - locale specific data:
    	article_id   - foreign key to article table
    	language_id  - foreign key to languages table
    	title        - title in specific language
    	body         - body in specific language
    	
    languages table
    	id           - primary key
    	code         - varchar(3) language code e.g. sk/en/cz/hu
    Of course you have to do some easy joins in ArticleGateway but the logic seems to be the same. No data redundancy. If query performance starts to be an issue, THEN start denormalization process.
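
    The join in the gateway could look roughly like the sketch below. It assumes the three tables outlined above; SQLite in memory (via the pdo_sqlite extension) stands in for MySQL so the example is self-contained, and the `fetchArticle` helper and sample rows are hypothetical.

```php
<?php
// Sketch only: in-memory SQLite replaces MySQL; needs the pdo_sqlite extension.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Non-locale-specific data, locale-specific data, and the language lookup.
$db->exec('CREATE TABLE languages (
    id   INTEGER PRIMARY KEY,
    code VARCHAR(3) NOT NULL UNIQUE
)');
$db->exec('CREATE TABLE articles (
    id        INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL,
    created   TEXT NOT NULL
)');
$db->exec('CREATE TABLE article_translations (
    article_id  INTEGER NOT NULL REFERENCES articles(id),
    language_id INTEGER NOT NULL REFERENCES languages(id),
    title       VARCHAR(255) NOT NULL,
    body        TEXT NOT NULL,
    PRIMARY KEY (article_id, language_id)
)');

// Made-up sample data.
$db->exec("INSERT INTO languages (id, code) VALUES (1, 'en'), (2, 'sk')");
$db->exec("INSERT INTO articles (id, author_id, created) VALUES (1, 7, '2005-06-01')");
$db->exec("INSERT INTO article_translations VALUES
    (1, 1, 'Hello', 'English body'),
    (1, 2, 'Ahoj', 'Slovak body')");

// Hypothetical helper: one article in one language, or null if untranslated.
function fetchArticle(PDO $db, $articleId, $languageCode)
{
    $stmt = $db->prepare(
        'SELECT a.id, a.author_id, a.created, t.title, t.body
           FROM articles a
           JOIN article_translations t ON t.article_id = a.id
           JOIN languages l ON l.id = t.language_id
          WHERE a.id = :id AND l.code = :code'
    );
    $stmt->execute(array(':id' => $articleId, ':code' => $languageCode));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}

$sk = fetchArticle($db, 1, 'sk');
$hu = fetchArticle($db, 1, 'hu'); // no Hungarian translation stored
```

    Both joins go through primary or unique keys, so the lookup stays an indexed point query.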
    Annotations support for PHP5
    TC/OPT™ Group Leader

  6. #6
    SitePoint Zealot
    Join Date
    Mar 2004
    Location
    Australia
    Posts
    101
    In addition, if your database supports views and there aren't that many languages, it may be useful to create a view for each locale. That way your original mappers wouldn't require many changes, maybe just swapping the table names for the names of the corresponding localized views.
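
    A rough sketch of the per-locale views, assuming a simplified translations table keyed by a locale string. The view naming scheme (`articles_en`, `articles_sv`) and the sample rows are made up for illustration, and SQLite in memory stands in for a database with view support.

```php
<?php
// Sketch only: in-memory SQLite; needs the pdo_sqlite extension.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE articles (id INTEGER PRIMARY KEY, created TEXT NOT NULL)');
$db->exec('CREATE TABLE article_translations (
    article_id INTEGER NOT NULL REFERENCES articles(id),
    locale     VARCHAR(5) NOT NULL,
    title      VARCHAR(255) NOT NULL,
    PRIMARY KEY (article_id, locale)
)');
$db->exec("INSERT INTO articles (id, created) VALUES (1, '2005-06-01')");
$db->exec("INSERT INTO article_translations VALUES (1, 'en', 'Hello'), (1, 'sv', 'Hej')");

// One view per locale. The locale list is hard-coded here, so interpolating
// it into the SQL is safe; never do this with user input.
foreach (array('en', 'sv') as $locale) {
    $db->exec("CREATE VIEW articles_{$locale} AS
        SELECT a.id, a.created, t.title
          FROM articles a
          JOIN article_translations t ON t.article_id = a.id
         WHERE t.locale = '{$locale}'");
}

// A mapper only has to swap the table name for the view name.
$svTitle = $db->query('SELECT title FROM articles_sv WHERE id = 1')->fetchColumn();
$enTitle = $db->query('SELECT title FROM articles_en WHERE id = 1')->fetchColumn();
```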

  7. #7
    SitePoint Guru 33degrees's Avatar
    Join Date
    May 2005
    Posts
    707
    Quote Originally Posted by johno
    I would try it with normalized tables first.

    Here's an example:

    Code:
    articles table - non-locale specific data:
    	id         - primary key
    	author_id  - foreign key to some authors table
    	created    - creation date
    
    article_translations table - locale specific data:
    	article_id   - foreign key to article table
    	language_id  - foreign key to languages table
    	title        - title in specific language
    	body         - body in specific language
    	
    languages table
    	id           - primary key
    	code         - varchar(3) language code e.g. sk/en/cz/hu
    Of course you have to do some easy joins in ArticleGateway but the logic seems to be the same. No data redundancy. If query performance starts to be an issue, THEN start denormalization process.
    Since the 3 letter language codes are guaranteed to be unique, you can simply use them as the language id, instead of having to join a 3rd languages table to get a record in a given language.

    You can find the list of codes here: http://www.w3.org/WAI/ER/IG/ert/iso639.htm

    You could also add a revision field to the article_translations table, which would allow you to keep multiple versions of an article.

  8. #8
    SitePoint Zealot johno's Avatar
    Join Date
    Sep 2003
    Location
    Bratislava, Slovakia
    Posts
    184
    Quote Originally Posted by 33degrees
    Since the 3 letter language codes are guaranteed to be unique, you can simply use them as the language id, instead of having to join a 3rd languages table to get a record in a given language.
    Great idea.

  9. #9
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote Originally Posted by johno
    Of course you have to do some easy joins in ArticleGateway but the logic seems to be the same. No data redundancy. If query performance starts to be an issue, THEN start denormalization process.
    You're right of course, I was thinking a bit ahead. Having to do a join to select what could previously be done without any joins was raising an alarm with me. That alarm probably shouldn't go off until performance actually suffers.
    If I'm not mistaken, though, MySQL begins to cringe at as few as two or three joins? Then again, madwax might not use MySQL at all.

  10. #10
    SitePoint Enthusiast
    Join Date
    Sep 2004
    Location
    Malmö, Sweden
    Posts
    53
    That was exactly my concern: that I will end up with several joins and suffer performance issues (MySQL 4.0), hence my idea of adding a language column. But I realize, as you point out, that it isn't a good solution. The problem, I think, is to decide where the performance cost is smallest: in doing multiple joins or multiple selects, or in doing the work in the application code.

    I would also like to add an encoding column to the languages table. For the language code, I think it is better to use language-locale identifiers (e.g. en_US, sv_SE; see http://www.mpi-sb.mpg.de/~pesca/locales.html), since you will often have to change both the language and the locale.

    Thanks.
    //madwax

  11. #11
    SitePoint Zealot johno's Avatar
    Join Date
    Sep 2003
    Location
    Bratislava, Slovakia
    Posts
    184
    MADWAX: Are you sure that joins will be a performance bottleneck? With good indexes I don't think so. I will always try to leave database tables normalized. It's just better for maintenance. Performance issues are solved when they occur not before.

  12. #12
    SitePoint Enthusiast
    Join Date
    Sep 2004
    Location
    Malmö, Sweden
    Posts
    53
    Quote Originally Posted by johno
    MADWAX: Are you sure that joins will be a performance bottleneck? With good indexes I don't think so. I will always try to leave database tables normalized. It's just better for maintenance.
    I totally agree with you that normalized tables and relations are what to aim for, from both a maintenance and a redundancy point of view. BUT, not that it applies in this case, as you probably also know, theoretical normalization often differs from how you implement database structures in real life. In most cases it is easier to implement an ID column than a four-column primary key.

    Quote Originally Posted by johno
    Performance issues are solved when they occur not before.
    No no no no no... =o| This is as wrong as the usual solution of throwing more hardware at a performance problem. You have to consider performance problems and optimization issues before they occur, because later it may very well be too late. The time to fix things increases steeply with the complexity (and size) of an application. At least in real-world commercial applications.

    In the situation I am in, I cannot afford to address performance issues when they occur, mainly from a business perspective...

  13. #13
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    You should be able to deal with encoding issues by simply keeping all strings as UTF-8. I'm not totally sure what that implies for performance, but it's kind of the natural match for PHP.

    I'm not sure at all about the performance issues in relation to joins. Some benchmarks might be appropriate?

  14. #14
    SitePoint Zealot johno's Avatar
    Join Date
    Sep 2003
    Location
    Bratislava, Slovakia
    Posts
    184
    Quote Originally Posted by kyberfabrikken
    You're right of course, I was thinking a bit ahead. Having to do a join to select what could previously be done without any joins was raising an alarm with me. That alarm probably shouldn't go off until performance actually suffers.
    If I'm not mistaken, though, MySQL begins to cringe at as few as two or three joins? Then again, madwax might not use MySQL at all.
    Sounds like premature optimization to me.

    I know what you are trying to say, but in this case I just don't see any performance issues that could possibly happen while fetching article data. Joins are really simple and fast if you are using unique and/or primary keys. I'm pretty sure MySQL can handle that.

    Quote Originally Posted by madwax
    No no no no no... =o| This is as wrong as the usual solution of throwing more hardware at a performance problem. You have to consider performance problems and optimization issues before they occur, because later it may very well be too late. The time to fix things increases steeply with the complexity (and size) of an application. At least in real-world commercial applications.

    In the situation I am in, I cannot afford to address performance issues when they occur, mainly from a business perspective...
    Ok. Let's be more practical. Where is the query that will be the problem? Are you doing some complex joining, grouping and/or calculations when fetching locale specific article data? I don't think so.

    PS. Sorry for my english.

  15. #15
    SitePoint Enthusiast
    Join Date
    Sep 2004
    Location
    Malmö, Sweden
    Posts
    53
    Quote Originally Posted by kyberfabrikken
    You should be able to deal with encoding issues by simply keeping all strings as UTF-8. I'm not totally sure what that implies for performance, but it's kind of the natural match for PHP.
    Maybe it is just me, but I have had serious trouble with UTF-8 since the application is mainly non-English (Swedish, with characters such as å (&aring;) and ö (&ouml;)), so instead I use ISO-8859-1. Is it possible to use these characters in UTF-8?

    Quote Originally Posted by kyberfabrikken
    I'm not sure at all about the performance issues in relation to joins. Some benchmarks might be appropriate?
    Would be interesting - I just go by gut-feeling, but I would like to be proven wrong.

    Quote Originally Posted by johno
    Ok. Let's be more practical. Where is the query that will be the problem? Are you doing some complex joining, grouping and/or calculations when fetching locale specific article data? I don't think so.
    Thanks for the help, but I am interested in this issue at an abstract level. The issue is not really an article table but a more complex ISO document-management application where I use several joins, calculations and object instantiations (hence my concern about performance).

    Quote Originally Posted by johno
    PS. Sorry for my english.
    No problem - I think most of us are from non-english origin :)

  16. #16
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote Originally Posted by madwax
    Maybe it is just me, but I have had serious trouble with UTF-8 since the application is mainly non-English (Swedish, with characters such as å (&aring;) and ö (&ouml;)), so instead I use ISO-8859-1. Is it possible to use these characters in UTF-8?
    Sure. UTF-8 is a Unicode encoding: it covers every character in every human language and a little more. That's the nice thing about UTF-8: once you have made the shift and figured out how to make it work, you don't have to worry about encodings anymore, since you can use the same one for all languages.

  17. #17
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Quote Originally Posted by madwax
    Maybe it is just me, but I have had serious trouble with UTF-8 since the application is mainly non-English (Swedish, with characters such as å (&aring;) and ö (&ouml;)), so instead I use ISO-8859-1. Is it possible to use these characters in UTF-8?
    You really cannot work with unicode in php (until version 6). Therefore, if you need only west-european languages to be supported, stick with ISO.

    iso-8859 includes most (west)-european letters.

  18. #18
    SitePoint Enthusiast
    Join Date
    Sep 2004
    Location
    Malmö, Sweden
    Posts
    53
    Quote Originally Posted by kyberfabrikken
    Sure. UTF-8 is a Unicode encoding: it covers every character in every human language and a little more. That's the nice thing about UTF-8: once you have made the shift and figured out how to make it work, you don't have to worry about encodings anymore, since you can use the same one for all languages.
    Now, I am going off-topic but for instance:

    E.g. when I set my phpMyAdmin to use sv-utf-8 I get sentences with strange chars: "...orsaken �r den att...". I guess that this is a config issue, but I have not been able to solve it for some time :s

  19. #19
    SitePoint Zealot johno's Avatar
    Join Date
    Sep 2003
    Location
    Bratislava, Slovakia
    Posts
    184
    Quote Originally Posted by madwax
    Maybe it is just me, but I have had serious trouble with UTF-8 since the application is mainly non-English (Swedish, with characters such as å (&aring;) and ö (&ouml;)), so instead I use ISO-8859-1. Is it possible to use these characters in UTF-8?
    I don't think that should be a problem. I am using UTF-8 for all my websites with Slovak characters like ó, ô, ä, ĺ, ľ, ŕ without any problems. Just make sure that your MySQL connection has the right charset setup.
    Code:
    SET character_set_results=utf8
    SET character_set_connection=utf8
    SET character_set_client=utf8
    Quote Originally Posted by madwax
    Would be interesting - I just go by gut-feeling, but I would like to be proven wrong.
    I have been using MySQL with really complex queries: several non-primary-key joins, grouping with aggregate functions, and logarithm and exponent calculations. A Bayesian filtering query, if you really want to know. Tables with ~10K rows. Performance around 0.2 sec. (The query was scheduled, not run online.) Just to give an example of what MySQL can handle.

    Quote Originally Posted by madwax
    Thanks for the help, but I am interested in this issue at an abstract level. The issue is not really an article table but a more complex ISO document-management application where I use several joins, calculations and object instantiations (hence my concern about performance).
    No problem, man. I just wanted to see some of the complex queries you are talking about.

    From my point of view I just can't see any problematic query right now, so I think it's premature optimization. I probably don't have enough information about it, which is why I'm defending DB normalization so radically.

  20. #20
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote Originally Posted by stereofrog
    You really cannot work with unicode in php (until version 6). Therefore, if you need only west-european languages to be supported, stick with ISO.

    iso-8859 includes most (west)-european letters.
    Charsets continue to confuse me. Would you mind elaborating on that a bit? I thought PHP's internal format was UTF-8?

    There's a lengthy article at the WACT wiki: http://www.phpwact.org/php/i18n/charsets
    and I've read it a dozen times, but still ...

  21. #21
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Well, basically the traditional codepage system is based on the "one symbol = one byte" convention. The interpretation of concrete byte values depends on the current codepage (locale). For example, the byte with numeric value 196 is treated as "A umlaut" (Ä) in a Western codepage such as ISO-8859-1 and as "capital D" (Д) in the Cyrillic codepage Windows-1251. In Unicode every symbol (or "code point") has its own numeric value ("A umlaut" remains 196, and Cyrillic "D" becomes 0x414). How these values are represented (that is, how many bytes each value occupies) depends on the Unicode encoding (aka "transformation format", not to be confused with the locale encoding above), such as UTF-8 or UTF-16.

    PHP is not Unicode-capable; for the Zend engine a "string" is just a sequence of bytes. Unicode support should be there in PHP 6.
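
    The byte-versus-code-point distinction can be made concrete with a short sketch. It assumes the mbstring extension is loaded (and PHP 7.2 or later for mb_ord), and it uses Windows-1251 as the Cyrillic codepage in which byte 196 is the capital D.

```php
<?php
// Requires the mbstring extension; mb_ord() needs PHP 7.2 or later.
$byte = "\xC4"; // a single byte with numeric value 196

// The same byte means different characters depending on the codepage.
$western  = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');   // "Ä"
$cyrillic = mb_convert_encoding($byte, 'UTF-8', 'Windows-1251'); // "Д"

// In Unicode each character has its own code point.
$westernCodePoint  = mb_ord($western, 'UTF-8');  // 196  (U+00C4)
$cyrillicCodePoint = mb_ord($cyrillic, 'UTF-8'); // 1044 (U+0414)

// PHP strings are byte sequences: strlen() counts bytes,
// while mb_strlen() counts code points in the given encoding.
$bytes = strlen($cyrillic);             // 2 (UTF-8 needs two bytes for Д)
$chars = mb_strlen($cyrillic, 'UTF-8'); // 1
```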

  22. #22

  23. #23
    SitePoint Addict timvw's Avatar
    Join Date
    Jan 2005
    Location
    Belgium
    Posts
    354
    Maybe it's important to note that you can handle input and output in UTF-8.
    With the iconv functions and the mbstring extension you can get most of the work done.

    The thing is that PHP doesn't support code in UTF-8. E.g. function get注册页面那个验证码 wouldn't work.

    http://derickrethans.nl/files/php6-unicode.pdf
    http://www.cs.tut.fi/~jkorpela/chars.html

  24. #24
    SitePoint Guru 33degrees's Avatar
    Join Date
    May 2005
    Posts
    707
    Quote Originally Posted by madwax
    Now, I am going off-topic but for instance:

    E.g. when I set my phpMyAdmin to use sv-utf-8 I get sentences with strange chars: "...orsaken �r den att...". I guess that this is a config issue, but I have not been able to solve it for some time :s
    This is because the data in the database isn't in UTF-8. To work effectively in UTF-8, you have to make sure that the character set of both your database and the pages you serve is UTF-8. The former can be done by issuing "SET NAMES utf8" as soon as you get a link (ideally you would set it in the database's config, but that isn't always possible), while the latter can be done with both the HTTP headers and a meta tag. Another thing to keep in mind is to pass the character set whenever you call htmlentities; otherwise the output gets garbled.
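
    A small sketch of the htmlentities point, assuming the input string is already UTF-8. The sample text is made up; the header call is shown only as a comment because it is meaningful only when the script runs under a web server.

```php
<?php
// "är" encoded as UTF-8: the letter ä is the two bytes C3 A4.
$text = "\xC3\xA4r";

// Charset passed explicitly: the two bytes are decoded as one character.
$ok = htmlentities($text, ENT_QUOTES, 'UTF-8');           // "&auml;r"

// Wrong charset: each byte is treated as its own ISO-8859-1 character.
$garbled = htmlentities($text, ENT_QUOTES, 'ISO-8859-1'); // "&Atilde;&curren;r"

// When serving the page, the matching header would be sent before any output:
// header('Content-Type: text/html; charset=utf-8');
```

    The garbled variant is the same kind of mismatch that produces the strange characters quoted above: bytes written in one encoding and interpreted in another.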

    As for converting existing data, you could probably do something like

    UPDATE news SET title = CONVERT(title USING utf8);

    although I haven't tried this so I don't know if it works.

  25. #25
    SitePoint Wizard REMIYA's Avatar
    Join Date
    May 2005
    Posts
    1,351
    For PHP developers who are not acquainted with the Unicode problem, you may find this demo interesting.

    It shows how people who haven't heard about Unicode and about developing international pages may make their webpages inaccessible to a large part of the intended audience.

    I have also met this problem in e-mail messages written in languages other than English. The received e-mail has lost all its encoding, and the unreadable garbage can hardly be called e-mail.

