SitePoint Sponsor

User Tag List

Results 1 to 18 of 18
  1. #1
    SitePoint Evangelist lance_vincent's Avatar
    Join Date
    Aug 2004
    Location
    philippines
    Posts
    574
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Unhappy Character Encoding/Special Characters

    HI everyone, I'm hoping that some of the guys here can shed a light to the problem.

    Were planning a project, and when we were discussing the features, an issue came up. See, our data will come from aggregators who collects them from various people who supplies them. Some of their data might contain special characters like ╚ or î or ┘, or something arabic, or anything other than the standard keyboard has to offer.

    Part of the project will heavily rely on search, and there the problem arises. If someone spelled the string "╚spoir" or "Espoir", how will we deal with that? What do you suggest we save in the database? what do you suggest we do with the string from the searchbox when the search is called?

    anyone here experienced something similar? will really appreciate all your ideas.
    If you won't dress like the
    Victoria Secret girls,
    don't expect us to act like soap opera guys.

  2. #2
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,159
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I would recommend using UTF-8 throughout, if at all possible. You may want to read The Definitive Guide to Web Character Encoding if you haven't already done so.
    Birnam wood is come to Dunsinane

  3. #3
    SitePoint Evangelist lance_vincent's Avatar
    Join Date
    Aug 2004
    Location
    philippines
    Posts
    574
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Unhappy

    Thanks AutisticCuckoo!

    ok i will give it a read , but for the meantime, what is your opinio on the following scenario.

    we have data which comes from aggregators, sample, a title written as "╔spoir" which is stored nicely on my utf-8 mysql database. now casual users of the site who is not French, will, by default, type in "Espoir" because we cant expect all of them to produce ╔. I havent tried this, but will this yield a result on my query?
    If you won't dress like the
    Victoria Secret girls,
    don't expect us to act like soap opera guys.

  4. #4
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by lance_vincent View Post
    we have data which comes from aggregators, sample, a title written as "Éspoir" which is stored nicely on my utf-8 mysql database. now casual users of the site who is not French, will, by default, type in "Espoir" because we cant expect all of them to produce É. I havent tried this, but will this yield a result on my query?
    Which search-engine are you planning to use? When indexing words, you would probably want to transcribe both é and e into e. I'm not sure, but I suppose you could use the indexers stemming library for this, if it has one built in.

  5. #5
    SitePoint Evangelist lance_vincent's Avatar
    Join Date
    Aug 2004
    Location
    philippines
    Posts
    574
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Unhappy

    Quote Originally Posted by kyberfabrikken View Post
    Which search-engine are you planning to use? When indexing words, you would probably want to transcribe both Ú and e into e. I'm not sure, but I suppose you could use the indexers stemming library for this, if it has one built in.
    thanks for the reply kyberfabrikken!

    sorry i didn't make it all clear eariler. im not really worried about the search engine here, but since we also plan to use seo on the project, then we'll have to consider it from this point.

    back to the sample problem. Once we have the data in our own database, we will have to create a search facility on our site, which will match user input into our database, search-engine aside. So if my database contains a title "Úspoir", and the user entered "espoir", in theory my query will return 0, right?

    any suggestions on tackling these kinds of problems?
    If you won't dress like the
    Victoria Secret girls,
    don't expect us to act like soap opera guys.

  6. #6
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,159
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You should index everything twice: once as aggregated (éspoir) and once stripped of all diacritical marks (espoir). If an exact search doesn't match, strip the search terms as well and match against the stripped index.

    This will give some false positives for certain languages, but that's better than forcing everyone to type exact search terms. In Swedish, for instance, 'å', 'ä' and 'ö' are separate letters, not merely accented versions of 'a' and 'o'. If you search for 'väg' (road) you don't really want matches for 'vag' (vague). A Swede in Sweden wouldn't type 'vag' as the search word if he was looking for 'väg', but a Swedish-speaking person outside of Sweden might, if their keyboard didn't have a key for 'ä'.
    Birnam wood is come to Dunsinane

  7. #7
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by lance_vincent View Post
    sorry i didn't make it all clear eariler. im not really worried about the search engine here, but since we also plan to use seo on the project, then we'll have to consider it from this point.
    By search-engine, I didn't mean eg. Google. I meant the library that you use to index/query your data. Such as Lucene or Sphinx or even a hand-rolled one. You surely don't want to use MySql's built-in full-text engine for any kind of serious search.

  8. #8
    SitePoint Evangelist lance_vincent's Avatar
    Join Date
    Aug 2004
    Location
    philippines
    Posts
    574
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Unhappy

    Thank you both for replying. ill be giving more detail about the situation and the solution i was able to come up with, please guide me when you think I'm not on the idea path.

    Quote Originally Posted by AutisticCuckoo View Post
    You should index everything twice: once as aggregated (éspoir) and once stripped of all diacritical marks (espoir). If an exact search doesn't match, strip the search terms as well and match against the stripped index.

    This will give some false positives for certain languages, but that's better than forcing everyone to type exact search terms. In Swedish, for instance, 'å', 'ä' and 'ö' are separate letters, not merely accented versions of 'a' and 'o'. If you search for 'väg' (road) you don't really want matches for 'vag' (vague). A Swede in Sweden wouldn't type 'vag' as the search word if he was looking for 'väg', but a Swedish-speaking person outside of Sweden might, if their keyboard didn't have a key for 'ä'.
    Heres what weve planned so far:

    The data that we will be dealing with is a block of address, and we already have an idea what different city names are spelled. We will then take into consideration that aggregators will be supplying us with special characters in addresses.

    now, the project were planning will offer 2 ways to search, not making use of any full-text 3rd party search, but plain mysql search. the reason (here goes, hope i don't sound stupid) I'm opting not to use them is that I'm not going to match a string to a lengthy blog-type field, just short addresses.

    first option is the "country + textbox" where they pick a country from a selectbox then type a particular keyword for the address. This seems simple and will work just fine, but, as AutisticCuckoo mentioned, people outside Sweden cant type in 'ä' if they are not using a keyboard which supports it. so we have option 2,

    second option is a drill down of select boxes of place-names, in the following fashion

    countryname -> region/province/state -> city/town -> district

    (and surprisingly, we are covering the whole world, imagine that!)

    anyway, i think kyero.com model which i found is the best sample. you choose what country, pop comes in the 2nd selectbox with all regions, click a region then pop goes all cities under it, and boom, all districts under it.

    so if one chose 'Holmsjö', my sql should do something like 'Holmsjö' OR 'Holmsjo', so i can try to cover all possible spellings.

    Sorry guys, that was a bit lenghty, but if you think i mentioned something wrong, or if you have a better idea, please do comment.

    to autiosticCuckoo, the idea of indexing the same data twice, i think will make the database bigger. and, sample i have 2 fields for spelling1 and spelling2, my sql will be:

    Code:
    spelling1 = 'Holmsjö' OR spelling2 = 'Holmsjö'
    but if i could create a function to check the string for special char and then replace them and just do something like

    Code:
    spelling1 = 'Holmsjö' OR spelling1 = 'Holmsjo'
    i think it will be easier? no? this is driving me crazy!
    Last edited by lance_vincent; Sep 3, 2008 at 21:09. Reason: put [code] tag for better readability
    If you won't dress like the
    Victoria Secret girls,
    don't expect us to act like soap opera guys.

  9. #9
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,159
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Replacing the characters at run-time for each indexed value will save some database space but increase the server load. Depending on how busy the system will be, that might be acceptable. Unless you have hundreds of millions of addresses, using two columns in your table might be better, especially if you expect lots of simultaneous search requests.
    Birnam wood is come to Dunsinane

  10. #10
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    As Tommy already said, disk space is usually much cheaper than processing.

    You strategy will work if data is reasonably small and traffic is light. The problem is that the database needs to scan through all the data on each search, which is very inefficient. Even if you don't want to mess around with a full search engine, you could simply roll your own. As long as you don't need a lot of advanced features, it's quite simple really. Here's a rough sketch to get you started:

    Code sql:
    -- table that contains your data
    CREATE TABLE addresses (
      address_id serial,
      street_name VARCHAR(255)
    );
    -- index of words
    CREATE TABLE addresses_index (
      index_id serial,
      word CHAR(100),
      INDEX (word)
    );
    -- relation between index and records
    CREATE TABLE addresses_rel_index (
      index_id BIGINT,
      address_id BIGINT,
      weight INT DEFAULT 10,
      PRIMARY KEY (index_id, address_id),
      FOREIGN KEY (index_id) REFERENCES addresses_index (index_id),
      FOREIGN KEY (address_id) REFERENCES addresses (address_id)
    );
    Code sql:
    -- test data
    INSERT INTO addresses VALUES (1, "Holmsj÷ 34") ;
    INSERT INTO addresses VALUES (2, "Pjort 12") ;
    INSERT INTO addresses VALUES (3, "Lille Holmsj÷ 8") ;
     
    -- test index data
    INSERT INTO addresses_index VALUES (1, "holmsj÷") ;
    INSERT INTO addresses_index VALUES (2, "holmsjo") ;
    INSERT INTO addresses_index VALUES (3, "pjort") ;
    INSERT INTO addresses_index VALUES (4, "lille") ;
    INSERT INTO addresses_rel_index VALUES (1, 1, 10) ;
    INSERT INTO addresses_rel_index VALUES (2, 1, 5) ;
    INSERT INTO addresses_rel_index VALUES (3, 2, 10) ;
    INSERT INTO addresses_rel_index VALUES (1, 3, 10) ;
    INSERT INTO addresses_rel_index VALUES (2, 3, 5) ;
    INSERT INTO addresses_rel_index VALUES (4, 3, 10) ;
    Code sql:
    -- test search
    SELECT addresses.*
    FROM addresses_index
    JOIN addresses_rel_index
    ON addresses_index.index_id = addresses_rel_index.index_id
    JOIN addresses
    ON addresses_rel_index.address_id = addresses.address_id
    WHERE addresses_index.word IN ("lille", "holmsj÷", "holmsjo")
    GROUP BY addresses.address_id
    ORDER BY SUM(addresses_rel_index.weight) DESC ;

  11. #11
    SitePoint Evangelist lance_vincent's Avatar
    Join Date
    Aug 2004
    Location
    philippines
    Posts
    574
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by AutisticCuckoo View Post
    Replacing the characters at run-time for each indexed value will save some database space but increase the server load. Depending on how busy the system will be, that might be acceptable. Unless you have hundreds of millions of addresses, using two columns in your table might be better, especially if you expect lots of simultaneous search requests.
    actually, some place names, like some cities have 2 names. so what i have in mind is like

    Code:
    standard_spelling            |   alternate_spelling
    Skavkulla och Skillingenäs   |   Skavkulla, Skillingenäs, Skillingenas
    so in actualitly, the 2nd field is possible to contain more 1 or more alternate name, so my approach is to use %% on the string when i search. geeezzz everything gets complicated the more i think about it.

    The negative point in having 2 fields, is that we have to scan all our list, find special chars, and input the proper alternate names, which is tedious work.

    Compared to having your local script, like php, to check the string first then replace, then query. But my approach would need to list all special chars in a function in order to replace them when needed, which is what you meant by strain on the processor, ouch!

    hmm, i still have to weigh things before i decide. thanks for the suggestion. i still haven't made my mind, ill think about this when i get home.
    If you won't dress like the
    Victoria Secret girls,
    don't expect us to act like soap opera guys.

  12. #12
    SitePoint Evangelist lance_vincent's Avatar
    Join Date
    Aug 2004
    Location
    philippines
    Posts
    574
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Unhappy

    Quote Originally Posted by kyberfabrikken View Post
    As Tommy already said, disk space is usually much cheaper than processing.

    You strategy will work if data is reasonably small and traffic is light. The problem is that the database needs to scan through all the data on each search, which is very inefficient. Even if you don't want to mess around with a full search engine, you could simply roll your own. As long as you don't need a lot of advanced features, it's quite simple really. Here's a rough sketch to get you started:
    Thanks kyberfabrikken for going that much to provide me with a suggestion! really appreciate it, though i must admit, im not familiar with the later part of the second sql insert, plus the sql, which involves 3 tables to join? i have an experience using joins way back then, and it was taking too long for a normal search compared to using the basic join table.id = table2.id. Maybe it was my poor sql statement to blame.

    for the number of data/records, it will be a good deal, actually this is a property listing site in plan, on international scale. so if its successful, you can imagine all data on rightmove, plus american properties, plus international properties listed! so pretty much its going to hold a great deal of records!

    maybe i need to review this basic indexing before i make any final decision. but as i stand, im not expert on this, and i cant approach problems concerning indexing in the future if the need may arise as im inexperience. i cant always rely on forums when the site is running when it encounters a problem, thats why im hesitating a bit.
    If you won't dress like the
    Victoria Secret girls,
    don't expect us to act like soap opera guys.

  13. #13
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by lance_vincent View Post
    ... though i must admit, im not familiar with the later part of the second sql insert, plus the sql, which involves 3 tables to join? i have an experience using joins way back then, and it was taking too long for a normal search compared to using the basic join table.id = table2.id. Maybe it was my poor sql statement to blame.
    Sorry to break it to you, but that is pretty basic database knowledge. If you're building a large site, you better get up to speed on such things. The database is most likely going to be the bottleneck in your application, so knowing how to optimise it is rather essential.

  14. #14
    SitePoint Evangelist lance_vincent's Avatar
    Join Date
    Aug 2004
    Location
    philippines
    Posts
    574
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Smile

    Quote Originally Posted by kyberfabrikken View Post
    Sorry to break it to you, but that is pretty basic database knowledge. If you're building a large site, you better get up to speed on such things. The database is most likely going to be the bottleneck in your application, so knowing how to optimise it is rather essential.
    When I started programming, that was the time i tried the Joins, I passed my work to my senior database programmer, and instructed me to change it to the basic table1.id=table2.id when the performance increased, thats why i dont use joins too frequently.

    anyway, i took home a print out of your suggested database structure and sample data, and plot them on paper, talk about unpaid-overtime. its a good idea really, so it needs to index all possible words and have some sort of weighting to determine what goes first. meaning, everytime I insert a new record, i check the address field and index new words in it, am I on the right track? So this is what you meant by not using a search-engine/full-text/lucene/sphinx? can i just ask why didnt you suggest making use of mysql's bulit in fulltext?

    if my database structure is something like

    record_id
    country_id
    address_text
    price
    etc

    cant i just index the field address_text?

    thanks for sharing what you know, i don't mind feeling incompetent with your level so long as i learn something new! everyone needs to start being noob at everything, hehehe. i appreciate your suggestions very much!
    If you won't dress like the
    Victoria Secret girls,
    don't expect us to act like soap opera guys.

  15. #15
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by lance_vincent View Post
    When I started programming, that was the time i tried the Joins, I passed my work to my senior database programmer, and instructed me to change it to the basic table1.id=table2.id when the performance increased, thats why i dont use joins too frequently.
    Well, databases are complex. What is smart in one setting, may be hopeless in another. Just avoiding joins is not a sane strategy though. You need to figure out when they are good and when they are bad. The EXPLAIN command comes in handy here.

    Quote Originally Posted by lance_vincent View Post
    ... so it needs to index all possible words and have some sort of weighting to determine what goes first. meaning, everytime I insert a new record, i check the address field and index new words in it, am I on the right track?
    Yes, that's the idea.

    Quote Originally Posted by lance_vincent View Post
    can i just ask why didnt you suggest making use of mysql's bulit in fulltext?
    Because it's notoriously slow/inefficient. Now admittedly it's been a while since I mocked around with it, and it may have improved since. It may be a good idea to run some benchmarks with a realistic amount of data/load, to determine which solution is better.

  16. #16
    SitePoint Evangelist lance_vincent's Avatar
    Join Date
    Aug 2004
    Location
    philippines
    Posts
    574
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    hmmm, thanks for the comments. i edited your sql to the one below cause the one you posted returned a minor error.

    Code:
    SELECT addresses. * , sum( addresses_rel_index.weight ) AS sum
    FROM addresses_index
    JOIN addresses_rel_index ON addresses_index.index_id = addresses_rel_index.index_id
    JOIN addresses ON addresses_rel_index.address_id = addresses.address_id
    WHERE addresses_index.word
    IN (
    "lille", "holmsj÷", "holmsjo"
    )
    GROUP BY addresses.address_id
    ORDER BY sum DESC
    the one you gave me returned an error about invalid use of GROUP func, so i decided to edit it to see how it will work.

    again thanks for the comments, i am now leaning to go for your suggestion, though a bit messy when indexing, if its faster when its up and running then i think its worth it. i will give mysql a read if the speed has improved though. ill be printing out lots of document for the weekend. this thread is the first thing ill check back on monday!
    If you won't dress like the
    Victoria Secret girls,
    don't expect us to act like soap opera guys.

  17. #17
    SitePoint Evangelist lance_vincent's Avatar
    Join Date
    Aug 2004
    Location
    philippines
    Posts
    574
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    hmmm, thanks for the comments. i edited your sql to the one below cause the one you posted returned a minor error.

    Code:
    SELECT addresses. * , sum( addresses_rel_index.weight ) AS sum
    FROM addresses_index
    JOIN addresses_rel_index ON addresses_index.index_id = addresses_rel_index.index_id
    JOIN addresses ON addresses_rel_index.address_id = addresses.address_id
    WHERE addresses_index.word
    IN (
    "lille", "holmsj÷", "holmsjo"
    )
    GROUP BY addresses.address_id
    ORDER BY sum DESC
    the one you gave me returned an error about invalid use of GROUP func, so i decided to edit it to see how it will work.

    again thanks for the comments, i am now leaning to go for your suggestion, though a bit messy when indexing, if its faster when its up and running then i think its worth it. i will give mysql a read if the speed has improved though. ill be printing out lots of document for the weekend. this thread is the first thing ill check back on monday!
    If you won't dress like the
    Victoria Secret girls,
    don't expect us to act like soap opera guys.

  18. #18
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You might want to post a question to the MySql forum as well. There are some people in there that really know their SQL and may be able to give better advice than I.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •