SitePoint Sponsor

User Tag List

Results 1 to 24 of 24
  1. #1
    SitePoint Member
    Join Date
    Sep 2005
    Posts
    6
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    ip2country optimize

    Hello!
    i want to install ip2country db on hosting
    and aprox 50 000 queries per day, each 100 bytes

    naturally it will consume shared hosting CPU
    is there a way to optimize this DB to lessen CPU load?
    postgres is better than mysql ?
    any experiences?

    help! need advice

  2. #2
    SitePoint Guru quenting's Avatar
    Join Date
    Dec 2002
    Location
    Switzerland
    Posts
    735
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    hi,
    i am trying to achieve the same goal.
    i currently use the free ip2country database imported in mysql.
    On a daily basis, I run about a million queries against this table.
    Due to the format of the table (each ip block identified by lower and higher ip), one cannot query the table for multiple IPs at a time, neither use indexes (because of the use of < and > in the where clause). I think this also prevents from using in-memory HEAP tables.
    This results in the process taking quite some time even on a speedy machine.

    I'd be curious as to any performance improvement possible as well. The way I see it, better performance could only be achieved by using a non sql database, and use the known formatting of the table and good algorithms to avoid full table scans.
    The largest message boards on the web !
    unblog.fr, hosting 700000 french blogs

  3. #3
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Just an idea, create a table with the first byte of the ip, and foreign key to the possible ranges?

  4. #4
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    try using inet_aton() and inet_ntoa() in your tables. these twon functions convert IP's to and from integers so your range comparisons can successfully use an index.

  5. #5
    SitePoint Guru quenting's Avatar
    Join Date
    Dec 2002
    Location
    Switzerland
    Posts
    735
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    actually, that's the way it is already. I was under the impression that mysql did not use indexes when having a comparison in where clauses, are you sure it does ?
    If yes, what I don't get, is why i get the following:

    Code:
    EXPLAIN SELECT * 
    FROM `iptocountry` 
    WHERE `l` <= 3558292872 AND `h` >= 3558292872  
    
    table  type  possible_keys  key  key_len  ref  rows  Extra  
    iptocountry ALL l,h NULL NULL NULL 16929 Using where
    The table has indexes on both l and h (high and low value of the ip block, and only returns one result (which is normal), yet the explains says that 16000 rows (full table scan) have to be looped through ?
    If it used indexes correcty, shouldn't it be able to find the right row right away ?
    The largest message boards on the web !
    unblog.fr, hosting 700000 french blogs

  6. #6
    SitePoint Guru quenting's Avatar
    Join Date
    Dec 2002
    Location
    Switzerland
    Posts
    735
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    after making a few tests, it seems that the number of rows looked at depend on the number one compares to. Probably is there some sort of optimization algorithm or something.

    Anyway, it doesn't seem optimized enough, because I have written a small php script that searches through a text file, using a simple dichotomy algorithm and I'm having much better results.
    On a 60000+ lines database (from webhosting.info ), a SQL query like the above takes (on my machine) between 0.0006 s (when the number is such that only a few lines are looked through) and 0.02s (when full table scan).
    My dichotomic script returns systematically results in around 0.0003s, having performed 13 to 18 iterations.

    Unless there's something i've missed on the index usage, i'm probably going to switch my method of operation, hopefully dividing at least in two the time taken by my batch.

    Let me know if anyone's interested in reading the script (it's fairly simple) and i'll post it.
    The largest message boards on the web !
    unblog.fr, hosting 700000 french blogs

  7. #7
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Yes, the maxmind geoip php code does something similar I believe.

  8. #8
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    this index might actually get used:
    Code:
    alter table add index `l_h` (`l`, `h`);
    also try this:
    Code:
    SELECT *
    FROM `iptocountry`
    WHERE 3558292872 BETWEEN `l` AND `h`

  9. #9
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Seems MySQL refuses to use the index. Even when forced.

    CREATE TABLE `iptocountry` (
    `l` bigint(20) unsigned NOT NULL default '0',
    `h` bigint(20) unsigned NOT NULL default '0',
    `isoCode` char(2) NOT NULL default '',
    PRIMARY KEY (`l`,`h`)
    ) ENGINE=MyISAM;

    With 37879 rows.

    EXPLAIN SELECT * FROM iptocountry force index(primary) WHERE 3558292872 BETWEEN l AND h

    1, 'SIMPLE', 'iptocountry', 'ALL', '', '', '', '', 37879, 'Using where'


    [Using data from http://www.ip2country.net/download/cr.zip ]

  10. #10
    SitePoint Guru quenting's Avatar
    Join Date
    Dec 2002
    Location
    Switzerland
    Posts
    735
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Ren, don't you experiment that changing the number you compare to produces different results?
    IP tables have a somewhat strange repartition scheme, which may be the reason for disparities.
    This is also the reason why I don't get better results with other algorithms than dichotomy. I've tried the false position and tangent algorithms, which are supposed to converge faster than dichotomy, but the repartition of the IP blocks seems to produce weird results with those methods.

    Anyway, I've modified my script a bit to put the whole file in memory and search php strings/arrays rather than the file, and I'm now getting results around 0.0002s per query (vs 0.0003s searching the file and 0.0006s-to 0.003s with sql).
    Getting closer. Might experiment with a binary version and then i should be done .
    The largest message boards on the web !
    unblog.fr, hosting 700000 french blogs

  11. #11
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by quenting
    Ren, don't you experiment that changing the number you compare to produces different results?
    IP tables have a somewhat strange repartition scheme, which may be the reason for disparities.
    This is also the reason why I don't get better results with other algorithms than dichotomy. I've tried the false position and tangent algorithms, which are supposed to converge faster than dichotomy, but the repartition of the IP blocks seems to produce weird results with those methods.

    Anyway, I've modified my script a bit to put the whole file in memory and search php strings/arrays rather than the file, and I'm now getting results around 0.0002s per query (vs 0.0003s searching the file and 0.0006s-to 0.003s with sql).
    Getting closer. Might experiment with a binary version and then i should be done .
    I've been using the MaxMind geoip, as well that was the first IP to country I've found. Just wondering how both data sets compare for accuracy etc.

    http://cvs.sourceforge.net/viewcvs.p...25&view=markup

    _geoip_seek_country() does the search.

    But as I need to do lookups on batchs of IPs, guessing that SQL version will perform better.

  12. #12
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Code:
    alter table iptocountry add index full (`l`, `h`, iso_code);
    the problem is that mysql reverts to a table scan when it thinks that approxiamtely 30% of the table's rows will be returned. based on the statistics that mysql has about the table, it thinks that you query is going to return 16k rows, which is hlaf of teh 37k in the table. but by adding the iso_code column, it also knows that it can complete the query using only the indexes so it ignores the 30% rule.

  13. #13
    SitePoint Guru quenting's Avatar
    Join Date
    Dec 2002
    Location
    Switzerland
    Posts
    735
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    alter table iptocountry add index full (`l`, `h`, iso_code);
    I've just tried that, but it produces exactly the same result (indexes not used, or only partially). The phenomenon you explain seems likely to be correct, but the solution doesn't work .

    I've been using the MaxMind geoip, as well that was the first IP to country I've found. Just wondering how both data sets compare for accuracy etc.
    I'd be curious to compare as well. I've switched recently to the one provided by webhosting.info, however it's not perfectly accurate (although better than the ip2country i used before). An accurate one (including for the coty code) is the geobytes one, it's the only one i've come across that found correctly the city i live in, but it isn't free, and i'm not sure i need the extra precision for what i'm doing.


    But as I need to do lookups on batchs of IPs, guessing that SQL version will perform better.
    I do need to do lookups on batches of IP as well, but I can't find a way to make sql faster. Due to the conversion of IPs to numbers first, and most importantly due to the double comparison in the where clause of the queries, I haven't been able to find away to query more than one IP at onces. If you have, I'm really curious how you're doing.


    I'd like to try out maxmind and see how it compares to webhosting.info's, but it seems it's not free either. Does anyone know a freely available databases including ip to city correspondance ?
    The largest message boards on the web !
    unblog.fr, hosting 700000 french blogs

  14. #14
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    ah, forgot to mention that the columns have to appear in the order of the index. so your query has to be this:
    Code:
    SELECT iso_code
    FROM `iptocountry`
    WHERE 3558292872 BETWEEN `l` AND `h`

  15. #15
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by quenting
    I've just tried that, but it produces exactly the same result (indexes not used, or only partially). The phenomenon you explain seems likely to be correct, but the solution doesn't work .



    I'd be curious to compare as well. I've switched recently to the one provided by webhosting.info, however it's not perfectly accurate (although better than the ip2country i used before). An accurate one (including for the coty code) is the geobytes one, it's the only one i've come across that found correctly the city i live in, but it isn't free, and i'm not sure i need the extra precision for what i'm doing.

    I do need to do lookups on batches of IP as well, but I can't find a way to make sql faster. Due to the conversion of IPs to numbers first, and most importantly due to the double comparison in the where clause of the queries, I haven't been able to find away to query more than one IP at onces. If you have, I'm really curious how you're doing.

    I'd like to try out maxmind and see how it compares to webhosting.info's, but it seems it's not free either. Does anyone know a freely available databases including ip to city correspondance ?
    Maxmind do a free version of their binary .dat file.. "GeoLite".

    Unfortunately cant get to their website atm to find the download link.

    I'm importing web log files, so each line gets imported, atm I'm doing the country lookup in PHP, as each line gets imported. But plan to remove it from the loop, and have a single INSERT when the raw data has been imported.
    Something along the lines of...

    INSERT table SELECT l.logId, i.countryId FROM logs l INNER JOIN l.clientIp BETWEEN i.l AND i.h

  16. #16
    SitePoint Guru quenting's Avatar
    Join Date
    Dec 2002
    Location
    Switzerland
    Posts
    735
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    ah, forgot to mention that the columns have to appear in the order of the index. so your query has to be this:
    still not working for me...

    Code:
    EXPLAIN SELECT code
    FROM `iptocountry` 
    WHERE 3558292872 
    BETWEEN `l` AND `h`  
    
    table  type  possible_keys  key  key_len  ref  rows  Extra  
    iptocountry index NULL FULL 18 NULL 63370 Using where; Using index
    The largest message boards on the web !
    unblog.fr, hosting 700000 french blogs

  17. #17
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Just had a play with creating a intermediary table, determining country for 1,438 ips.

    SELECT l.logId, l.clientIp, i.isoCode
    FROM tracker.logs l
    INNER JOIN octet o ON l.clientIp >> 24 = o.classA
    INNER JOIN iptocountry i ON o.rangeId = i.rangeId AND l.clientIp BETWEEN i.l AND i.h

    id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
    1, 'SIMPLE', 'l', 'ALL', 'Index_2', '', '', '', 1438, ''
    1, 'SIMPLE', 'o', 'ref', 'PRIMARY', 'PRIMARY', '1', 'func', 1514, 'Using where; Using index'
    1, 'SIMPLE', 'i', 'eq_ref', 'PRIMARY', 'PRIMARY', '2', 'test.o.rangeId', 1, 'Using where'

    Which appears alot (25x ) better than

    SELECT l.logId, l.clientIp, i.isoCode
    FROM tracker.logs l
    INNER JOIN iptocountry i ON l.clientIp BETWEEN i.l AND i.h

    id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
    1, 'SIMPLE', 'l', 'ALL', 'Index_2', '', '', '', 1438, ''
    1, 'SIMPLE', 'i', 'index', '', 'Index_Full', '18', '', 37879, 'Using where; Using index'

    Where Index_Full is an index on (l, h, isoCode)

  18. #18
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by quenting
    still not working for me...
    Code:
    table  type  possible_keys  key  key_len  ref  rows  Extra  
    iptocountry index NULL FULL 18 NULL 63370 Using where; Using index
    huh? look at the end of that explain... it says using index!

  19. #19
    SitePoint Guru quenting's Avatar
    Join Date
    Dec 2002
    Location
    Switzerland
    Posts
    735
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    it says using index!
    well, it says using index, but it still looks through 63370 rows (full table scan) when only one matches the query. My definition of correct index usage is that it only has to look up the records that end up in the result set.
    The largest message boards on the web !
    unblog.fr, hosting 700000 french blogs

  20. #20
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    the 63370 in the explain is NOT an exact number. that is an ESTIMATE of the number of rows that the index will return based on the key distribution. it has nothing to do with the actual number of rows looked at, although some times it just happens to be the same as the number of rows returned from that table, but that is coincidence, not a mathematical or procedural certainty.

  21. #21
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by longneck
    the 63370 in the explain is NOT an exact number. that is an ESTIMATE of the number of rows that the index will return based on the key distribution. it has nothing to do with the actual number of rows looked at, although some times it just happens to be the same as the number of rows returned from that table, but that is coincidence, not a mathematical or procedural certainty.
    Yes, but the performance is still poor even if it is using an index.

    932 rows in set (32.99 sec)

    vs

    932 rows in set (9.16 sec)

    Using the pair of queries post above. Latter timing using the extra table.

  22. #22
    SitePoint Guru quenting's Avatar
    Join Date
    Dec 2002
    Location
    Switzerland
    Posts
    735
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by longneck
    the 63370 in the explain is NOT an exact number. that is an ESTIMATE of the number of rows that the index will return based on the key distribution. it has nothing to do with the actual number of rows looked at, although some times it just happens to be the same as the number of rows returned from that table, but that is coincidence, not a mathematical or procedural certainty.

    Still, if it ends up returning one row, and the estimate was 63370, it means the index doesn't work as it should. And if it estimates to get 63370 in the result set, it will go through a full table scan (because it will think it has to) won't it?
    What's weird is the query times, I now get 0.002s for the same query that takes 0.0006s without the index.
    The largest message boards on the web !
    unblog.fr, hosting 700000 french blogs

  23. #23
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    i'm stumped. if anyone else want's to take a crack at it, i've attached a SQL file that creates a table of random ip addresses. you can download the ip2country sql file from http://www.ip2country.net/download/cr.zip (look for a file called anp_ip2country.sql)

  24. #24
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    i guess i need to actually attach the file....
    Attached Files Attached Files


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •