Large table searching & Performance

Hello all,

I have a question about how to best approach a performance issue I am having. I inherited a database schema that I need to work with. The tables look like this:

tablename_1, MyISAM (1.4 million rows)
tablename_2, MyISAM (1.4 million rows)
tablename_3, MyISAM (1.4 million rows)

…and so on. These tables are dynamically built by an application, and the _1 part of the table name (for reference, let's call it the identifier) is incrementing. The data is not changed once stored. My task is to write an application that will query the different tables based on the identifier. So in my query I will receive the value to use for the identifier (e.g. 1, 2, etc.) and will need to query tablename_1 or tablename_2 accordingly.

I created a procedure with the only solution that came to mind, which is to CONCAT the different pieces of the SQL query, then PREPARE and EXECUTE it. This is proving to be VERY costly in terms of performance. The application I am writing needs to read data from these tables at a minimum of 800 transactions per second.
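Roughly speaking, the procedure looks like this (the procedure and parameter names are simplified for this example):

DELIMITER //

CREATE PROCEDURE get_route (IN p_identifier INT, IN p_signal INT, IN p_network INT)
BEGIN
	-- build the table name from the identifier, then prepare and execute the statement
	SET @sql = CONCAT('SELECT route FROM tablename_', p_identifier,
		' WHERE signal = ', p_signal,
		' AND network = ', p_network);
	PREPARE stmt FROM @sql;
	EXECUTE stmt;
	DEALLOCATE PREPARE stmt;
END//

DELIMITER ;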

Currently, when I hardcode tablename_1 in my SQL query and execute it, I get around 1500 transactions per second. When I use the CONCAT method, it slows down drastically. Is there any way to overcome this obstacle? If the DB schema needs to be changed (which is under consideration), what would be a good way to store around 20 million records? Keep in mind, it would need to be queried at a minimum of 800 transactions per second.

Thanks for your time & all help! Please let me know if any other details are needed. I was trying to keep the post short and to the point.

You are not telling us how many tables there are, nor what data they contain (both content and columns). We don't know how your system decides when it is time to branch off into another table.

In addition, you are not telling us how the queries will need to be structured. Will they join multiple tables, or will they only pull from one table at a time?

Without the answers to the questions above, and perhaps some more details, all we can do is put a finger in the air and guess. As you know, doing that seldom gets you a workable solution.

I have dealt with similar issues in the past due to a bad database schema, but at a much larger scale. In our case we had hundreds of thousands of tables in a database due to a similar “branching system”. What we ended up doing was first writing a script to move these tables across a number of databases. In other words, the script now branches into tables, but also into new databases when a database reaches a limit. However, we did not modify anything other than the SQL class of the application, using a few regexes that decide which database the system should connect to as well as which table.

So unless you have the time available to update the application and fix it at the core, a similar solution to what I mentioned above might be your best bet.

Hey,

Thanks for the reply! Here are some additional details. The whole database contains 30 tables. A table is created each time a file is loaded into the application: the application creates a new table and copies the contents of the file into the database (with some manipulation, etc.). The number of tables is not growing (there is a cap in the application on how many files can be loaded, hence a maximum of 30 tables). The database will only ever have a maximum of 30 tables.

The application I am creating does a simple query, e.g.: SELECT route FROM tablename_1 WHERE signal = @intSignal AND network = @intNetwork

Here is an example of the schema for the tables:

CREATE TABLE tablename_1 (
	id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
	route MEDIUMINT(5) UNSIGNED NOT NULL,
	signal MEDIUMINT(8) UNSIGNED NOT NULL,
	network MEDIUMINT(8) UNSIGNED NOT NULL,
	PRIMARY KEY (id),
	UNIQUE KEY signal (signal,network)
) ENGINE=MYISAM AUTO_INCREMENT=11770900 DEFAULT CHARSET=latin1


I am pushing to have the tables combined into one table. I think I will be able to get above 800 TPS with a simple query on a table with 20 million records (with good indexing). Do you agree?
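To give an idea of what I mean, something along these lines (the combined table name and the identifier column are placeholders I made up):

CREATE TABLE tablename_combined (
	id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
	identifier TINYINT UNSIGNED NOT NULL,  -- which of the (max 30) source files the row came from
	route MEDIUMINT(5) UNSIGNED NOT NULL,
	signal MEDIUMINT(8) UNSIGNED NOT NULL,
	network MEDIUMINT(8) UNSIGNED NOT NULL,
	PRIMARY KEY (id),
	UNIQUE KEY identifier (identifier, signal, network)
) ENGINE=MYISAM DEFAULT CHARSET=latin1

The lookup would then be a single static query, e.g. SELECT route FROM tablename_combined WHERE identifier = @intIdentifier AND signal = @intSignal AND network = @intNetwork, with no dynamic SQL needed.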

I just got back to the office, and I don't have time to reply properly until later tonight.

If you have a chance in the meantime, could you post 3-4 records from the database? I might have a way for you to seriously increase the speed of your queries if the data is what it seems to be.

I am glad to say that after combining all tables and properly indexing the table, I was able to reach 1500 TPS. I have another query that I would like to optimize. I will be creating a thread on that later. I hope to see you there.

Thanks for your help!

You might be able to speed it up even more by doing this.

Add two new columns, “signal_bucket” and “network_bucket”, both TINYINT UNSIGNED NULL (they have to be nullable initially if the table is already populated).

Then add a composite key (index) consisting of signal_bucket and network_bucket. Make certain you create the index with the columns in the same order you plan to use them in the query.
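In DDL that would be something like this (I am writing `table` as a stand-in for whatever you end up calling the combined table, and the index name is just an example):

ALTER TABLE `table`
	ADD COLUMN signal_bucket TINYINT UNSIGNED NULL,
	ADD COLUMN network_bucket TINYINT UNSIGNED NULL,
	ADD INDEX bucket_idx (signal_bucket, network_bucket)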

After they have been added, you run this query to populate the new columns:


UPDATE
	`table`
SET
	signal_bucket = signal >> 16
	, network_bucket = network >> 16

Then finally, when running your SELECT query, you would write it like this instead:


SELECT
	route
FROM
	`table`
WHERE
	signal_bucket = @intSignal >> 16
	AND network_bucket = @intNetwork >> 16
	AND signal = @intSignal
	AND network = @intNetwork

Of course, keep in mind I don't know the format of your data other than that it's digits and that the maximum length of the number is eight digits. So you would need to test this to verify it works. In theory this approach should speed up the query significantly.

huh? it’s a BIGINT UNSIGNED, so the maximum number is 18,446,744,073,709,551,615, i.e. 20 digits, not 8

but i know what you were trying to say… 8 bytes

could you explain how the shift right by 16 bits works, and maybe why you chose 16

You got me confused for a second there, Ruby. When I said a max length of eight digits, I meant for the signal and network columns, which are unsigned MEDIUMINT, which if my memory does not fail me is around 16.7-16.8 million (sorry, I don't remember the exact size).

I just noticed the BIGINT used on the primary key. Considering the maximum number of files the thread author mentioned, the column would be fine as an INT, and that would save a tiny fraction of space as well.
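Something like this would take care of it (again with `table` standing in for the real table name):

ALTER TABLE `table`
	MODIFY id INT UNSIGNED NOT NULL AUTO_INCREMENT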

As for the binary functionality, it's just something I picked up from back when we programmed in assembly in high school. The syntax is not the same, but the concept is.

What I'm doing here is limiting the query using part of the initial number. I've used it in the past with good results, but it will not always improve the speed. In my experience, the larger the table, the larger the speed gain.

The >> operator is a bitwise (binary) operator that means “shift to the right” by the number to the right of it, or in more layman's terms, “divide by two (dropping the remainder)” as many times as that number.

As an example, let's take this decimal number:
678321

Internally, the computer reads it as binary, which gives:
10100101100110110001

If we shift this number to the right sixteen times, we drop the 16 rightmost bits, leaving us with:
1010

Or in decimal:
10

So we are just limiting the search to a smaller portion of the records.
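You can check it directly in MySQL if you want to see it in action:

SELECT 678321 >> 16   -- gives 10, the “bucket” this value falls into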

As for why I picked a shift of 16, there is not really any good reason other than that it collapses the 24-bit MEDIUMINT range down to 256 possible bucket values. It could be an idea to test different shift amounts here: the higher the shift, the more records will end up inside each bucket, and the fewer times you shift, the fewer records each bucket will contain. Though I would only go down, not up. One important thing to note: if you try that, make certain to increase the size of the bucket columns.
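To make that concrete: with a shift of 12 instead of 16, the bucket value of the largest unsigned MEDIUMINT no longer fits in TINYINT UNSIGNED, so the bucket columns would need to be SMALLINT UNSIGNED instead:

SELECT 16777215 >> 16   -- 255, still fits TINYINT UNSIGNED
SELECT 16777215 >> 12   -- 4095, needs SMALLINT UNSIGNED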

Sorry for not being able to explain it better; I certainly do not have a teacher lying dormant in me :)

Let me know if you have any questions.

i’m sorry, but i’m gonna have to disagree with you on your last point, the teacher is definitely awake

:)