One way to speed access to a large table is to partition the data. Oracle has partitioning and now so does SQL Server. I’ve worked with Oracle’s version for a while, but this is my first foray into the SQL Server variation.
I’m going to try to keep this as brief as possible, so I’ll skip most of the theory and background and leave that as an exercise for the reader and their friend Google.
In a nutshell, partitioning means subdividing your table so that a query only has to spin through the relevant subset of records, not the entire table.
In practice, both Oracle and SQL Server allow you to divide your table into more than one physical file, with certain ranges of data assigned to each file. This has speed advantages when querying a subset, but it also has management advantages: you can perform tasks like backups and archives on individual files instead of having to test every record in the table against your criteria. If a record lives in a given file, you already know it meets that file’s criteria.
I’d say most people who look to partitioning are doing so because they want to speed up their queries. Fair enough. Will it work? If so, how much speed will be gained? There’s really only one way to find out, right? Marketing materials! No, wait, I mean, Experimentation!
For my experiment I’m starting with a real table that I need faster access to. The table has about 32 million records and 33 fields.
The table has historical data that is broken into six-month ranges. In other words, you will typically only need to access six months’ worth of the data at a time, but it could be any six-month period in the last two years. Without partitioning, the DBMS has to spin through all two years of data for every query. With partitioning it only has to spin through the relevant six-month period.
The create script (with names changed to protect the innocent) is as follows:
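The full script doesn’t reproduce well here, so this is a trimmed sketch. The table name, column names, and types are stand-ins (consistent with the anonymized [g] and testTablePS referenced below); the real table has 33 columns:

```sql
-- Trimmed sketch: the real table has 33 columns; names here are stand-ins.
CREATE TABLE [dbo].[testTable] (
    [a] INT          NOT NULL,
    [b] VARCHAR(50)  NULL,
    -- ...roughly 30 more columns...
    [g] DATETIME     NOT NULL
) ON testTablePS ([g]);
```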
The key is the very last line. testTablePS is the Partition Scheme to use. It takes the column name [g] as a parameter. It feeds this into the Partition Function to determine which partition a given record is in based on its value in the g column (in this case a datetime).
Each partition is in its own file in its own filegroup.
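For reference, the function and scheme behind testTablePS would be created along these lines. The boundary dates, function name, and filegroup names are my own illustrative assumptions, not the real ones:

```sql
-- Illustrative only: boundary dates and filegroup names are assumptions.
-- Four six-month ranges covering the two years of data.
CREATE PARTITION FUNCTION testTablePF (DATETIME)
AS RANGE RIGHT FOR VALUES ('20070701', '20080101', '20080701');

-- Map each range to its own filegroup (one file per filegroup).
CREATE PARTITION SCHEME testTablePS
AS PARTITION testTablePF
TO (fg1, fg2, fg3, fg4);
```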
Now for the speed tests.
To begin with we’ll try WITHOUT indexing. Indexing will most likely improve the results, but I’d like a baseline first. This pretty much narrows things down to just the difference between partitioning and not partitioning.
The query I’ll use is this:
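The original query didn’t survive the trip into this post; with the anonymized names it amounts to a six-month range scan returning every column, something like:

```sql
-- Hypothetical reconstruction: a six-month range on the partitioning column.
SELECT *
FROM [dbo].[testTable]
WHERE [g] >= '20080101'
  AND [g] <  '20080701';
```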
This means we are looking at all the fields each time.
Here are the results (~2.4 million rows returned):
| Query Run Time | Non-partitioned | Partitioned |
|----------------|-----------------|-------------|
| 1st run        | 3:45            | 1:22        |
| 2nd run        | 2:42            | 1:16        |
| 3rd run        | 2:40            | 1:02        |
Predictably the first run is the slowest. Subsequent runs are a bit more consistent.
As you can see, the partitioned table returns about twice as fast as the non-partitioned table.
Now let’s try returning a specific date.
The query will be:
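Again, the query itself is a reconstruction on my part; with the stand-in names, a single-day query on the partitioning column looks something like:

```sql
-- Hypothetical reconstruction: one day's worth of data.
-- Equality works here assuming the datetime values are stored at midnight;
-- otherwise a half-open range (>= day, < next day) would be used.
SELECT *
FROM [dbo].[testTable]
WHERE [g] = '20080315';
```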
Those results are here (~470,000 rows returned):
| Query Run Time | Non-partitioned | Partitioned |
|----------------|-----------------|-------------|
| 1st run        | 1:54            | 0:14        |
| 2nd run        | 1:52            | 0:12        |
| 3rd run        | 1:54            | 0:12        |
Now let’s look at what happens if we take the return and display of the data out of the equation.
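The original query isn’t shown; one simple way to do this (my assumption) is to aggregate instead of returning rows, so the engine still scans the same data but sends almost nothing back to the client:

```sql
-- Scans the same rows, but returns a single value instead of ~470,000 rows.
SELECT COUNT(*)
FROM [dbo].[testTable]
WHERE [g] = '20080315';
```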
The results are here:
| Query Run Time | Non-partitioned | Partitioned |
|----------------|-----------------|-------------|
| 1st run        | 1:41            | 0:06        |
| 2nd run        | 1:43            | 0:00        |
| 3rd run        | 1:43            | 0:00        |
It’s hard to find fault with less than a second…
Now a HUGE caveat… You knew there had to be one, right?
What if you write a query that doesn’t filter on the column the partition is based on? Poor you. You will see the same OR WORSE performance from the partitioned table as from the non-partitioned one. Why? Because now the DBMS has to look through several files to gather up the data instead of just one. This is especially noticeable on big joins between tables.
Partitioning can be a big help, but it is definitely not a silver bullet.
That wraps it up for this post.
Next post I’ll add some indexing and rerun the queries.