
Managing Large Data Sets in SQL Server 2005 and 2008

By Zach Mided, 2010/01/04



Overview
One of SQL Server's strengths is its ability to operate against multiple rows using just one
statement, but in excess that strength can be problematic. For example, performing an insert of
20 million rows will place numerous page locks and possibly a table lock on that table, which can
prevent others from effectively using any applications that access that table. Alternatively, one
could use cursors to modify one row at a time, but cursors perform poorly and are not a good
solution for processing millions of rows.
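As a side note (this query is my own illustration and not part of the load process described later), you can watch the locks such a statement accumulates from a second connection using the sys.dm_tran_locks view; the session id used below is hypothetical:

--From a second connection, summarize the locks held by the long-running session
--(replace 53 with the real session id, which sp_who2 or Activity Monitor will show)
SELECT resource_type, request_mode, Count(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_session_id = 53 --hypothetical session id
GROUP BY resource_type, request_mode
ORDER BY lock_count DESC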
This article describes techniques that can be used to make operations against large data sets
more manageable. You will be able to stop and start operations at will and throttle them up or
down depending on how heavily you want to use system resources at any given time.

'Tis the Season...


In some situations you can schedule operations against large data sets to occur during off-peak hours,
but what if your system has to be very responsive around-the-clock? Or, what if the operation
takes so long that it can't be completed entirely before peak-hours?
Imagine the following scenario. Your nightly loads of large volumes of data usually complete
before peak-hours start, but because today is the start of the holiday season there is suddenly
twice the amount of data and the process is still running when peak-hours begin. OK, just wait 30-
60 minutes and let it complete...
Twiddle thumbs...
Respond to emails from upset users: "Don't worry. It will be resolved soon."...
Nope. An hour later and it still hasn't completed. How much longer could it possibly take?
Management makes the call - kill the job and we'll load it tomorrow. OK, you kill it. Hmmm...
performance is still bad. What's going on?
Now that huge transaction you just cancelled is rolling itself back! All you can do is wait. The
system could very well grind to a halt for the entire day. How can this be avoided?
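Incidentally, if you ever find yourself waiting on a rollback like this, one small consolation (my own aside, not part of the technique below) is that you can at least check how far along it is:

--Run against a session that has already been killed; reports the estimated
--rollback completion percentage and time remaining (53 is a hypothetical session id)
KILL 53 WITH STATUSONLY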

Breaking it Down
The best way to manage operations on large data sets like this is to break them down into smaller
pieces and perform the operation piece-by-piece. To demonstrate how this can be done, let's
imagine that a daily ETL feed of millions of rows is placed into a staging table named
"new_purchase" and that our operation needs to copy all of these rows into the "purchase" table.
Then let's assume that inserting these rows will take a long time because the "purchase" table
already has millions of rows in it, has numerous indexes on it, and is actively used by the
production application.
So that we have some data to test our solution with, I will first create a "new_purchase" table and
populate it with a list of fake purchases:
CREATE TABLE new_purchase (
    purchase_date DATETIME,
    item VARCHAR(10),
    quantity INT,
    price DECIMAL(10,2)
)

-- populate a fake table with 100,000 new purchases
SET NOCOUNT ON
DECLARE @i AS INT
SET @i = 0
WHILE @i < 100000
BEGIN
    INSERT new_purchase
    SELECT Cast('2009-01-01' AS DATETIME) + Rand() AS purchase_date,
        'SKU' + Cast(Cast(Rand() * 10000000 AS BIGINT) AS VARCHAR(7)) AS item,
        Cast(Rand() * 10 + 1 AS INT) AS quantity,
        Cast(Rand() * 100 + .01 AS DECIMAL(10,2)) AS price
    SET @i = @i + 1
END
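As a quick sanity check (my own addition to the original script), you can confirm the staging table was populated:

--Should return 100000
SELECT Count(*) AS staged_rows FROM new_purchase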
Second, I will create a "purchase" table, which is the table that holds the cumulative list of all
purchases in the production system:
--Very large table that contains all purchases and is used by the production system
CREATE TABLE purchase (
    id UNIQUEIDENTIFIER,
    created_date DATETIME,
    purchase_date DATETIME,
    item VARCHAR(10),
    quantity INT,
    price DECIMAL(10,2),
    total DECIMAL(14,2)
)
And finally, I will create a stored procedure to copy 1,000 to 20,000 rows at a time from
the "new_purchase" table into the "purchase" table. I will also create a log table to track and
measure each of the copy operations.
--Create a table for tracking
CREATE TABLE load_new_purchase_log (
    start_time DATETIME,
    end_time DATETIME,
    row_count INT
)
GO

--Create a procedure to load new purchases in groups of 1,000-20,000 rows
CREATE PROCEDURE load_new_purchase
AS
BEGIN

    DECLARE @start_time AS DATETIME
    DECLARE @maximum_row AS BIGINT
    DECLARE @starting_row AS BIGINT
    DECLARE @rows_to_process_per_iteration AS INT
    DECLARE @rows_added AS INT
    DECLARE @message AS VARCHAR(100)

    --Define how many rows to process in each iteration
    --depending on whether the routine is running during peak hours or not
    SET @rows_to_process_per_iteration =
        CASE WHEN DatePart(Hour, GetDate()) BETWEEN 7 AND 17 THEN 1000 --Peak hours
        ELSE 20000 END --Off-Peak hours, load a larger number of records at a time

    --Determine how many rows need to be processed in total
    SET @maximum_row = (SELECT Count(*) FROM new_purchase)

    --If this process had already run, it may have already added some rows.
    --Determine where to start.
    SET @starting_row = IsNull((SELECT Sum(row_count) + 1
                                FROM load_new_purchase_log), 1)

    --Continue looping until all new_purchase records have been processed
    SET NOCOUNT OFF
    WHILE @starting_row <= @maximum_row
    BEGIN
        SET @message = 'Processing next '+
            Cast(@rows_to_process_per_iteration AS VARCHAR(100))+
            ' records starting at record '+
            Cast(@starting_row AS VARCHAR(100))+' (of '+
            Cast(@maximum_row AS VARCHAR(100))+')'
        PRINT @message

        SET @start_time = GetDate()

        --Insert the next set of rows and log it to the log table in the same transaction
        BEGIN TRANSACTION

        --Copy rows into purchase table from the new_purchase table
        INSERT purchase
        SELECT NewId() AS id, GetDate() AS created_date,
            purchase_date, item, quantity, price, quantity * price AS total
        FROM
        (
            SELECT *,
                Row_Number() OVER (ORDER BY purchase_date, item, quantity) AS rownumber
            FROM new_purchase
        ) AS new
        WHERE rownumber BETWEEN @starting_row
            AND @starting_row + @rows_to_process_per_iteration - 1

        --Log the records that have been added
        SET @rows_added = @@RowCount
        INSERT load_new_purchase_log (start_time, end_time, row_count)
        VALUES (@start_time, GetDate(), @rows_added)

        COMMIT TRANSACTION

        SET @starting_row = @starting_row + @rows_added

        --Define how many rows to process in each iteration
        --depending on whether the routine is running during peak hours or not
        IF DatePart(Hour, GetDate()) BETWEEN 7 AND 17
        BEGIN
            --Peak Hours
            --Load a small number of rows for each iteration
            SET @rows_to_process_per_iteration = 1000
            --Delay 10 seconds to lighten the load even further
            WAITFOR DELAY '00:00:10'
        END
        ELSE
        BEGIN
            --Off-Peak Hours
            --Load a large number of rows for each iteration
            SET @rows_to_process_per_iteration = 20000
        END

    END --end while statement

END --end stored proc


The main trick in the code is to use the Row_Number() function, which is available in SQL Server 2005
and later. This enables you to segment the large operation into smaller operations, each of which
touches only a subset of the rows. Note that SQL Server does not allow Row_Number() to be
referenced directly in the WHERE clause, so the following would produce an error:
SELECT NewId() as id, GetDate() as created_date,
    purchase_date, item, quantity, price, quantity * price as total
FROM new_purchase
WHERE Row_Number() OVER (ORDER BY purchase_date, item, quantity)
    BETWEEN @starting_row
    AND @starting_row + @rows_to_process_per_iteration - 1
That is why the stored procedure has to compute the Row_Number() in a sub-select statement.
Even though the sub-select contains this large SELECT statement, SQL Server performs
surprisingly well at filtering out just the requested records, especially if the source table
has an index with the same ordering as the ORDER BY in the OVER clause.
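For example, an index along these lines (not part of the original scripts, so consider it an optional tweak) matches the ORDER BY used in the OVER clause and makes it cheaper for SQL Server to produce the row numbers in order:

--Optional supporting index whose key order matches the OVER (ORDER BY ...) clause
CREATE INDEX ix_new_purchase_rownumber
ON new_purchase (purchase_date, item, quantity)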
The number of rows copied in each iteration is written to a log table. This enables the procedure
to restart where it left off if it is terminated before it can process all of the rows. The main
INSERT statement and the INSERT into the log table are wrapped in the same TRANSACTION
to ensure that they are committed to the database together. A
secondary benefit of writing to a log table is that it is helpful for reporting on the progress of the
operation and for benchmarking performance.
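For example, queries along these lines (my own sketch against the log table defined above) report overall progress and the throughput of each iteration:

--Overall progress of the load
SELECT Sum(row_count) AS rows_copied,
    Min(start_time) AS first_batch_started,
    Max(end_time) AS last_batch_finished
FROM load_new_purchase_log

--Approximate rows per second for each iteration
--(NullIf avoids a divide-by-zero for sub-second batches)
SELECT start_time, end_time, row_count,
    row_count / NullIf(DateDiff(second, start_time, end_time), 0) AS rows_per_second
FROM load_new_purchase_log
ORDER BY start_time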
You will also notice a couple of places in this code where it tests whether the current time falls
during peak hours (between 7AM and 5PM). If it does, the routine loads only 1,000 rows and
then pauses for 10 seconds on each iteration. If the current time is outside of peak hours, then
the routine loads up to 20,000 rows per iteration because there are more system resources
available for processing.
Now, to test the stored procedure I type:
EXEC load_new_purchase
And the following output is generated:
Processing next 1000 records starting at record 1 (of 100000)
Processing next 1000 records starting at record 1001 (of 100000)
Processing next 1000 records starting at record 2001 (of 100000)
I then disconnect that query to test the procedure's ability to be stopped and restarted with ease.
Then I rerun the stored procedure and let it complete. Running the following queries shows that
all 100,000 rows were copied correctly to the "purchase" table.
SELECT Count(*) as record_count_in_purchase_table FROM purchase
SELECT TOP 10 * FROM purchase
SELECT TOP 10 * FROM load_new_purchase_log

Wrap-Up
In situations where system performance is important and one needs to operate on a large number
of rows, breaking the operation down into smaller pieces has many benefits, including:

1. Fewer database locks and less contention.
2. Better performance for any other applications that share the affected tables.
3. Added robustness, so that the operation can easily and efficiently be stopped and restarted at will.
4. Visibility into the progress of the operation and automatic tracking of metrics that can then be used for benchmarking.
