Move Data from Table to File and Replace with URI
Applicability:
SQL Server 2000: Tested
SQL Server 2005: Tested
SQL Server 2008: Not Tested
SQL Server 2008R2: Not Tested
SQL Server 2012: Not Tested
Credits:
Author: ChillyDBA
Date: 29 Jul 2007
Description
Several years ago, I was presented with a mass data relocation challenge that needed to be performed on a 24*7 system.
The company concerned had, in its early days, made a data design decision that was now coming back to bite them hard. They had decided to store their bulk mailing information in the OLTP database, and to retain full mailing history for CRM purposes. A valid decision when the customer base was small and shipping volumes were low, but with daily emails approaching 30,000, the system was starting to creak. To compound the problem, not only was the text of the email stored, but the full 15KB HTML (including static images) was also streamed to another column in the table.
The challenge was to extract and store the email text and html components into a network file share, and to replace the data with a URI location string.
The task was broken down into several discrete steps:
Identify the appropriate method for extracting and saving the emails into files.
Change the application to put new emails in the same file based storage and to insert a URI in the previous columns.
Extract all emails up to point #2 to file
Backfill all table entries up to point #2 with URI data
Recover any free space in the table
Step #1 - Identify Method
I did consider using DTS/SSIS to loop through all files and perform the extraction, but this seemed to be a bit like using a sledgehammer to crack a nut. This task were going to have to be batched and monitored closely to avoid interfering with normal production activities, and it seemed like the BCP command line suited this better.
I chose to invest time in producing a parameterised script that would produce BCP batch scripts as a results set (batch script #1 below).
These batches could then be run from the command line by and administrator without the need to use xp_cmdshell or worry about the complexities of DTS/SSIS proxy account, parameter and logging configuration.
A specific file name generation algorithm was developed (function definitions below) that would be used by both the migration process and the ongoing storage by the application. This assumed a new folder for each day, and that file names would include the PK and the date that the email was queued for sending.
Step #2 - Change Application
This one was purely down to the development team.
Once the application change had been rolled out, no more emails would be stored in the table and an endpoint for the export and backfill process would be known.
Step #3 - Extract Emails to file
The batch generator was run to generate batches to extract 1 months worth of emails each.
The scope was from day #1 up to the point that step #2 was implemented.
The batches were then manually run in sequence.
Step #4 - Replace Email with URI in table
Once all files were confirmed as extracted, a modified version of the batch generator was used to overwrite the table version of the email with a URI that the application could use to access the email (batch script #2 below)
Step #5 - Recover unused space in table
An index rebuild was tried, but this still left unused space in the table rows, so ultimately the table was recreated from scratch and data copied across. This was encapsulated in a transaction and run during a time of lower impact (when shipping was closed)
Points to note:
This code is written with a specific table and column set in mind, but can easily be adapted for any table, or even just to archive data to offline files in batches. The basic table structure is:
EmailQueue
DateQueued DATETIME (date the email was added to the queue)
Textmessage VARCHAR(1000) (text of the message body)
HTMLMessage TEXT (entire HTML email)
I had originally saved this code in its working state, but obviously there needed to be some parameterisation and sanitizing of the code to keep anonymity of the company, so I created a test table with some data which is included as the last code block below, but the actual implementation code only contains <Tags> where values should be inserted as your server and folder names will be different.
Code
Function #1 - Construct File Path:
EmailQueueID INT (the PK for the table
UserID INT (the PK for the customer to whom the message applies)
DROP FUNCTION dbo.udf_GetDailyFolderPath
GO
CREATE FUNCTION dbo.udf_GetDailyFolderPath
(
@BaseFolderPath VARCHAR (100),
@Date DATETIME
)
RETURNS VARCHAR(200)
AS
BEGIN
RETURN
@BaseFolderPath
+ '\'
+ CONVERT(VARCHAR(4), DATEPART(yy,@Date))
+ '_'
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(mm,@Date)), 2)
+ '_'
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(dd,@Date)), 2)
END
-- select dbo.udf_GetDailyFolderPath ('D:\EmailStorage\', GETDATE())
Function #2 - Construct Record Identifier:
DROP FUNCTION dbo.udf_ConstructFileIdentifier
GO
CREATE FUNCTION dbo.udf_ConstructFileIdentifier
(
@PKID INT,
@Date DATETIME
)
RETURNS VARCHAR(200)
AS
BEGIN
RETURN
'_'
+ CONVERT(VARCHAR(10), @PKID)
+ '_'
+ CONVERT(VARCHAR(4), DATEPART(yy,@Date))
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(mm,@Date)), 2)
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(dd,@Date)), 2)
+ '_'
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(hh,@Date)), 2)
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(mi,@Date)), 2)
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(ss,@Date)), 2)
+ '.'
+ RIGHT('0000' + CONVERT(VARCHAR(4), DATEPART(ms,@Date)), 4)
END
-- select dbo.udf_ConstructFileIdentifier (123456, GETDATE())
Batch Script #1 - Create data export batch files:
/********************************************************
NOTE: The query hint (NOLOCK) is used throughout as
in this case, the email messages processed will all
have been finalised and not subject to further update, so a
non-repeatable read is OK. It reduces the risk of
locking during the batch processing.
*********************************************************/
SET NOCOUNT ON
DECLARE
@StartDate DATETIME,
@EndDate DATETIME,
@LoopEndDate DATETIME,
@EmailQueueID INT,
@DateQueued DATETIME,
@UserID INT,
@ServerName VARCHAR(100),
@BaseFolderPath VARCHAR(100),
@FolderPath VARCHAR(200),
@FileIdentifier VARCHAR(200),
@DelCommand VARCHAR(8000),
@CreateCommand VARCHAR(8000),
@TableName VARCHAR(100),
@ColumnName VARCHAR(100),
@FileSuffix VARCHAR(100)
-- table to contain the list of commands for the batch
-- will include standard DOS folder manipulation commands plus BCP commands
CREATE TABLE #Commands
(
CommandID INT IDENTITY(1,1),
Command VARCHAR(512)
)
/**********************************************************
Set all required variables here
***********************************************************/
SELECT
@StartDate = <BatchStartDate>, -- '1 Jan 2007'
@EndDate = <BatchEndDate>, -- '1 Feb 2007'
@ServerName = <ServerName>, -- 'MYSERVER'
@BaseFolderPath = <BaseFolderPath>, -- '\\MYSERVER\D$\EmailStorage'
@TableName = <TableName>, -- 'EmailQueue' *** MUST BE a 3-part name i.e. DB.Schema.Table ***
@ColumnName = <ColumnName>, -- 'TextMessage'
@FileSuffix = <FileSuffix>, -- '.txt'
@LoopEndDate = DATEADD(dd,1,@StartDate)
-- iterate through the time period by day
WHILE @LoopEndDate <= @EndDate
BEGIN
SELECT @FolderPath = dbo.udf_GetDailyFolderPath(@BaseFolderPath,@StartDate)
-- assume that the batch may need to be rerun or that the folder may not be present
-- so we need to drop and recreate the folder
INSERT #Commands (Command)
SELECT 'Del "' + @FolderPath + '" /Q'
INSERT #Commands (Command)
SELECT 'mkdir "' + @FolderPath + '"'
-- iterate through all of the messages in the batch scope
SELECT @EmailQueueID = MIN(EmailQueueID)
FROM EmailQueue (NOLOCK)
WHERE DateQueued >= @StartDate AND DateQueued < @LoopEndDate
WHILE @EmailQueueID IS NOT NULL
BEGIN
SELECT
@DateQueued = DateQueued,
@UserID = UserID -- this is the PK for the customer to whom the message was sent and will be used as the primary
-- way of accessing messages for CRM purposes
FROM EmailQueue (NOLOCK)
WHERE EmailQueueID = @EmailQueueID
INSERT #Commands (Command)
SELECT
'BCP "SELECT '
+ @Columnname
+ ' FROM '
+ @TableName
+ ' (NOLOCK) WHERE EmailQueueID =' + CONVERT(VARCHAR(10), @EmailQueueID)
+ ' " QUERYOUT "'
+ @FolderPath
+ '\EmailQueue_'
+ dbo.udf_ConstructFileIdentifier (@UserID, @DateQueued)
+ @FileSuffix
+ '" -c -b1000 -t \n -r , -S'
+ @ServerName
+ ' -T'
SELECT @EmailQueueID = MIN(EmailQueueID)
FROM EmailQueue (NOLOCK)
WHERE DateQueued >= @StartDate AND DateQueued < @LoopEndDate
AND EmailQueueID > @EmailQueueID
END
SELECT
@StartDate = DATEADD(dd,1,@StartDate),
@LoopEndDate = DATEADD(dd,1,@LoopEndDate)
END
-- now return all the commands in order
-- these must be saved to a .bat file ready for running later
SELECT Command
FROM #Commands
ORDER BY CommandID
DROP TABLE #Commands
SET NOCOUNT OFF
Batch Script #2 - Update columns to URI values:
SET NOCOUNT ON
DECLARE
@StartDate DATETIME,
@EndDate DATETIME,
@LoopEndDate DATETIME,
@EmailQueueID INT,
@DateQueued DATETIME,
@UserID INT,
@ServerName VARCHAR(100),
@BaseFolderPath VARCHAR(100),
@FolderPath VARCHAR(200),
@FileIdentifier VARCHAR(200),
@DelCommand VARCHAR(8000),
@CreateCommand VARCHAR(8000),
@TableName VARCHAR(100),
@ColumnName VARCHAR(100),
@TXTFileSuffix VARCHAR(100),
@HTMLFileSuffix VARCHAR(100),
@TXTURI VARCHAR(125),
@HTMLURI VARCHAR(125)
/**********************************************************
Set all required variables here
***********************************************************/
SELECT
@StartDate = <BatchStartDate>, -- '1 Jan 2007'
@EndDate = <BatchEndDate>, -- '1 Feb 2007'
@ServerName = <ServerName>, -- 'MYSERVER'
@BaseFolderPath = <BaseFolderPath>, -- '\\MYSERVER\D$\EmailStorage'
@TableName = <TableName>, -- 'EmailQueue'
@ColumnName = <ColumnName>, -- 'TextMessage'
@TXTFileSuffix = <FileSuffix>, -- '.txt'
@HTMLFileSuffix = <FileSuffix>, -- '.html'
@LoopEndDate = DATEADD(dd,1,@StartDate)
SELECT @FolderPath = dbo.udf_GetDailyFolderPath(@BaseFolderPath,@StartDate)
-- loop through all records within the batch date limits
-- use the table PK, as the User PK is irrelevant as we aren't accessing content by customer
SELECT @EmailQueueID = MIN(EmailQueueID)
FROM EmailQueue(NOLOCK)
WHERE DateQueued >= @StartDate
AND DateQueued < @EndDate
WHILE @EmailQueueID IS NOT NULL
BEGIN
SELECT
@DateQueued = DateQueued,
@UserID = UserID
FROM EmailQueue (NOLOCK)
WHERE EmailQueueID = @EmailQueueID
SELECT
@TXTURI =
@FolderPath
+ '\'
+ @TableName
+ dbo.udf_ConstructFileIdentifier (@UserID, @DateQueued)
+ @TXTFileSuffix ,
@HTMLURI =
@FolderPath
+ '\'
+ @TableName
+ dbo.udf_ConstructFileIdentifier (@UserID, @DateQueued)
+ @TXTFileSuffix
FROM EmailQueue (NOLOCK)
WHERE EmailQueueID = @EmailQueueID
UPDATE EmailQueue
SET TextMessage = @TXTURI,
HTMLMessage = @HTMLURI
WHERE EmailQueueID = @EmailQueueID
SELECT @EmailQueueID = MIN(EmailQueueID)
FROM EmailQueue(NOLOCK)
WHERE DateQueued >= @StartDate
AND DateQueued < @EndDate
AND EmailQueueID > @EmailQueueID
END
SET NOCOUNT OFF
Test Script - Create test table and data:
DROP TABLE [emailqueue]
go
CREATE TABLE [dbo].[emailqueue] (
[emailQueueId] INT IDENTITY (1, 1) NOT NULL ,
[userid] INT NOT NULL ,
[textMessage] TEXT,
[htmlMessage] TEXT,
datequeued DATETIME)
GO
INSERT emailqueue (userid, textMessage, [htmlMessage], datequeued)
SELECT 1, 'test message 1', 'html test mesage 1', '1 jun 2012'
UNION
SELECT 1, 'test message 2', 'html test mesage 2', '15 jun 2012'
UNION
SELECT 2, 'test message 3', 'html test mesage 3', '21 jun 2012'
UNION
SELECT 99, 'test message 4', 'html test mesage 4', '1 jul 2012'
UNION
SELECT 3, 'test message 5', 'html test mesage 5', '2 jul 2012'
UNION
SELECT 15, 'test message 6', 'html test mesage 6', '3 jul 2012'
UNION
SELECT 4, 'test message 7', 'html test mesage 7', '4 jul 2012'
UNION
SELECT 77, 'test message 8', 'html test mesage 8', '1 aug 2012'
UNION
SELECT 100, 'test message 9', 'html test mesage 9', '2 aug 2012'
UNION
SELECT 1111, 'test message 10', 'html test mesage 10', '3 aug 2012'