Move Data from Table to File and Replace with URI

Applicability:

SQL Server 2000: Tested

SQL Server 2005: Tested

SQL Server 2008: Not Tested

SQL Server 2008R2: Not Tested

SQL Server 2012: Not Tested

Credits:

Author: ChillyDBA

Date: 29 Jul 2007

Description

Several years ago, I was presented with a mass data relocation challenge that needed to be performed on a 24*7 system.

The company concerned had, in its early days, made a data design decision that was now coming back to bite them hard. They had decided to store their bulk mailing information in the OLTP database, and to retain full mailing history for CRM purposes. A valid decision when the customer base was small and shipping volumes were low, but with daily emails approaching 30,000, the system was starting to creak. To compound the problem, not only was the text of the email stored, but the full 15KB HTML (including static images) was also streamed to another column in the table.

The challenge was to extract and store the email text and html components into a network file share, and to replace the data with a URI location string.

The task was broken down into several discrete steps:

    1. Identify the appropriate method for extracting and saving the emails into files.

    2. Change the application to put new emails in the same file based storage and to insert a URI in the previous columns.

    3. Extract all emails up to point #2 to file

    4. Backfill all table entries up to point #2 with URI data

    5. Recover any free space in the table

Step #1 - Identify Method

I did consider using DTS/SSIS to loop through all files and perform the extraction, but this seemed to be a bit like using a sledgehammer to crack a nut. This task were going to have to be batched and monitored closely to avoid interfering with normal production activities, and it seemed like the BCP command line suited this better.

I chose to invest time in producing a parameterised script that would produce BCP batch scripts as a results set (batch script #1 below).

These batches could then be run from the command line by and administrator without the need to use xp_cmdshell or worry about the complexities of DTS/SSIS proxy account, parameter and logging configuration.

A specific file name generation algorithm was developed (function definitions below) that would be used by both the migration process and the ongoing storage by the application. This assumed a new folder for each day, and that file names would include the PK and the date that the email was queued for sending.

Step #2 - Change Application

This one was purely down to the development team.

Once the application change had been rolled out, no more emails would be stored in the table and an endpoint for the export and backfill process would be known.

Step #3 - Extract Emails to file

The batch generator was run to generate batches to extract 1 months worth of emails each.

The scope was from day #1 up to the point that step #2 was implemented.

The batches were then manually run in sequence.

Step #4 - Replace Email with URI in table

Once all files were confirmed as extracted, a modified version of the batch generator was used to overwrite the table version of the email with a URI that the application could use to access the email (batch script #2 below)

Step #5 - Recover unused space in table

An index rebuild was tried, but this still left unused space in the table rows, so ultimately the table was recreated from scratch and data copied across. This was encapsulated in a transaction and run during a time of lower impact (when shipping was closed)

Points to note:

This code is written with a specific table and column set in mind, but can easily be adapted for any table, or even just to archive data to offline files in batches. The basic table structure is:

EmailQueue

DateQueued DATETIME (date the email was added to the queue)

Textmessage VARCHAR(1000) (text of the message body)

HTMLMessage TEXT (entire HTML email)

I had originally saved this code in its working state, but obviously there needed to be some parameterisation and sanitizing of the code to keep anonymity of the company, so I created a test table with some data which is included as the last code block below, but the actual implementation code only contains <Tags> where values should be inserted as your server and folder names will be different.

Code

Function #1 - Construct File Path:

EmailQueueID INT (the PK for the table

UserID INT (the PK for the customer to whom the message applies)

DROP FUNCTION dbo.udf_GetDailyFolderPath

GO

CREATE FUNCTION dbo.udf_GetDailyFolderPath

(

@BaseFolderPath VARCHAR (100),

@Date DATETIME

)

RETURNS VARCHAR(200)

AS

BEGIN

RETURN

@BaseFolderPath

+ '\'

+ CONVERT(VARCHAR(4), DATEPART(yy,@Date))

+ '_'

+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(mm,@Date)), 2)

+ '_'

+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(dd,@Date)), 2)

END

-- select dbo.udf_GetDailyFolderPath ('D:\EmailStorage\', GETDATE())

Function #2 - Construct Record Identifier:

DROP FUNCTION dbo.udf_ConstructFileIdentifier

GO

CREATE FUNCTION dbo.udf_ConstructFileIdentifier

(

@PKID INT,

@Date DATETIME

)

RETURNS VARCHAR(200)

AS

BEGIN

RETURN

'_'

+ CONVERT(VARCHAR(10), @PKID)

+ '_'

+ CONVERT(VARCHAR(4), DATEPART(yy,@Date))

+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(mm,@Date)), 2)

+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(dd,@Date)), 2)

+ '_'

+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(hh,@Date)), 2)

+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(mi,@Date)), 2)

+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(ss,@Date)), 2)

+ '.'

+ RIGHT('0000' + CONVERT(VARCHAR(4), DATEPART(ms,@Date)), 4)

END

-- select dbo.udf_ConstructFileIdentifier (123456, GETDATE())

Batch Script #1 - Create data export batch files:

/********************************************************

NOTE: The query hint (NOLOCK) is used throughout as

in this case, the email messages processed will all

have been finalised and not subject to further update, so a

non-repeatable read is OK. It reduces the risk of

locking during the batch processing.

*********************************************************/

SET NOCOUNT ON

DECLARE

@StartDate DATETIME,

@EndDate DATETIME,

@LoopEndDate DATETIME,

@EmailQueueID INT,

@DateQueued DATETIME,

@UserID INT,

@ServerName VARCHAR(100),

@BaseFolderPath VARCHAR(100),

@FolderPath VARCHAR(200),

@FileIdentifier VARCHAR(200),

@DelCommand VARCHAR(8000),

@CreateCommand VARCHAR(8000),

@TableName VARCHAR(100),

@ColumnName VARCHAR(100),

@FileSuffix VARCHAR(100)

-- table to contain the list of commands for the batch

-- will include standard DOS folder manipulation commands plus BCP commands

CREATE TABLE #Commands

(

CommandID INT IDENTITY(1,1),

Command VARCHAR(512)

)

/**********************************************************

Set all required variables here

***********************************************************/

SELECT

@StartDate = <BatchStartDate>, -- '1 Jan 2007'

@EndDate = <BatchEndDate>, -- '1 Feb 2007'

@ServerName = <ServerName>, -- 'MYSERVER'

@BaseFolderPath = <BaseFolderPath>, -- '\\MYSERVER\D$\EmailStorage'

@TableName = <TableName>, -- 'EmailQueue' *** MUST BE a 3-part name i.e. DB.Schema.Table ***

@ColumnName = <ColumnName>, -- 'TextMessage'

@FileSuffix = <FileSuffix>, -- '.txt'

@LoopEndDate = DATEADD(dd,1,@StartDate)

-- iterate through the time period by day

WHILE @LoopEndDate <= @EndDate

BEGIN

SELECT @FolderPath = dbo.udf_GetDailyFolderPath(@BaseFolderPath,@StartDate)

-- assume that the batch may need to be rerun or that the folder may not be present

-- so we need to drop and recreate the folder

INSERT #Commands (Command)

SELECT 'Del "' + @FolderPath + '" /Q'

INSERT #Commands (Command)

SELECT 'mkdir "' + @FolderPath + '"'

-- iterate through all of the messages in the batch scope

SELECT @EmailQueueID = MIN(EmailQueueID)

FROM EmailQueue (NOLOCK)

WHERE DateQueued >= @StartDate AND DateQueued < @LoopEndDate

WHILE @EmailQueueID IS NOT NULL

BEGIN

SELECT

@DateQueued = DateQueued,

@UserID = UserID -- this is the PK for the customer to whom the message was sent and will be used as the primary

-- way of accessing messages for CRM purposes

FROM EmailQueue (NOLOCK)

WHERE EmailQueueID = @EmailQueueID

INSERT #Commands (Command)

SELECT

'BCP "SELECT '

+ @Columnname

+ ' FROM '

+ @TableName

+ ' (NOLOCK) WHERE EmailQueueID =' + CONVERT(VARCHAR(10), @EmailQueueID)

+ ' " QUERYOUT "'

+ @FolderPath

+ '\EmailQueue_'

+ dbo.udf_ConstructFileIdentifier (@UserID, @DateQueued)

+ @FileSuffix

+ '" -c -b1000 -t \n -r , -S'

+ @ServerName

+ ' -T'

SELECT @EmailQueueID = MIN(EmailQueueID)

FROM EmailQueue (NOLOCK)

WHERE DateQueued >= @StartDate AND DateQueued < @LoopEndDate

AND EmailQueueID > @EmailQueueID

END

SELECT

@StartDate = DATEADD(dd,1,@StartDate),

@LoopEndDate = DATEADD(dd,1,@LoopEndDate)

END

-- now return all the commands in order

-- these must be saved to a .bat file ready for running later

SELECT Command

FROM #Commands

ORDER BY CommandID

DROP TABLE #Commands

SET NOCOUNT OFF

Batch Script #2 - Update columns to URI values:

SET NOCOUNT ON

DECLARE

@StartDate DATETIME,

@EndDate DATETIME,

@LoopEndDate DATETIME,

@EmailQueueID INT,

@DateQueued DATETIME,

@UserID INT,

@ServerName VARCHAR(100),

@BaseFolderPath VARCHAR(100),

@FolderPath VARCHAR(200),

@FileIdentifier VARCHAR(200),

@DelCommand VARCHAR(8000),

@CreateCommand VARCHAR(8000),

@TableName VARCHAR(100),

@ColumnName VARCHAR(100),

@TXTFileSuffix VARCHAR(100),

@HTMLFileSuffix VARCHAR(100),

@TXTURI VARCHAR(125),

@HTMLURI VARCHAR(125)

/**********************************************************

Set all required variables here

***********************************************************/

SELECT

@StartDate = <BatchStartDate>, -- '1 Jan 2007'

@EndDate = <BatchEndDate>, -- '1 Feb 2007'

@ServerName = <ServerName>, -- 'MYSERVER'

@BaseFolderPath = <BaseFolderPath>, -- '\\MYSERVER\D$\EmailStorage'

@TableName = <TableName>, -- 'EmailQueue'

@ColumnName = <ColumnName>, -- 'TextMessage'

@TXTFileSuffix = <FileSuffix>, -- '.txt'

@HTMLFileSuffix = <FileSuffix>, -- '.html'

@LoopEndDate = DATEADD(dd,1,@StartDate)

SELECT @FolderPath = dbo.udf_GetDailyFolderPath(@BaseFolderPath,@StartDate)

-- loop through all records within the batch date limits

-- use the table PK, as the User PK is irrelevant as we aren't accessing content by customer

SELECT @EmailQueueID = MIN(EmailQueueID)

FROM EmailQueue(NOLOCK)

WHERE DateQueued >= @StartDate

AND DateQueued < @EndDate

WHILE @EmailQueueID IS NOT NULL

BEGIN

SELECT

@DateQueued = DateQueued,

@UserID = UserID

FROM EmailQueue (NOLOCK)

WHERE EmailQueueID = @EmailQueueID

SELECT

@TXTURI =

@FolderPath

+ '\'

+ @TableName

+ dbo.udf_ConstructFileIdentifier (@UserID, @DateQueued)

+ @TXTFileSuffix ,

@HTMLURI =

@FolderPath

+ '\'

+ @TableName

+ dbo.udf_ConstructFileIdentifier (@UserID, @DateQueued)

+ @TXTFileSuffix

FROM EmailQueue (NOLOCK)

WHERE EmailQueueID = @EmailQueueID

UPDATE EmailQueue

SET TextMessage = @TXTURI,

HTMLMessage = @HTMLURI

WHERE EmailQueueID = @EmailQueueID

SELECT @EmailQueueID = MIN(EmailQueueID)

FROM EmailQueue(NOLOCK)

WHERE DateQueued >= @StartDate

AND DateQueued < @EndDate

AND EmailQueueID > @EmailQueueID

END

SET NOCOUNT OFF

Test Script - Create test table and data:

DROP TABLE [emailqueue]

go

CREATE TABLE [dbo].[emailqueue] (

[emailQueueId] INT IDENTITY (1, 1) NOT NULL ,

[userid] INT NOT NULL ,

[textMessage] TEXT,

[htmlMessage] TEXT,

datequeued DATETIME)

GO

INSERT emailqueue (userid, textMessage, [htmlMessage], datequeued)

SELECT 1, 'test message 1', 'html test mesage 1', '1 jun 2012'

UNION

SELECT 1, 'test message 2', 'html test mesage 2', '15 jun 2012'

UNION

SELECT 2, 'test message 3', 'html test mesage 3', '21 jun 2012'

UNION

SELECT 99, 'test message 4', 'html test mesage 4', '1 jul 2012'

UNION

SELECT 3, 'test message 5', 'html test mesage 5', '2 jul 2012'

UNION

SELECT 15, 'test message 6', 'html test mesage 6', '3 jul 2012'

UNION

SELECT 4, 'test message 7', 'html test mesage 7', '4 jul 2012'

UNION

SELECT 77, 'test message 8', 'html test mesage 8', '1 aug 2012'

UNION

SELECT 100, 'test message 9', 'html test mesage 9', '2 aug 2012'

UNION

SELECT 1111, 'test message 10', 'html test mesage 10', '3 aug 2012'