Data Archiving/Removal/Strucutre - Move Data from Table to File and Replace with URI

Applicability:

                 SQL Server 2000:        Tested
                 SQL Server 2005:        Tested
                 SQL Server 2008:        Not Tested
                 SQL Server 2008R2:    Not Tested
                 SQL Server 2012:        Not Tested        

Credits:

Author:    ChillyDBA
Date:        29 Jul 2007

Description

Several years ago, I was presented with a mass data relocation challenge that needed to be performed on a 24*7 system.
 
The company concerned had, in its early days, made a data design decision that was now coming back to bite them hard.  They had decided to store their bulk mailing information in the OLTP database, and to retain full mailing history for CRM purposes.  A valid decision when the customer base was small and shipping volumes were low, but with daily emails approaching 30,000, the system was starting to creak.  To compound the problem, not only was the text of the email stored, but the full 15KB HTML (including static images) was also streamed to another column in the table.
 
The challenge was to extract and store the email text and html components into a network file share, and to replace the data with a URI location string.
 
The task was broken down into several discrete steps:
  1. Identify the appropriate method for extracting and saving the emails into files.
  2. Change the application to put new emails in the same file based storage and to insert a URI in the previous columns.
  3. Extract all emails up to point #2 to file
  4. Backfill all table entries up to point #2 with URI data
  5. Recover any free space in the table
 

Step #1 - Identify Method

I did consider using DTS/SSIS to loop through all files and perform the extraction, but this seemed to be a bit like using a sledgehammer to crack a nut.  This task were going to have to be batched and monitored closely to avoid interfering with normal production activities, and it seemed like the BCP command line suited this better. 
 
I chose to invest time in producing a parameterised script that would produce BCP batch scripts as a results set (batch script #1 below). 
These batches could then be run from the command line by and administrator without the need to use xp_cmdshell or worry about the complexities of DTS/SSIS proxy account, parameter and logging configuration.
 
A specific file name generation algorithm was developed (function definitions below) that would be used by both the migration process and the ongoing storage by the application.  This assumed a new folder for each day, and that file names would include the PK and the date that the email was queued for sending.

Step #2 - Change Application

This one was purely down to the development team.
Once the application change had been rolled out, no more emails would be stored in the table and an endpoint for the export and backfill process would be known.
 

Step #3 - Extract Emails to file

The batch generator was run to generate batches to extract 1 months worth of emails each. 
The scope was from day #1 up to the point that step #2 was implemented. 
The batches were then manually run in sequence.

Step #4 - Replace Email with URI in table

 Once all files were confirmed as extracted, a modified version of the batch generator was used to overwrite the table version of the email with a URI  that the application could use to access the email (batch script #2 below)

Step #5 - Recover unused space in table

An index rebuild was tried, but this still left unused space in the table rows, so ultimately the table was recreated from scratch and data copied across.  This was encapsulated in a transaction and run during a time of lower impact (when shipping was closed)
 
 

Points to note:

This code is written with a specific table and column set in mind, but can easily be adapted for any table, or even just to archive data to offline files in batches.  The basic table structure is:
 
  EmailQueue
        EmailQueueID INT                         (the PK for the table
        UserID             INT                        (the PK for the customer to whom the message applies)
        DateQueued    DATETIME              (date the email was added to the queue)
        Textmessage  VARCHAR(1000)     (text of the message body)
        HTMLMessage TEXT                      (entire HTML email)
 
I had originally saved this code in its working state, but obviously there needed to be some parameterisation and sanitizing of the code to keep anonymity of the company, so I created a test table with some data which is included as the last code block below, but the actual implementation code only contains <Tags> where values should be inserted as your server and folder names will be different.

Code

 Function  #1 - Construct File Path:

DROP FUNCTION dbo.udf_GetDailyFolderPath
GO

CREATE FUNCTION dbo.udf_GetDailyFolderPath
  
(
      
@BaseFolderPath VARCHAR (100),
      
@Date DATETIME
  
)
RETURNS VARCHAR(200)

AS
BEGIN
   RETURN
      
@BaseFolderPath
      
+ '\'
      
+ CONVERT(VARCHAR(4), DATEPART(yy,@Date))
       +
'_'
      
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(mm,@Date)), 2)

       +
'_'
      
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(dd,@Date)), 2)
END


-- select dbo.udf_GetDailyFolderPath ('D:\EmailStorage\', GETDATE())

 
 

 Function  #2 - Construct Record Identifier:

DROP FUNCTION dbo.udf_ConstructFileIdentifier
GO

CREATE FUNCTION dbo.udf_ConstructFileIdentifier
  
(
      
@PKID INT,
      
@Date DATETIME
  
)
RETURNS VARCHAR(200)

AS
BEGIN
   RETURN
          
'_'
          
+ CONVERT(VARCHAR(10), @PKID)
           +
'_'
          
+ CONVERT(VARCHAR(4), DATEPART(yy,@Date))
           +
RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(mm,@Date)), 2)
           +
RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(dd,@Date)), 2)
           +
'_'
          
+ RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(hh,@Date)), 2)
           +
RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(mi,@Date)), 2)
           +
RIGHT('00' + CONVERT(VARCHAR(2), DATEPART(ss,@Date)), 2)
           +
'.'
          
+ RIGHT('0000' + CONVERT(VARCHAR(4), DATEPART(ms,@Date)), 4)
END


-- select dbo.udf_ConstructFileIdentifier (123456, GETDATE())

 
 

 Batch Script #1 - Create data export batch files:

/********************************************************
NOTE:  The query hint (NOLOCK) is used throughout as
in this case, the email messages processed will all
have been finalised and not subject to further update, so a
non-repeatable read is OK.  It reduces the risk of
locking during the batch processing.
*********************************************************/

SET NOCOUNT ON

DECLARE
  
@StartDate          DATETIME,
  
@EndDate            DATETIME,
  
@LoopEndDate        DATETIME,
  
@EmailQueueID       INT,
  
@DateQueued         DATETIME,
  
@UserID             INT,
  
@ServerName         VARCHAR(100),
  
@BaseFolderPath     VARCHAR(100),
  
@FolderPath         VARCHAR(200),
  
@FileIdentifier     VARCHAR(200),
  
@DelCommand         VARCHAR(8000),
  
@CreateCommand      VARCHAR(8000),
  
@TableName          VARCHAR(100),
  
@ColumnName         VARCHAR(100),
  
@FileSuffix         VARCHAR(100)
  


-- table to contain the list of commands for the batch
-- will include standard DOS folder manipulation commands plus BCP commands
CREATE TABLE #Commands
(
CommandID INT IDENTITY(1,1),
Command    VARCHAR(512)
)


/**********************************************************
           Set all required variables here
***********************************************************/
SELECT    
  
@StartDate          = <BatchStartDate>,          -- '1 Jan 2007'
  
@EndDate            = <BatchEndDate>,            -- '1 Feb 2007'
  
@ServerName         = <ServerName>,              -- 'MYSERVER'
  
@BaseFolderPath     = <BaseFolderPath>,          -- '\\MYSERVER\D$\EmailStorage'
  
  
@TableName          = <TableName>,               -- 'EmailQueue'   ***  MUST BE a 3-part name i.e.  DB.Schema.Table  ***
  
@ColumnName         = <ColumnName>,              -- 'TextMessage'
  
@FileSuffix         = <FileSuffix>,              -- '.txt'
  
  
@LoopEndDate        = DATEADD(dd,1,@StartDate)


-- iterate through the time period by day
WHILE @LoopEndDate <= @EndDate
BEGIN
   SELECT  
@FolderPath = dbo.udf_GetDailyFolderPath(@BaseFolderPath,@StartDate)

  
-- assume that the batch may need to be rerun or that the folder may not be present
   -- so we need to drop and recreate the folder
  
INSERT #Commands (Command)
  
SELECT 'Del "' + @FolderPath + '" /Q'
  
  
INSERT #Commands (Command)
  
SELECT 'mkdir "' + @FolderPath + '"'
  
  
-- iterate through all of the messages in the batch scope
  
SELECT @EmailQueueID = MIN(EmailQueueID)
  
FROM EmailQueue (NOLOCK)
  
WHERE DateQueued >= @StartDate AND DateQueued < @LoopEndDate

  
  
WHILE @EmailQueueID IS NOT NULL
  
BEGIN
       SELECT  
          
@DateQueued = DateQueued,
          
@UserID = UserID          -- this is the PK for the customer to whom the message was sent and will be used as the primary
                                       -- way of accessing messages for CRM purposes
      
FROM EmailQueue (NOLOCK)
      
WHERE EmailQueueID = @EmailQueueID
  

      
INSERT #Commands (Command)
      
SELECT
          
'BCP "SELECT '
          
+ @Columnname
          
+ ' FROM '
          
+ @TableName
          
+ ' (NOLOCK) WHERE EmailQueueID =' + CONVERT(VARCHAR(10), @EmailQueueID)
           +
' " QUERYOUT "'
          
+ @FolderPath
          
+ '\EmailQueue_'
          
+ dbo.udf_ConstructFileIdentifier (@UserID, @DateQueued)
           +
@FileSuffix
          
+ '" -c -b1000 -t \n -r , -S'
          
+ @ServerName
          
+ ' -T'
  
      
SELECT @EmailQueueID = MIN(EmailQueueID)
      
FROM EmailQueue (NOLOCK)
      
WHERE DateQueued >= @StartDate AND DateQueued < @LoopEndDate
      
AND EmailQueueID > @EmailQueueID
  
END

   SELECT  
      
@StartDate = DATEADD(dd,1,@StartDate),
      
@LoopEndDate = DATEADD(dd,1,@LoopEndDate)
END

-- now return all the commands in order
-- these must be saved to a .bat file ready for running later
SELECT Command
FROM #Commands
ORDER BY CommandID


DROP TABLE #Commands

SET NOCOUNT OFF

 
 

 Batch Script #2 - Update columns to URI values:

SET NOCOUNT ON

DECLARE
  
@StartDate          DATETIME,
  
@EndDate            DATETIME,
  
@LoopEndDate        DATETIME,
  
@EmailQueueID       INT,
  
@DateQueued         DATETIME,
  
@UserID             INT,
  
@ServerName         VARCHAR(100),
  
@BaseFolderPath     VARCHAR(100),
  
@FolderPath         VARCHAR(200),
  
@FileIdentifier     VARCHAR(200),
  
@DelCommand         VARCHAR(8000),
  
@CreateCommand      VARCHAR(8000),
  
@TableName          VARCHAR(100),
  
@ColumnName         VARCHAR(100),
  
@TXTFileSuffix      VARCHAR(100),
  
@HTMLFileSuffix     VARCHAR(100),
  
@TXTURI             VARCHAR(125),
  
@HTMLURI            VARCHAR(125)



/**********************************************************
           Set all required variables here
***********************************************************/
SELECT    
  
@StartDate          = <BatchStartDate>,          -- '1 Jan 2007'
  
@EndDate            = <BatchEndDate>,            -- '1 Feb 2007'
  
@ServerName         = <ServerName>,              -- 'MYSERVER'
  
@BaseFolderPath     = <BaseFolderPath>,          -- '\\MYSERVER\D$\EmailStorage'
  
  
@TableName          = <TableName>,               -- 'EmailQueue'
  
@ColumnName         = <ColumnName>,              -- 'TextMessage'
  
@TXTFileSuffix      = <FileSuffix>,              -- '.txt'
  
@HTMLFileSuffix     = <FileSuffix>,              -- '.html'
  
  
@LoopEndDate        = DATEADD(dd,1,@StartDate)
  

SELECT     @FolderPath = dbo.udf_GetDailyFolderPath(@BaseFolderPath,@StartDate)

-- loop through all records within the batch date limits
-- use the table PK, as the User PK is irrelevant as we aren't accessing content by customer
SELECT  @EmailQueueID = MIN(EmailQueueID)
FROM EmailQueue(NOLOCK)
WHERE DateQueued >= @StartDate
AND DateQueued < @EndDate

  
WHILE @EmailQueueID IS NOT NULL
BEGIN
   SELECT  
      
@DateQueued = DateQueued,
      
@UserID = UserID
  
FROM EmailQueue (NOLOCK)
  
WHERE EmailQueueID = @EmailQueueID

  
SELECT
      
@TXTURI =
          
@FolderPath
          
+ '\'
          
+ @TableName
          
+ dbo.udf_ConstructFileIdentifier (@UserID, @DateQueued)
           +
@TXTFileSuffix ,
      
@HTMLURI =
          
@FolderPath
          
+ '\'
          
+ @TableName
          
+ dbo.udf_ConstructFileIdentifier (@UserID, @DateQueued)
           +
@TXTFileSuffix
  
FROM EmailQueue (NOLOCK)
  
WHERE EmailQueueID = @EmailQueueID

  
UPDATE EmailQueue
  
SET TextMessage = @TXTURI,
      
HTMLMessage = @HTMLURI
  
WHERE EmailQueueID = @EmailQueueID


  
SELECT @EmailQueueID = MIN(EmailQueueID)
  
FROM EmailQueue(NOLOCK)
  
WHERE DateQueued >= @StartDate
  
AND DateQueued < @EndDate
  
AND EmailQueueID > @EmailQueueID

END

SET NOCOUNT OFF

 
 

 Test Script - Create test table and data:

DROP TABLE [emailqueue]
go

CREATE TABLE [dbo].[emailqueue] (
  
[emailQueueId] INT IDENTITY (1, 1) NOT NULL ,
  
[userid] INT NOT NULL ,
  
[textMessage] TEXT,
  
[htmlMessage] TEXT,
  
datequeued DATETIME)
  
GO
  
INSERT emailqueue (userid, textMessage, [htmlMessage], datequeued)
SELECT 1, 'test message 1', 'html test mesage 1', '1 jun 2012'
UNION
SELECT
1, 'test message 2', 'html test mesage 2', '15 jun 2012'
UNION
SELECT
2, 'test message 3', 'html test mesage 3', '21 jun 2012'
UNION
SELECT
99, 'test message 4', 'html test mesage 4', '1 jul 2012'
UNION
SELECT
3, 'test message 5', 'html test mesage 5', '2 jul 2012'
UNION
SELECT
15, 'test message 6', 'html test mesage 6', '3 jul 2012'
UNION
SELECT
4, 'test message 7', 'html test mesage 7', '4 jul 2012'
UNION
SELECT
77, 'test message 8', 'html test mesage 8', '1 aug 2012'
UNION
SELECT
100, 'test message 9', 'html test mesage 9', '2 aug 2012'
UNION
SELECT
1111, 'test message 10', 'html test mesage 10', '3 aug 2012'
   
  

 
ċ
Move column from table to file storage - create test data.sql
(1k)
Andy Hughes,
Jun 13, 2012, 10:55 AM
ċ
Move column from table to file storage - part 1 - create batch extract commands.sql
(3k)
Andy Hughes,
Jun 13, 2012, 10:55 AM
ċ
Move column from table to file storage - part 2 - replace table data with URI.sql
(2k)
Andy Hughes,
Jun 13, 2012, 10:55 AM
ċ
Move column from table to file storage - udf_ConstructFileIdentifier.sql
(1k)
Andy Hughes,
Jun 13, 2012, 10:56 AM
ċ
Move column from table to file storage - udf_GetDailyFolderPath.sql
(0k)
Andy Hughes,
Jun 13, 2012, 10:56 AM
Comments