5 minute read

How to use JSON strings to detail the data transformations that have been applied to a record using T-SQL.

Overview

ETL is a big topic, but one critical aspect is to record the transformations that have been done to a piece of data.

When performing etl (elt, let, etc) in traditional databases, it is quite common to record the “load date” in one column, the “stage date” in another, and so on. Those columns quickly add up, and no matter how many columns you created there is always something else that in hindsight really ought to have been recorded.

Even in traditional databases, however, tracing the data lineage across transformations can quickly get quite complex. One datum may end up in a completely different place from its original column and row in a table.

In this context, recording the transformation history in json format is a handy option that can travel with the data wherever it goes. Json format is human-readable and if strict layout as rows and columns is wanted at a later stage, that can still be done with data stored in json format.

We can walk through a transact-sql example with SQL Server.

Core method

The heart of this approach is that once we have a json, we can modify it repeatedly to record the transformations that have been perfomed on the data.

The Microsoft documentations for json_modify gives a number of options, such as append, etc.

The example given here utilises the simplest option of all:

DECLARE @json nvarchar(max) = N'{}' ; 

SET @json = JSON_MODIFY( @json, '$.readme', 'All transformation operations applied to this data are recorded here' ) ; 

PRINT @json ; 

The technique for performing multiple json modify operations is conceptually similar to performing multiple text replace operations. Multiple json modifications are demonstrated in a gist on SQL to JSON datetime modifications.sql by Bert Wagner.

Example Tables

Let’s illustrate the concept of recording the data lineage with a very basic progression through three temporary tables - load, stage, and store. Each table will have a data lineage column in which to record transformation metadata:

CREATE TABLE #tmp_load_tbl ( 
 load_id int identity(-2147483648,1) not null, 
 load_detail nvarchar(255), 
 load_data_lineage nvarchar(4000),
 CONSTRAINT tmp_load_tbl_chk_transformation_isjson CHECK( ISJSON(load_data_lineage)=1 ), 
 CONSTRAINT tmp_load_tbl_pk PRIMARY KEY CLUSTERED ( load_id ) WITH(DATA_COMPRESSION=PAGE)    
) ; 

CREATE TABLE #tmp_stage_tbl ( 
 stage_id int identity(-2147483648,1) not null, 
 stage_detail nvarchar(255), 
 stage_data_lineage nvarchar(4000), 
 CONSTRAINT tmp_stage_tbl_chk_transformation_isjson CHECK( ISJSON(stage_data_lineage)=1 ), 
 CONSTRAINT tmp_stage_tbl_pk PRIMARY KEY CLUSTERED ( stage_id ) WITH(DATA_COMPRESSION=PAGE)      
) ; 

CREATE TABLE #tmp_store_tbl ( 
 store_id int identity(-2147483648,1) not null,  
 store_detail nvarchar(255), 
 store_data_lineage nvarchar(4000), 
 CONSTRAINT tmp_store_tbl_chk_transformation_isjson CHECK( ISJSON(store_data_lineage)=1 ), 
 CONSTRAINT tmp_store_tbl_pk PRIMARY KEY CLUSTERED ( store_id ) WITH(DATA_COMPRESSION=PAGE)    
) ;  

GO 

Procedures

Each table will have a corresponding procedure to populate it. They are almost - but not quite - identical.

CREATE PROC #tmp_load_upsert AS 

BEGIN 
SET NOCOUNT ON 
SET XACT_ABORT ON 
	
	 
	TRANSFORMATION_SEED: 
	
	DECLARE @new_transformation_record nvarchar(4000) = N'{}' ; 

	SELECT @new_transformation_record 
	
	TRANSFORMATION_DETAILS: 
	
	DECLARE @mod_transformation_record nvarchar(4000) = N'{}' ; 

	DECLARE @tbl_transformation  table ( 
	 transform_id int
	,transform_name nvarchar(550) 
	,transform_date_local datetime 
	,transform_date_utc datetime2(7) 
	) 

	INSERT INTO @tbl_transformation ( 
	 transform_id 
	,transform_name
	,transform_date_local 
	,transform_date_utc 
	) 
	SELECT 
	 transform_id = @@PROCID 
	,transform_name = OBJECT_NAME( @@PROCID ) 
	,transform_date_local = GetDate() 
	,transform_date_utc = SYSUTCDATETIME() 
	; 

	
	SET @mod_transformation_record = ( 
		SELECT 
		 transform_id as load_transform_id
		,transform_name as load_transform_name  
		,transform_date_local as load_transform_date_local  
		,transform_date_utc as load_transform_date_utc  
		FROM @tbl_transformation t 
		FOR JSON PATH, WITHOUT_ARRAY_WRAPPER, INCLUDE_NULL_VALUES  
	) 
	; 

	PRINT @mod_transformation_record 

	MAIN_QUERY: 

	TRUNCATE TABLE #tmp_load_tbl ; 
 

	INSERT INTO #tmp_load_tbl ( 
		 load_detail 
		,load_data_lineage 
	) 
	SELECT 
	 load_detail = 'this is a test record' 
	--,load_data_lineage = JSON_MODIFY( @new_transformation_record, '$.load', JSON_QUERY(@mod_transformation_record) ) 
	,load_data_lineage = 
		JSON_MODIFY( 
			JSON_MODIFY(
				JSON_MODIFY( 
					JSON_MODIFY( @new_transformation_record,  
					'$.load_transform_id', @@PROCID 
					), 
				'$.load_transform_name', OBJECT_NAME( @@PROCID ) 
				) , 
			'$.load_transform_date_local', CONVERT(nvarchar, GetDate(), 127)  /* --FORMAT( GetDate(), 'yyyy-MM-ddTHH:mm:ss.fff') --'yyyy-MM-ddTHH:mm:ssZZ' */  
			),  
		'$.load_transform_date_utc', CONVERT(nvarchar, SYSUTCDATETIME(), 127)  ----FORMAT( SYSUTCDATETIME(), 'yyyy-MM-ddTHH:mm:ss.fffZZ') -- 'O'  
		) 

END ; 


CREATE PROC #tmp_stage_upsert AS 

BEGIN 
SET NOCOUNT ON 
SET XACT_ABORT ON 
	 
	
	TRANSFORMATION_DETAILS: 
	
	DECLARE @mod_transformation_record nvarchar(4000) = N'' ;

	DECLARE @tbl_transformation  table ( 
	 transform_id int
	,transform_name nvarchar(550) 
	,transform_date_local datetime 
	,transform_date_utc datetime2(7) 
	) 

	INSERT INTO @tbl_transformation ( 
	 transform_id 
	,transform_name
	,transform_date_local 
	,transform_date_utc 
	) 
	SELECT 
	 transform_id = @@PROCID 
	,transform_name = OBJECT_NAME( @@PROCID ) 
	,transform_date_local = GetDate() 
	,transform_date_utc = SYSUTCDATETIME() 
	; 

	
	SET @mod_transformation_record = ( 
		SELECT 
		 transform_id as stage_transform_id
		,transform_name as stage_transform_name  
		,transform_date_local as stage_transform_date_local  
		,transform_date_utc as stage_transform_date_utc  
		FROM @tbl_transformation t 
		FOR JSON PATH, WITHOUT_ARRAY_WRAPPER, INCLUDE_NULL_VALUES  
	) 
	; 

	PRINT @mod_transformation_record  

	MAIN_QUERY: 

	TRUNCATE TABLE #tmp_stage_tbl ; 

	INSERT INTO #tmp_stage_tbl ( 
		 stage_detail 
		,stage_data_lineage 
	) 
	SELECT 
	 stage_detail = l.load_detail 
	,stage_data_lineage = 
		JSON_MODIFY( 
			JSON_MODIFY(
				JSON_MODIFY( 
					JSON_MODIFY( load_data_lineage,  
					'$.stage_transform_id', @@PROCID 
					), 
				'$.stage_transform_name', OBJECT_NAME( @@PROCID ) 
				) , 
			'$.stage_transform_date_local', CONVERT(nvarchar, GetDate(), 127)  
			),  
		'$.stage_transform_date_utc', CONVERT(nvarchar, SYSUTCDATETIME(), 127) 
		) 
	
	FROM #tmp_load_tbl as l 

END ; 


CREATE OR ALTER PROC #tmp_store_upsert AS 

BEGIN 
SET NOCOUNT ON 
SET XACT_ABORT ON 
	 
	

	TRANSFORMATION_DETAILS: 
	
	
	DECLARE @mod_transformation_record nvarchar(4000) = N'' ;

	DECLARE @tbl_transformation  table ( 
	 transform_id int
	,transform_name nvarchar(550) 
	,transform_date_local datetime 
	,transform_date_utc datetime2(7) 
	) 

	INSERT INTO @tbl_transformation ( 
	 transform_id 
	,transform_name
	,transform_date_local 
	,transform_date_utc 
	) 
	SELECT 
	 transform_id = @@PROCID 
	,transform_name = OBJECT_NAME( @@PROCID ) 
	,transform_date_local = GetDate() 
	,transform_date_utc = SYSUTCDATETIME() 
	; 

	
	SET @mod_transformation_record = ( 
		SELECT 
		 transform_id as store_transform_id
		,transform_name as store_transform_name  
		,transform_date_local as store_transform_date_local  
		,transform_date_utc as store_transform_date_utc  
		FROM @tbl_transformation t 
		FOR JSON PATH, WITHOUT_ARRAY_WRAPPER, INCLUDE_NULL_VALUES  
	) 
	;  

	PRINT @mod_transformation_record  

	MAIN_QUERY: 

	TRUNCATE TABLE #tmp_store_tbl ; 
 

	INSERT INTO #tmp_store_tbl ( 
		 store_detail 
		,store_data_lineage 
	) 
	SELECT 
	 store_detail = s.stage_detail 
	,store_data_lineage = 
		JSON_MODIFY( 
			JSON_MODIFY(
				JSON_MODIFY( 
					JSON_MODIFY( stage_data_lineage,  
					'$.store_transform_id', @@PROCID 
					), 
				'$.store_transform_name', OBJECT_NAME( @@PROCID ) 
				) , 
			'$.store_transform_date_local', CONVERT(nvarchar, GetDate(), 127)  
			),  
		'$.store_transform_date_utc', CONVERT(nvarchar, SYSUTCDATETIME(), 127) 
		) 
	
	FROM #tmp_stage_tbl as s

END ;

GO  

Let’s now run the procedures:

EXEC #tmp_load_upsert ; 
EXEC #tmp_stage_upsert ; 
EXEC #tmp_store_upsert ; 

Viewing the data lineage

While json is human-readable at a push, you really need a json-aware text editor to be able to see it laid out comprehensibly.

Query:


SELECT store_data_lineage FROM #tmp_store_tbl ; 

Result:

{
   "load_transform_id":-1499977385, 
   "load_transform_date_local":"2022-07-17T16:24:37.747", 
   "load_transform_date_utc":"2022-07-17T04:24:37.7486210", 
   "stage_transform_id":-1467977271, 
   "stage_transform_date_local":"2022-07-17T16:24:37.753", 
   "stage_transform_date_utc":"2022-07-17T04:24:37.7566145", 
   "store_transform_id":-1435977157, 
   "store_transform_date_local":"2022-07-17T16:24:37.760", 
   "store_transform_date_utc":"2022-07-17T04:24:37.7626108"
}

However we can also shred the json in tsql using the with clause to the openjson method.

View Definition:

SELECT 
 tbl.store_id 
,tbl.store_detail   

,js.load_transform_id 
,js.load_transform_name 
,js.load_transform_date_local 
,js.load_transform_date_utc 

,js.stage_transform_id 
,js.stage_transform_name 
,js.stage_transform_date_local 
,js.stage_transform_date_utc

,js.store_transform_id 
,js.store_transform_name 
,js.store_transform_date_local 
,js.store_transform_date_utc  

FROM #tmp_store_tbl as tbl
OUTER APPLY OPENJSON ( tbl.store_data_lineage ) 
WITH ( 
	 load_transform_id int N'$.load_transform_id' 
	,load_transform_name nvarchar(550) N'$.load_transform_name' 
	,load_transform_date_local datetime N'$.load_transform_date_local' 
	,load_transform_date_utc datetime2(7) N'$.load_transform_date_utc' 

	,stage_transform_id int N'$.stage_transform_id' 
	,stage_transform_name nvarchar(550) N'$.stage_transform_name' 
	,stage_transform_date_local datetime N'$.stage_transform_date_local' 
	,stage_transform_date_utc datetime2(7) N'$.stage_transform_date_utc' 

	,store_transform_id int N'$.store_transform_id' 
	,store_transform_name nvarchar(550) N'$.store_transform_name' 
	,store_transform_date_local datetime N'$.store_transform_date_local' 
	,store_transform_date_utc datetime2(7) N'$.store_transform_date_utc'

) as js   

View Result Table:

store_id store_detail load_transform_id load_transform_name load_transform_date_local load_transform_date_utc stage_transform_id stage_transform_name stage_transform_date_local stage_transform_date_utc store_transform_id store_transform_name store_transform_date_local store_transform_date_utc
-2147483648 this is a test record -1499977385 NULL 2022-07-17 16:24:37.747 2022-07-17 04:24:37.7486210 -1467977271 NULL 2022-07-17 16:24:37.753 2022-07-17 04:24:37.7566145 -1435977157 NULL 2022-07-17 16:24:37.760 2022-07-17 04:24:37.7626108

QED

Categories:

Updated: