This is a compilation of my friend Rekha Ram Burra's experiences while testing ETL applications. I just borrowed this information from him to share publicly on my Blog. Thanks a lot to Rekharam being kind enough to allow me to reuse this content for publishing in my blog.
This is a compilation of the test activities performed while testing an ETL application. Though ETL stands for Extract, Transform and Load, as the final target of the test team is to validate the data in destination database, Test team should only verify data between source and destination. It is not compulsory that Test team should verify how data is extracted and transformed. We can consider this as Black Box Testing.
Assumptions
There are four different databases (on same server or different servers) in place.
There can be changes to the above architecture. E.g. both the delta and staging databases can be combined to one single database still with different tables for delta and staging. The general architecture would be to maintain Staging, Factory and Warehouse databases separately (if not on different servers) to reduce load on the warehouse.
Test approach
Verify the data type of the columns between source and destination
Between Source and Delta
Test should make sure that the data type of each column of each table is as per the Functional Specification. If no specific details are mentioned in Functional Specification (FS) about the table schema, then test should make sure on the below:
Between Source and Factory
Between Factory and Warehouse
The schema of tables will be exactly same between Factory and Warehouse. The below query can be used to check the table schema. Please note, the below are linked queries assuming that the Factory and Warehouse databases are on different servers. You may have to add the servers as linked servers using SP_ADDLINKEDSERVER/ remove the linked servers if the databases are on same server.
DECLARE @FactColCount int, @RepColCount int, @MatchCount int
SELECT @FactColCount = COUNT(*) FROM <FactoryServerName>.<FactoryDatabaseName>.dbo.sysobjects so INNER JOIN <FactoryServerName>.<FactoryDatabaseName>.dbo.syscolumns sc ON so.id = sc.id WHERE so.type = 'U' AND so.name = <Factory Table name to check>
SELECT @RepColCount = COUNT(*) FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.Sysobjects so INNER JOIN <WarehouseServername>.<WarehouseDatabaseName>.dbo.syscolumns sc ON so.id = sc.id
WHERE so.type = 'U' AND so.name = <Warehouse Table name to check>
SELECT @MatchCount = COUNT(*) FROM <FactoryServerName>.<FactoryDatabaseName>.dbo.sysobjects so1, <FactoryServerName>.<FactoryDatabaseName>.dbo.syscolumns sc1, <WarehouseServername>.<WarehouseDatabaseName>.dbo.sysobjects so2, <WarehouseServername>.<WarehouseDatabaseName>.dbo.syscolumns sc2 WHERE so1.type = 'U' AND so1.id = sc1.id AND so1.name = <Factory Table name to check> AND so2.type = 'U' AND so2.id = sc2.id AND so2.name = <Warehouse Table name to check> AND (sc1.name = sc2.name AND sc1.colorder = sc2.colorder AND sc1.name = sc2.name AND sc1.length = sc2.length AND sc1.isNullable = sc2.isNullable)
IF (@FactColCount <> @RepColCount)
SELECT 'ERROR: The number of columns (' + convert(varchar(20), @FactColCount) + ') in Factory does not match with the number of columns (' + convert(varchar(20), @RepColCount) + ') in Warehouse'
ELSE
IF (@FactColCount <> @MatchCount)
SELECT 'ERROR: The number of columns (' + convert(varchar(20),@RepColCount) + ') in both Factory and Warehouse matches. But some of the attributes does not match'
ELSE
SELECT 'SUCCESS: The tables are identical in structure'
GO
Note: The above query uses SysObjects and SysColumns which are discouraged on SQL 2008. You may consider changing it as per your need
Verifying the data
Check the number of records in source and destination
A simple SELECT COUNT(*) FROM <Table Name> with needed WHERE condition should be enough to check whether there are same number of records on source and destination. But we cannot rely on just this query, reason: assume one record is duplicated into destination instead of another record. In this case, the number of records will be same but there is one record missing from source.
We may encounter two kinds of scenarios:
a. When the data in source and destination tables can be related using a Key
DECLARE @SrcRowCount int, @DestRowCount int
DECLARE @SrcExtraRows int, @DestExtraRows int
SELECT @SrcRowCount = COUNT(<ID Column>) FROM <Source Server>.<Source Database>.dbo.<SourceTable> gcd /*Any Other Join Conditions that are needed WHERE Conditions needed */
SELECT @DestRowCount = COUNT(*) FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
SELECT @SrcExtraRows = COUNT(*) FROM <Source Server>.<Source Database>.dbo.<SourceTable> gcd /* Any Other Join Conditions that are needed WHERE Conditions needed */
AND gcd.<ID Column> NOT IN
(
select <ID Column> from <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName> /*Do not forget to filter the Zero Key if the table has one*/
)
SELECT @DestExtraRows = COUNT(*) FROM <WarehouseServername>.<WarehouseDatabaseName>.dbo.<WarehouseTableName> WHERE <ID Column> NOT IN
(
SELECT gcd.<ID Column> FROM <Source Server>.<Source Database>.dbo.<SourceTable> gcd /* Any Other Join Conditions that are needed WHERE Conditions needed */
)
IF (@SrcRowCount <> @DestRowCount)
PRINT 'ERROR: The number of rows in destination table does not match with the number of rows in source.'
ELSE IF (@SrcExtraRows > 0)
PRINT 'ERROR: All the records of source table have not migrated to Destination table'
ELSE IF (@DestExtraRows > 0)
PRINT 'ERROR: There are extra Ids in Destination table that are not present in the Source.'
ELSE
PRINT 'SUCCESS: The tables have same row counts and same Ids'
b. When the data in source and destination tables cannot be related using a Key (There may be cases where every time the destination is just truncated and repopulated with the source data). In this case,
Verifying Straight Map columns
If there is only one source table for a destination Warehouse table, identify all the columns that are just copied from source and stored in destination. Use the below query to verify the column values
SELECT TOP 10 'Exp StraightMapColumn1' = Source.StraightMapColumn1, 'Act StraightMapColumn1' = Destination.StraightMapColumn1, FROM <Source Server>.<Source Database>.dbo.<Source Table> Source JOIN <WarehouseServername>.<WarehouseDatabaseName>.dbo.<Destination Table> Destination ON Source.<Id Column> = Destination.<Id Column> WHERE Source.StraightMapColumn1 <> Destination.StraightMapColumn1
If there are more than one source table from which few columns are just copied into destination table, we can group the columns by source table and use the above query for each group.
Verifying Derived Columns
Simple derivations like coalesce can be directly done using the same straight map query with a few changes. If there are any complex calculations for a column, then it is always suggested to create a temporary table only with two columns an Id column that can be used to map to the source and destination tables and the actual column that needs to be tested. Following the logic given in Functional Specification (FS) we have to populate the temporary table (This is similar to what dev do but Test also should do it to validate the column values). Once the temporary table is populated, we can use the same above straight map query to compare data between the temporary table and the destination table.
Verifying Zero Key row
There will be a record (may not be in every warehouse) with the surrogate key value as 0. This record contains the default values for each column. This record must be validated with the values given in the Functional Specification (FS).
Conclusion
This document gives a basic idea of the approach that can be followed to test the finally loaded data. There may be a lot other approaches for similar kind of application testing. Once you get used to the approach, you can further refine the approach by automating the test cases to pick up the column names automatically and using Dynamic SQL. Once accustomed with the queries, you can start looking into the code for any kind of discrepancies that might occur due to data (e.g. looking into queries that might generate duplicate data etc…)
Friday, August 21, 2009
ETL Testing Process Overview
Subscribe to:
Posts (Atom)