Friday, August 21, 2009

ETL Testing Process Overview

This is a compilation of my friend Rekha Ram Burra's experiences while testing ETL applications. I borrowed this information from him to share publicly on my blog. Thanks a lot to Rekharam for being kind enough to allow me to reuse this content here.

This is a compilation of the test activities performed while testing an ETL application. Though ETL stands for Extract, Transform and Load, the final target of the test team is to validate the data in the destination database, so the team need only verify data between source and destination. It is not compulsory for the test team to verify how the data is extracted and transformed; we can consider this black-box testing.

Assumptions

There are five databases (on the same server or on different servers) in place.

  1. At least one Source database system, which is on a different server than the databases below
  2. A Delta database that holds source data in tables with a schema similar to the source table schema
  3. A Staging database where the data needed for the transformations resides, along with any intermediate tables
  4. A Factory database where the data is loaded into tables after the required transformations
  5. A Warehouse database, which is the final database: the data warehouse itself

There can be changes to the above architecture; e.g., the Delta and Staging databases can be combined into one single database, still with different tables for delta and staging. The general architecture is to maintain the Staging, Factory and Warehouse databases separately (if not on different servers) to reduce the load on the warehouse.

Test approach

Verify the data type of the columns between source and destination

Between Source and Delta

The test team should make sure that the data type of each column of each table is as per the Functional Specification (FS). If the FS mentions no specific details about the table schema, then the test team should verify the below (see the sketch after this list):

  1. The data type of the source column and the destination column are the same
  2. The destination column length is equal to or greater than the source column length
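As an illustration, here is a minimal sketch of such a check using INFORMATION_SCHEMA.COLUMNS; the server, database and table names in angle brackets are placeholders, and the same query can be reused between Source and Factory. It lists the columns whose data types differ or whose destination length is smaller than the source length:

SELECT s.COLUMN_NAME,
       s.DATA_TYPE AS SourceType, d.DATA_TYPE AS DestType,
       s.CHARACTER_MAXIMUM_LENGTH AS SourceLength,
       d.CHARACTER_MAXIMUM_LENGTH AS DestLength
FROM <SourceServer>.<SourceDatabase>.INFORMATION_SCHEMA.COLUMNS s
INNER JOIN <DeltaServer>.<DeltaDatabase>.INFORMATION_SCHEMA.COLUMNS d
    ON s.COLUMN_NAME = d.COLUMN_NAME
WHERE s.TABLE_NAME = '<SourceTableName>'
  AND d.TABLE_NAME = '<DeltaTableName>'
  AND (s.DATA_TYPE <> d.DATA_TYPE
       OR d.CHARACTER_MAXIMUM_LENGTH < s.CHARACTER_MAXIMUM_LENGTH)
       -- CHARACTER_MAXIMUM_LENGTH is NULL for non-character types,
       -- so the length comparison only fires for character columns

An empty result means every common column passed both checks.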

Between Source and Factory

  1. The data type of the source column and the destination column are the same
  2. The destination column length is equal to or greater than the source column length
  3. If the data type differs between source and destination, then the test team should make sure the difference is documented and that no data is lost

Between Factory and Warehouse

The schema of the tables will be exactly the same between Factory and Warehouse. The below query can be used to check the table schema. Please note, these are linked-server queries, assuming that the Factory and Warehouse databases are on different servers. You may have to add the servers as linked servers using sp_addlinkedserver, or remove the server prefixes if the databases are on the same server.

DECLARE @FactColCount int, @RepColCount int, @MatchCount int

-- Number of columns in the Factory table
SELECT @FactColCount = COUNT(*)
FROM <FactoryServerName>.<FactoryDatabaseName>.dbo.sysobjects so
INNER JOIN <FactoryServerName>.<FactoryDatabaseName>.dbo.syscolumns sc ON so.id = sc.id
WHERE so.type = 'U' AND so.name = '<FactoryTableName>'

-- Number of columns in the Warehouse table
SELECT @RepColCount = COUNT(*)
FROM <WarehouseServerName>.<WarehouseDatabaseName>.dbo.sysobjects so
INNER JOIN <WarehouseServerName>.<WarehouseDatabaseName>.dbo.syscolumns sc ON so.id = sc.id
WHERE so.type = 'U' AND so.name = '<WarehouseTableName>'

-- Number of columns that match on name, ordinal position, length and nullability
SELECT @MatchCount = COUNT(*)
FROM <FactoryServerName>.<FactoryDatabaseName>.dbo.sysobjects so1,
     <FactoryServerName>.<FactoryDatabaseName>.dbo.syscolumns sc1,
     <WarehouseServerName>.<WarehouseDatabaseName>.dbo.sysobjects so2,
     <WarehouseServerName>.<WarehouseDatabaseName>.dbo.syscolumns sc2
WHERE so1.type = 'U' AND so1.id = sc1.id AND so1.name = '<FactoryTableName>'
  AND so2.type = 'U' AND so2.id = sc2.id AND so2.name = '<WarehouseTableName>'
  AND sc1.name = sc2.name
  AND sc1.colorder = sc2.colorder
  AND sc1.length = sc2.length
  AND sc1.isnullable = sc2.isnullable

IF (@FactColCount <> @RepColCount)
    SELECT 'ERROR: The number of columns (' + CONVERT(varchar(20), @FactColCount)
         + ') in Factory does not match the number of columns ('
         + CONVERT(varchar(20), @RepColCount) + ') in Warehouse'
ELSE IF (@FactColCount <> @MatchCount)
    SELECT 'ERROR: The column counts (' + CONVERT(varchar(20), @RepColCount)
         + ') in Factory and Warehouse match, but some of the column attributes do not match'
ELSE
    SELECT 'SUCCESS: The tables are identical in structure'

GO

Note: The above query uses sysobjects and syscolumns, which are deprecated in SQL Server 2005 and later (including SQL Server 2008). You may consider changing it as per your need; one possible rewrite against the newer catalog views is sketched below.
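For example, here is a minimal sketch of the Factory column count rewritten against sys.tables and sys.columns; the names in angle brackets remain placeholders, and the old colorder/length/isnullable attributes map to column_id, max_length and is_nullable:

-- Column count of the Factory table via the newer catalog views
DECLARE @FactColCount int

SELECT @FactColCount = COUNT(*)
FROM <FactoryServerName>.<FactoryDatabaseName>.sys.tables t
INNER JOIN <FactoryServerName>.<FactoryDatabaseName>.sys.columns c
    ON t.object_id = c.object_id
WHERE t.name = '<FactoryTableName>'

SELECT @FactColCount AS FactColCount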

Verifying the data

Check the number of records in source and destination

A simple SELECT COUNT(*) FROM <Table Name>, with the needed WHERE condition, should be enough to check whether the source and destination have the same number of records. But we cannot rely on this query alone: suppose one record is duplicated in the destination in place of another record. In that case the record counts are the same, yet one source record is missing from the destination.

We may encounter two kinds of scenarios:

a.    When the data in source and destination tables can be related using a Key

DECLARE @SrcRowCount int, @DestRowCount int
DECLARE @SrcExtraRows int, @DestExtraRows int

-- Row count in the source
SELECT @SrcRowCount = COUNT(gcd.<ID Column>)
FROM <SourceServer>.<SourceDatabase>.dbo.<SourceTable> gcd
/* any other join conditions and WHERE conditions needed */

-- Row count in the destination
SELECT @DestRowCount = COUNT(*)
FROM <WarehouseServerName>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>

-- Source rows whose Ids are missing from the destination
SELECT @SrcExtraRows = COUNT(*)
FROM <SourceServer>.<SourceDatabase>.dbo.<SourceTable> gcd
/* any other join conditions and WHERE conditions needed */
WHERE gcd.<ID Column> NOT IN
(
    SELECT <ID Column>
    FROM <WarehouseServerName>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
    /* do not forget to filter out the zero-key row if the table has one */
)

-- Destination rows whose Ids are missing from the source
SELECT @DestExtraRows = COUNT(*)
FROM <WarehouseServerName>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
WHERE <ID Column> NOT IN
(
    SELECT gcd.<ID Column>
    FROM <SourceServer>.<SourceDatabase>.dbo.<SourceTable> gcd
    /* any other join conditions and WHERE conditions needed */
)

IF (@SrcRowCount <> @DestRowCount)
    PRINT 'ERROR: The number of rows in the destination table does not match the number of rows in the source.'
ELSE IF (@SrcExtraRows > 0)
    PRINT 'ERROR: Not all records of the source table have been migrated to the destination table.'
ELSE IF (@DestExtraRows > 0)
    PRINT 'ERROR: There are extra Ids in the destination table that are not present in the source.'
ELSE
    PRINT 'SUCCESS: The tables have the same row counts and the same Ids.'

b.    When the data in source and destination tables cannot be related using a key (there may be cases where the destination is simply truncated and repopulated with the source data every time). In this case (see the sketch after this list):

  1. Get the count of records from the source, grouping by all columns
  2. Get the count of records from the destination, grouping by all columns
  3. Compare the counts from (1) and (2); they should be the same
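A minimal sketch of that comparison, where Col1 and Col2 stand in for the full column list; any row returned marks a group whose values or duplicate count differ between the two tables:

-- Groups present in the source whose (values, count) pair is absent in the destination
SELECT Col1, Col2, COUNT(*) AS RowCnt
FROM <SourceServer>.<SourceDatabase>.dbo.<SourceTable>
GROUP BY Col1, Col2
EXCEPT
SELECT Col1, Col2, COUNT(*) AS RowCnt
FROM <WarehouseServerName>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
GROUP BY Col1, Col2

Run the same query with the two tables swapped to catch groups that exist only in the destination; both directions together cover missing rows, extra rows and mismatched duplicates.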

Verifying Straight Map columns

If there is only one source table for a destination warehouse table, identify all the columns that are just copied from the source and stored in the destination. Use the below query to verify the column values; it returns rows where the values differ, so ideally it returns nothing.

SELECT TOP 10
    'Exp StraightMapColumn1' = Source.StraightMapColumn1,
    'Act StraightMapColumn1' = Destination.StraightMapColumn1
FROM <SourceServer>.<SourceDatabase>.dbo.<SourceTable> Source
INNER JOIN <WarehouseServerName>.<WarehouseDatabaseName>.dbo.<DestinationTable> Destination
    ON Source.<Id Column> = Destination.<Id Column>
WHERE Source.StraightMapColumn1 <> Destination.StraightMapColumn1

If there is more than one source table from which a few columns are just copied into the destination table, we can group the columns by source table and use the above query for each group.

Verifying Derived Columns

Simple derivations like COALESCE can be verified directly using the same straight-map query with a few changes. If there are any complex calculations for a column, then it is always suggested to create a temporary table with only two columns: an Id column that can be used to map to the source and destination tables, and the actual column that needs to be tested. Following the logic given in the Functional Specification (FS), we populate the temporary table (this is similar to what the developers do, but Test should do it independently to validate the column values). Once the temporary table is populated, we can use the same straight-map query to compare data between the temporary table and the destination table, as sketched below.
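A minimal sketch of this approach, with a hypothetical derived column built from FirstName and LastName; the derivation, table names and column names are placeholders for whatever the FS specifies:

-- Populate the expected values independently, following the FS logic
CREATE TABLE #Expected (IdColumn int, DerivedColumn varchar(200))

INSERT INTO #Expected (IdColumn, DerivedColumn)
SELECT <Id Column>,
       COALESCE(FirstName, '') + ' ' + COALESCE(LastName, '')   -- hypothetical FS derivation
FROM <SourceServer>.<SourceDatabase>.dbo.<SourceTable>

-- Straight-map style comparison between the expected values and the warehouse
SELECT TOP 10 'Exp' = e.DerivedColumn, 'Act' = d.<Derived Column>
FROM #Expected e
INNER JOIN <WarehouseServerName>.<WarehouseDatabaseName>.dbo.<WarehouseTableName> d
    ON e.IdColumn = d.<Id Column>
WHERE e.DerivedColumn <> d.<Derived Column>

DROP TABLE #Expected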

Verifying Zero Key row

There will be a record (though perhaps not in every warehouse) with the surrogate key value 0. This record contains the default values for each column and must be validated against the values given in the Functional Specification (FS); a quick check is sketched below.
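A one-query check, where the surrogate key column name is a placeholder; every column of the returned row should hold the default value documented in the FS:

SELECT *
FROM <WarehouseServerName>.<WarehouseDatabaseName>.dbo.<WarehouseTableName>
WHERE <Surrogate Key Column> = 0
-- compare each returned column value against the defaults given in the FS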

Conclusion

This document gives a basic idea of an approach that can be followed to test the finally loaded data; there may be many other approaches for testing similar kinds of applications. Once you get used to the approach, you can refine it further by automating the test cases to pick up the column names automatically using dynamic SQL (a flavour of this is sketched below). Once accustomed to the queries, you can start looking into the code for any kind of discrepancies that might occur due to data (e.g., looking into queries that might generate duplicate data).
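As a flavour of that refinement, here is a hedged sketch that builds a straight-map comparison dynamically from INFORMATION_SCHEMA.COLUMNS; everything in angle brackets is a placeholder, and real automation would also need to handle NULLs and non-comparable data types:

DECLARE @Cols nvarchar(max), @Sql nvarchar(max)

-- Build one "s.Col <> d.Col" predicate per column of the warehouse table
-- (variable concatenation is a common T-SQL shortcut for this)
SELECT @Cols = ISNULL(@Cols + ' OR ', '')
             + 's.' + COLUMN_NAME + ' <> d.' + COLUMN_NAME
FROM <WarehouseServerName>.<WarehouseDatabaseName>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = '<WarehouseTableName>'

-- Assemble and run the comparison; any rows returned are mismatches
SET @Sql = 'SELECT TOP 10 * '
         + 'FROM <SourceServer>.<SourceDatabase>.dbo.<SourceTable> s '
         + 'INNER JOIN <WarehouseServerName>.<WarehouseDatabaseName>.dbo.<WarehouseTableName> d '
         + 'ON s.<Id Column> = d.<Id Column> WHERE ' + @Cols

EXEC sp_executesql @Sql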

Monday, April 13, 2009

Project Vs Product Testing

Imagine a situation where you have just completed testing a project, and your new assignment is to test a product this time.

Now, what kind of shift you need to make in your approach, scope and areas of importance between testing a project and testing a product is what this whitepaper discusses.

Project Testing

Normally, a project is specific to a single client's requirements. Everything in a project revolves around that one client: their requirements, pain points, etc.

A project can be executed using any of the SDLC models, but the most preferred models for executing a project are:

  • Iterative
  • Prototype
  • Agile

Notice that the Waterfall model is not among the preferred SDLC models for executing a project. This is because a client's requirements keep changing as time passes; these changes need to be absorbed by the project team, and the final delivery should be made as per the changed requirements.

Listed below are the various types of testing that should be given importance and taken into consideration when testing a project.

Functionality Testing: The functionality of the application is of utmost importance in the case of a project, and it should exactly match the project specifications or requirements.

User Interface Testing: The user interface has to be tested for boundary conditions, tab order, any limitations on the data to be entered, punctuation, formats to be followed for data entry, ease of use, spelling and grammar, etc.

Security Testing: Nowadays more emphasis is given to security, due to the various security threats a project has to deal with once it is live in production. So the application has to be tested for security loopholes and unauthorized access in both the front end and the back end. Some examples of security checks are cross-site scripting, SQL injection, etc.

Performance and Load Testing: These are to be given due importance while testing a project. Performance benchmarks and the expected user load, along with the mix of users, should be clearly identified and defined before testing starts. Load and performance test tools are best suited for this kind of testing.

Concurrency Testing: Concurrency plays a vital role in financial-domain applications, and it is important even for non-financial applications. This testing is a little difficult to perform manually, so using automation tools is a good option. Concurrency scenarios should be clearly defined, and the automation tool should be configured properly to get the best results.

Product Testing

When we talk about a product, the requirements will be a common pool of requirements gathered from several clients/end users, then prioritized and frozen for a release/version.

The most preferred SDLC models for executing a product are

  • Waterfall
  • Incremental

Listed below are the various types of testing to be taken into consideration when testing a product, besides the normal functional, UI, load and performance testing.

Installation Testing: This testing focuses on the proper installation of the product based on the instructions provided in the installation guide. The installation should be simple and should involve minimal human intervention for greater customer satisfaction.

Failover/Recovery Testing: Every product should handle failures gracefully and recover from them with no or minimal loss of data. The main focus of this testing is to simulate possible failure scenarios and observe how the failure is handled, how recovery is made once the system is back to normal, and whether there is any data loss in the failure-and-recovery process.

Help/User Guide: For any product, the Help document or User Guide is at its heart, since the product can be purchased and used by anyone, and the first thing a user will refer to, whether for using the product or for solving a problem, is the Help or User Guide. These documents should be tested thoroughly for completeness and clarity, and all navigation, redirections and links should be checked for proper functioning.

Portability Testing: In this testing we check whether the product works correctly on various configurations: operating systems, hardware, etc.

User Friendliness/Usability: In this testing we check how easy or difficult it is for the user to use the product. Can people with disabilities use the product easily? Are any tools or options provided so that disabled users can use the application without problems?

Copyright Information and Packaging: Good packaging of a product attracts the interest of more users, and for a product, copyright information is of utmost importance: the organization can land in legal trouble if the copyright information is not proper.

Privacy and Confidentiality: These days a lot of emphasis is given to the privacy and confidentiality of the information handled by a product, so this aspect should be given due importance when testing a product.

Parallel Testing: This type of testing comes into the picture, and should be given due importance, when a product developed in one technology is redeveloped or migrated to a new technology. In this testing we compare the functional output of the new product with the output of the old product: for the same functional scenarios, we should get the same output from both versions.

Project Vs Product Testing

Below is a snapshot of some differences between a project and product

  • Project: Requirements will be specific to a single client
    Product: Requirements will be derived from a common pool of several clients/end users
  • Project: Iterative, Prototype and Agile SDLC models are most preferred
    Product: Waterfall and Incremental SDLC models are most preferred
  • Project: Requirements will be dynamic and may change very frequently
    Product: Requirements will be constant most of the time
  • Project: A project will be delivered to the client directly
    Product: A product will be released in the market and can be purchased by anyone
  • Project: Test automation is not encouraged where the requirements change very frequently
    Product: Test automation will help immensely in testing a product
  • Project: Projects are planned for immediate use, with less focus on future extensibility or versions
    Product: Products are planned to have several newer versions covering several years to come
  • Project: Project durations are shorter in comparison with products (maintenance/sustenance are exceptions)
    Product: Product durations are generally longer in comparison with projects

Conclusion

The conclusions we can derive from the above discussion are:

  • Testing a product and testing a project are not one and the same.
  • The target audiences for a product and a project are different; hence the testing strategy and approach should also differ.
  • Different SDLC models are preferred for projects and products.
  • Project testing and product testing should each focus on different areas.
  • Automation helps in some parts of project testing and in major parts of product testing.