The IBM Big Data and Analytics Hub website cited a case study in which a US insurance company estimated that 15% of its testing effort went into just collecting test data for the back-end and front-end systems.
To quote the study, “For every USD 14 million delivery by the software development and QA team, a hidden USD 3 million was being spent on data management. All data management tasks included moving data from back-end systems to identifying test data, data masking of sensitive data, skipped production defects due to unavailability of correct test data, manipulation of data for different scenarios, storage of test data.”
Test data management had become a big problem for the company and had to be solved. Solving it helped the insurance company save USD 400,000 annually in testing costs.
The above example clearly shows the importance of, and the need for, proper Test Data Management (TDM), also known as Software Test Data Management.
What is test data?
Two types of test data
- Static data: Data that does not change after being recorded. It usually comprises non-sensitive data such as city name, PIN code, etc.
- Dynamic data (transactional data): Data that can change after being recorded. It usually comprises sensitive data such as the client’s medical history, number of employees, etc.
For testing purposes, usually, a mix of static and dynamic data is needed. Data can be present in different formats, different databases and different types. Testing may require data from different sources according to a specific requirement of the Application Under Test (AUT).
Most of the data used for testing is production data, because it covers the full variety of data that an application may encounter in a live environment.
Now, imagine a scenario where transactional data containing credit card numbers, mobile numbers, and bank login credentials is handed to the testing team for testing purposes.
If such critical, high-risk data is misused, legal action by the customers is very likely.
So how do we test a business-critical banking application without raw production data, when unrealistic test data would let serious defects slip into production?
The answer is data-masking.
We use the production data, but only after masking or hiding the sensitive information. Masking falls under test data management (TDM), where the intent is to keep sensitive production data separate from the test data.
Let us understand a bit more about test data management (TDM).
What is test data management?
Informatica defines TDM as: “the creation of non-production data sets that reliably mimic an organization’s actual data so that system and application developers can perform rigorous and valid system tests.”
In simple terms, test data management (TDM) is a process that covers the planning, design, storage, and retrieval of test data. TDM ensures that test data is of high quality, in the appropriate quantity, in the proper format, and available in time to meet the needs of testing.
To create test data there are three approaches:
Copy production data
i. The actual production databases are copied or cloned in this approach.
ii. Due to the large size of the production database, it is a time-consuming process.
iii. It creates a dependency on the production environment, because the testing and development teams cannot create the test data themselves.
iv. It is a high-risk process because customers’ sensitive data is at stake. If a data breach happens, the resulting legal proceedings may hurt the business badly.
Synthetic test data generation
i. A database administrator (DBA) creates and runs SQL queries on the database tables to gather the required test data.
ii. The expertise of the DBA is crucial, because extensive knowledge of the schema, the relationships, and the database is required.
iii. It is time-consuming, because writing the queries and running them on the database both take time.
iv. The DBA also needs to include all the negative and boundary-value conditions in the test data.
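The last point, adding negative and boundary-value conditions, can be illustrated with a small sketch. This is not the SQL-based workflow described above; it is a plain Python generator, and the `generate_accounts` function, the `account_id`/`balance` fields, and the 0 to 1000 range are all hypothetical names and values chosen only for illustration.

```python
import random
from typing import Dict, List


def boundary_values(minimum: int, maximum: int) -> List[int]:
    """Values just outside, on, and just inside each limit."""
    return [minimum - 1, minimum, minimum + 1,
            maximum - 1, maximum, maximum + 1]


def generate_accounts(count: int, min_balance: int, max_balance: int,
                      seed: int = 42) -> List[Dict]:
    """Build `count` random valid rows plus one row per boundary value.

    The schema (account_id, balance) is an assumption for this sketch.
    """
    rng = random.Random(seed)  # fixed seed keeps the data set repeatable
    rows = []
    for i in range(count):
        rows.append({
            "account_id": f"ACC{i:05d}",
            "balance": rng.randint(min_balance, max_balance),
            "expected_valid": True,
        })
    # Append boundary and negative cases, flagged with the expected outcome.
    for j, value in enumerate(boundary_values(min_balance, max_balance),
                              start=count):
        rows.append({
            "account_id": f"ACC{j:05d}",
            "balance": value,
            "expected_valid": min_balance <= value <= max_balance,
        })
    return rows


if __name__ == "__main__":
    for row in generate_accounts(3, 0, 1000):
        print(row)
```

Recording the expected outcome next to each generated row lets a test assert directly against the data set instead of re-deriving the rule.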
Data subset creation
i. Unlike the data cloning approach, only selected subsets of the production database are copied, not the whole database.
ii. This approach is time-efficient because only a subset is copied rather than the whole database.
iii. Skilled people are required to decide what data should be copied.
iv. Data masking is an important step in data subset creation. The sensitive data is masked, to rule out any data mishandling.
v. Data subset creation is the most used data creation approach in the test data management process. The other two approaches are usually avoided due to the cost involved and data sensitivity.
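A minimal sketch of subset creation, assuming a toy schema with hypothetical `customers` and `orders` tables held in in-memory SQLite databases. It copies only the customers that match a filter and then only the orders that belong to them, so the subset stays referentially consistent. Masking is left out here and would still have to be applied before the subset reaches testers.

```python
import sqlite3

# Hypothetical schema used by both the source and the subset database.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL
);
"""


def build_source() -> sqlite3.Connection:
    """Stand-in for a production database, with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "Pune"), (2, "Delhi"), (3, "Pune")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(10, 1, 99.5), (11, 2, 20.0),
                      (12, 3, 5.0), (13, 2, 7.5)])
    conn.commit()
    return conn


def copy_subset(source: sqlite3.Connection, city: str) -> sqlite3.Connection:
    """Copy customers from one city, plus only their orders."""
    target = sqlite3.connect(":memory:")
    target.executescript(SCHEMA)
    customers = source.execute(
        "SELECT id, city FROM customers WHERE city = ?", (city,)
    ).fetchall()
    target.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    ids = [row[0] for row in customers]
    if ids:
        # One placeholder per selected customer id.
        placeholders = ",".join("?" for _ in ids)
        orders = source.execute(
            "SELECT id, customer_id, amount FROM orders "
            f"WHERE customer_id IN ({placeholders}) ORDER BY id",
            ids,
        ).fetchall()
        target.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    target.commit()
    return target


if __name__ == "__main__":
    subset = copy_subset(build_source(), "Pune")
    print(subset.execute("SELECT * FROM orders").fetchall())
```

Selecting parent rows first and child rows second is what keeps the subset free of orphaned records; deciding which parent rows to select is the part that needs the skilled people mentioned above.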
Steps for test data management
Analysis of Data requirement
Test data may be needed on different interfaces of the application, and the format and type of the data may differ between these interfaces.
So, the first step is to understand the data requirements of the organization based on the test cases that will be run. This requires knowledge of the domain, the business, and all the applications involved in the end-to-end process.
For a banking application, for instance, the person analyzing the test data requirements should have expertise in the banking domain, in CRM and financial applications, and in messaging systems.
Data subset creation
As we have seen above, this is the most widely used data creation technique. The real production data is copied to provide different subsets which accommodate all the test data requirements.
Data Masking
Because we are dealing with sensitive production data, it is really important to hide customer data such as medical history, bank login information, phone numbers, and credit/debit card information. Any failure to protect sensitive data may lead to compliance and regulatory issues.
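As a rough illustration of masking, the sketch below hides all but the last few digits of card and phone numbers and replaces a login with a deterministic token. The field names (`card_number`, `phone`, `login`), the salt value, and the helper functions are hypothetical; a real project would follow its own compliance rules and would normally rely on a dedicated masking tool.

```python
import hashlib
import re
from typing import Dict


def mask_digits(value: str, keep_last: int) -> str:
    """Replace every digit except the last `keep_last` with '*'.

    Non-digit characters (spaces, dashes) are dropped first. Values with
    `keep_last` digits or fewer are masked completely.
    """
    digits = re.sub(r"\D", "", value)
    if len(digits) <= keep_last:
        return "*" * len(digits)
    return "*" * (len(digits) - keep_last) + digits[-keep_last:]


def pseudonymize(value: str, salt: str) -> str:
    """Deterministic token: the same input always yields the same output,
    so records can still be matched across tables after masking."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def mask_record(record: Dict[str, str],
                salt: str = "test-env-salt") -> Dict[str, str]:
    """Return a masked copy; the original record is left untouched."""
    masked = dict(record)
    masked["card_number"] = mask_digits(record["card_number"], keep_last=4)
    masked["phone"] = mask_digits(record["phone"], keep_last=2)
    masked["login"] = pseudonymize(record["login"], salt)
    return masked


if __name__ == "__main__":
    sample = {  # made-up values, not real customer data
        "card_number": "4111 1111 1111 1234",
        "phone": "555-010-4477",
        "login": "jane.doe",
    }
    print(mask_record(sample))
```

Keeping the last few digits lets testers tell records apart, while the deterministic token keeps joins between tables working on the masked data.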
Automation and tools
In TDM, automation can be used to perform the above tasks of data cloning, data generation, and data masking. If done manually, all these steps are very time-consuming and error-prone, because we are dealing with huge volumes of data.
Maintenance and Refresh
There is a central repository of the test data, with rules for access and privileges. The test data needs a periodic refresh so that it reflects the latest and most relevant data. If multiple modules in a project use the same test data repository, a properly managed refresh cycle is a necessity.
Along with data refresh, the maintenance of the repository is also very important. Over a period of time, the test data may become obsolete or redundant. There has to be proper maintenance of the test data to keep it consistent, correct and available over time.
Otherwise, such data will hold unnecessary storage space in the repository and the search for relevant test data may take longer than expected.
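One way to picture repository maintenance is a cleanup pass that flags obsolete and redundant data sets. The sketch below is only an assumption about how such a pass might look: the `DataSet` record, the checksum-based duplicate check, and the 90-day idle limit are all hypothetical choices, not part of any particular TDM tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class DataSet:
    name: str
    checksum: str    # identical checksum means identical content
    last_used: date


def plan_cleanup(datasets: List[DataSet], today: date,
                 max_idle_days: int = 90
                 ) -> Tuple[List[DataSet], List[DataSet]]:
    """Split data sets into (keep, remove).

    A data set is removed when it has been idle longer than
    `max_idle_days` (obsolete) or when a more recently used data set
    with the same checksum is already being kept (redundant).
    """
    cutoff = today - timedelta(days=max_idle_days)
    keep: List[DataSet] = []
    remove: List[DataSet] = []
    seen_checksums = set()
    # Visit the most recently used data sets first, so the freshest
    # copy of any duplicated content is the one that survives.
    for ds in sorted(datasets, key=lambda d: d.last_used, reverse=True):
        if ds.last_used < cutoff or ds.checksum in seen_checksums:
            remove.append(ds)
        else:
            keep.append(ds)
            seen_checksums.add(ds.checksum)
    return keep, remove


if __name__ == "__main__":
    repo = [
        DataSet("payments_v2", "abc", date(2024, 5, 1)),
        DataSet("payments_copy", "abc", date(2024, 4, 1)),
        DataSet("legacy_claims", "def", date(2023, 1, 1)),
    ]
    keep, remove = plan_cleanup(repo, today=date(2024, 6, 1))
    print([d.name for d in keep], [d.name for d in remove])
```

Returning a plan rather than deleting anything directly leaves room for a review step before data is actually dropped from the shared repository.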
Why test data management is so important
Having a dedicated test data management team and a systematic TDM process in place has immense benefits for the organization and the customer.
Test data management benefits
- Increased test data coverage: TDM provides traceability from test data to test cases and then to requirements. This gives a bird’s-eye view of the test data coverage and the defect patterns.
- Cost reduction by finding the bugs early: With the better test data coverage and the clearer picture that traceability provides, defects are caught earlier in the cycle, when they are cheaper to fix.
- Data is provisioned based on testing type: A distinctive feature of a TDM process is that the data is managed in one place. From the same repository, appropriate data can be provisioned for different testing types such as functional, integration, and performance testing. This reduces redundant data copies, and hence the cost of storage.
- Reusability of data: Reusability is the most valuable feature of TDM, as it reduces cost further. The reusable data is identified and archived in a central repository for future use. Whenever the need for reusable data arises, the testers can use the archived data.
- To reduce copies of the data: In a project, multiple teams may each make their own copies of the same production data. This results in redundant copies of the same data and wasted storage space. With TDM, all the teams use the same repository, so storage space is used efficiently.
- Customer’s trust: The key advantages of the TDM process are quality data and very good data coverage. The result is a stable, high-quality application with minimal production defects. The customer’s trust in the organization increases when the customer sees such results from adopting a TDM process.
Conclusion
Test data creation is performed by the testing team, but the testing team usually does not have direct access to the production data. Even if the production data is provided, it is a large chunk of raw data. This raw data cannot be used directly for testing purposes; considerable effort is needed to sort, manage, and tailor the data for use.
High-quality data is a basic need if we are planning to have high-quality software testing. Average data quality will produce mediocre testing results, and no one wants that. Test data management is the best way to resolve all these problems.
With Agile and DevOps, testing cycles are getting shorter. Creating quality data within such a cycle, alongside performing the software testing itself, can get really complex. For reducing the cost, time, and effort of the testing cycle, test data management seems to be an ideal solution, with visible results. This instills a sense of satisfaction and trust in the customer, and better business is the outcome.