This article was originally published on the Shipyard Blog
Crafting a good test dataset can be an intimidating task, especially when you're starting from scratch. You want your data to reflect production without sacrificing anonymity or ballooning in size. We've seen many engineers create test data, and we've learned some valuable lessons along the way, some of which might help you with the initial challenges of building a strong dataset.
Why good test data is important
When skimming through a database, you've definitely noticed the variability within every single column. Sure, most points fall within a normal range, but any column still has quite a few outliers. These matter: they're your edge and corner cases, and you need to test for them. (Cue that article where American Airlines auto-designated a 101-year-old woman as an infant.) The last thing you need is a user hitting a bug because they're using the same rewards code they've had since 1986, and it has one fewer digit than your system expects.
As mere humans, we’re frankly not the strongest when it comes to generating randomness. And while computers aren’t capable of true randomness, they can approximate it pretty closely. So when it comes time to create test data, it’s hard for us to predict those edge and corner cases and think of every little possible abnormality.
But if our test data doesn't reflect the range and randomness of production, we're not getting our money's (or CI runners') worth out of our end-to-end tests. You want to test for every possible scenario, because a scenario you miss is an order of magnitude more detrimental, expensive, and reputation-damaging when it breaks something in production. The quality of your data goes hand-in-hand with the quality of your testing; both are crucial when it comes to delivering reliable, high-quality software.
Creating your test data
Your test dataset is something that you’ll probably never quite be done with. You’ll keep iterating and improving it, as well as factoring in new data points to try and break things. But everyone has to start somewhere. Here are some popular methods we’ve found engineers using to create their initial test datasets.
The human way
Many engineers are still creating test data by hand — row by row. This is tricky, but it’s the best way to control what’s included in your test database. The major downsides are that this is extremely time-consuming, and (as a human) you won’t be able to make this data as random as you might think.
This is a great practice for teams that need small datasets and are pretty familiar with the profile of the data they’re working with.
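If you do build data by hand, it pays to deliberately include the oddballs alongside the typical rows. Here's a small illustrative fixture in Python; the fields and values are hypothetical, but they mirror the kinds of edge cases mentioned above (a very old account, a legacy code with one fewer digit):

```python
# A tiny hand-written fixture, sketched as a list of dicts. The field names
# and values are illustrative, not from a real schema.
test_users = [
    {"name": "Avery Smith", "age": 34, "rewards_code": "100482917"},  # typical row
    {"name": "Ruth Ellis", "age": 101, "rewards_code": "10048291"},   # edge case: very old account, legacy code one digit short
    {"name": "Li Wei", "age": 18, "rewards_code": ""},                # corner case: missing rewards code
]
```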
The “classic” automated way
For larger datasets, you might want to use automation to your advantage. You can set up a script pretty easily to generate fake test data in the format you need. The Faker Python library comes in clutch here. Faker offers a number of official and community "providers" you can import to generate data specific to your industry or application (e.g., automobile, geo, passport).
Generating data this way is super efficient — your team is probably pretty comfortable with Python, and this makes it easy to continually iterate/customize the format as you go.
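For example, here's a minimal sketch using Faker to generate a batch of fake user rows; the column names and row count are illustrative, and you'd swap in whichever providers fit your domain:

```python
# A minimal Faker sketch: generate 500 fake user rows with reproducible output.
# The columns here are assumptions; adapt them to your own schema.
from faker import Faker

Faker.seed(42)  # seed so the "random" data is reproducible across runs
fake = Faker()

rows = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address().replace("\n", ", "),
        "signup_date": fake.date_between(start_date="-5y", end_date="today"),
    }
    for _ in range(500)
]
```

From here you can dump the rows to CSV, JSON, or SQL inserts, whatever your test harness expects.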
The GenAI way
If there's one thing generative AI consistently gets right, it's learning from patterns. Show an LLM a sample from your database and it can quickly approximate it into a larger test database, and since it has learned from all kinds of existing databases, it can often produce a realistic result.
However, anyone who has used LLMs knows their big weakness: they tend to hallucinate. Double-check the output to make sure the values look realistic. It's also a good idea to anonymize the resulting database, in the likely event that the LLM leaked real names, emails, addresses, or other PII into it. Tools like Tonic Structural and Neosync use LLMs to generate synthetic data from your production data, but with guardrails to keep the data anonymized and realistic.
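As a rough illustration, here's one way you might prompt a model to extend a small, already-anonymized sample; the openai client and model name are assumptions, so substitute whatever provider your team uses, and never paste real PII into a prompt:

```python
# A hedged sketch of asking an LLM to extend an anonymized CSV sample.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

sample_rows = """name,email,plan,signup_date
Jane Doe,jane@example.com,pro,2021-04-12
John Roe,john@example.com,free,2023-01-30"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: use whichever model is available to you
    messages=[
        {
            "role": "user",
            "content": (
                "Here is a small CSV sample of an anonymized users table:\n"
                f"{sample_rows}\n\n"
                "Generate 50 more rows in the same CSV format, keeping the same "
                "columns and realistic value distributions."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```

Treat the output as a first draft: review it for hallucinated or leaked-looking values before it goes anywhere near your test suite.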
The sampling way
The best approximation of your production data is, well, a sample of your production data. You can select a small subset using a technique like stratified random sampling, making sure your limited selection reflects the variance of the full dataset.
From here, you'll want to be very careful, as this is genuinely sensitive information. Remove all PII and substitute it with synthetic values from Faker. Then shuffle the remaining columns independently so that no entry corresponds, even approximately, to a real record.
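Here's a rough pandas sketch of that workflow, assuming a DataFrame pulled from production with hypothetical "name", "email", "plan", and "signup_date" columns; adjust the column names and sampling fraction to fit your schema:

```python
# A minimal sketch of stratified sampling plus PII scrubbing.
# Column names, the stratification key, and the fraction are assumptions.
import pandas as pd
from faker import Faker

fake = Faker()

def build_test_sample(prod_df: pd.DataFrame, frac: float = 0.01) -> pd.DataFrame:
    # Stratified random sample: keep the same proportion of each plan tier.
    sample = (
        prod_df.groupby("plan", group_keys=False)
        .apply(lambda g: g.sample(frac=frac, random_state=42))
        .reset_index(drop=True)
    )

    # Replace PII with synthetic values from Faker.
    sample["name"] = [fake.name() for _ in range(len(sample))]
    sample["email"] = [fake.email() for _ in range(len(sample))]

    # Shuffle a remaining column independently so no row maps back to a real record.
    sample["signup_date"] = (
        sample["signup_date"].sample(frac=1, random_state=7).reset_index(drop=True)
    )
    return sample
```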
How much test data do you need?
For most applications, you'll only need a test dataset that's a fraction of the size of your production dataset. This test data should account for any and all edge and corner cases but still look pretty "normal" against your prod data. That might mean anywhere from a few hundred entries up to maybe a gigabyte of data in total.
Remember, larger datasets mean longer test times. If you’re running your test suite multiple times per day, you’ll want to spare minutes of runtime where you can.
The amount of test data you need might be smaller than you think. The best way to find the sweet spot when it comes to your dataset’s size is to start small, and add more entries when you need to.
Cleaning your test data
While this likely won't be an issue if you're the one creating the test data, you'll want to double-check that your data is "clean", meaning it's high-quality and error-free.
The most important things to do here are getting your data's range correct (adjusting scale so it actually reflects production), filling in any missing datapoints, fixing entry errors, and harmonizing data (making strings consistent). You can automate most of these processes with data processing libraries like pandas.
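For instance, a short pandas sketch of those cleanup steps might look like this (the "country", "age", and "signup_date" columns are hypothetical):

```python
# Illustrative cleanup pass with pandas; column names are assumptions.
import pandas as pd

def clean_test_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Harmonize strings: trim whitespace and normalize casing.
    df["country"] = df["country"].str.strip().str.title()

    # Fill in missing datapoints with a typical value.
    df["age"] = df["age"].fillna(df["age"].median())

    # Fix obvious entry errors by clamping values to a realistic range.
    df["age"] = df["age"].clip(lower=0, upper=120)

    # Make date formats consistent; unparseable entries become NaT for review.
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

    return df
```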
Test data: an art and a science
There are a lot of unanswered questions out there when it comes to creating test data from scratch. In short, you’ll want to remember a few things:
- Keep it brief: you don’t need terabytes of data to capture the intricacies of your application’s behavior
- Use automation: regardless of how you create your data, you can use automation to your advantage — it’s time-consuming and difficult to get right by hand
- Clean it up: make sure your data is actually high-quality and formatted/scaled correctly before testing with it
If you want to test earlier and more often, give Shipyard a try. You can easily integrate your test data with full-stack ephemeral environments for frequent and reliable testing against production-like infrastructure.