Mock Data Best Practices for Software Testing
Proven strategies for generating test data that catches real bugs, supports realistic demos, and scales with your application.
Match Your Schema to Production
The most common mistake with mock data is generating fields that do not match the real schema. If your production users table has a created_at timestamp, a nullable middle_name, and a unique email constraint, your test data should reflect all of these properties.
Before generating mock data, export your production schema (without data) and use it as the template. This ensures your test data exercises the same constraints, data types, and relationships that exist in production.
# Export schema only (PostgreSQL)
pg_dump --schema-only --no-owner mydb > schema.sql
# Use the schema to inform your mock data fields:
# users: id (UUID, unique), email (VARCHAR, unique, NOT NULL),
#   name (VARCHAR), middle_name (VARCHAR, nullable),
#   created_at (TIMESTAMP, NOT NULL)
Use Realistic Nullable Percentages
In production databases, optional fields are neither always empty nor always filled. A phone_number field might be null for 30% of users, while middle_name might be null for 60%. Setting realistic null percentages in your mock data exposes bugs in code that assumes a field is always present.
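As a minimal sketch of per-field null rates, the helper below returns null with a configurable probability. The names maybeNull, NULL_RATES, and mockUser are hypothetical, the placeholder values are made up, and the rates simply reuse the illustrative 30% and 60% figures from the paragraph above. The random source is passed in as a parameter (an assumption of this sketch) so that tests can substitute a seeded or stubbed generator.

```javascript
// Hypothetical helper: returns null with probability `nullRate`,
// otherwise the value produced by `generate`.
// `rng` is injectable so tests can pass a seeded or stubbed source.
function maybeNull(generate, nullRate, rng = Math.random) {
  if (nullRate < 0 || nullRate > 1) {
    throw new RangeError('nullRate must be between 0 and 1');
  }
  return rng() < nullRate ? null : generate();
}

// Illustrative rates only; replace them with rates measured in production.
const NULL_RATES = { phone_number: 0.3, middle_name: 0.6 };

// Builds one mock user whose optional fields are null at the configured rates.
function mockUser(rng = Math.random) {
  return {
    phone_number: maybeNull(() => '555-0100', NULL_RATES.phone_number, rng),
    middle_name: maybeNull(() => 'Lee', NULL_RATES.middle_name, rng),
  };
}

console.log(mockUser());
```

Keeping the rates in one object makes it straightforward to update them when the production figures change.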
A common pattern is to check your production data for actual null rates and replicate them:
-- Check null rates in production (add one expression per optional column)
SELECT
  COUNT(*) AS total,
  ROUND(100.0 * (COUNT(*) - COUNT(phone_number)) / NULLIF(COUNT(*), 0), 1) AS phone_number_null_pct,
  ROUND(100.0 * (COUNT(*) - COUNT(middle_name)) / NULLIF(COUNT(*), 0), 1) AS middle_name_null_pct
FROM users;
Enforce Unique Constraints
If a column has a unique constraint in production, your mock data must respect it. Generating duplicate email addresses or UUIDs will cause INSERT failures and mask real issues. Good mock data generators handle uniqueness automatically, retrying with different values when collisions occur.
Be especially careful with unique constraints when generating large datasets. Collision probability grows roughly with the square of the row count (the birthday problem), so a short text field such as a username that rarely collides at 100 rows may collide often at 10,000. Always verify that your generator can scale to your target row count without exhausting the value space.
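The retry-on-collision behavior and the value-space check can be sketched as follows. Both functions are hypothetical illustrations rather than the API of any particular generator: uniqueValues retries on duplicates and fails loudly after a run of consecutive collisions, and collisionProbability applies the standard birthday-problem approximation, assuming values are drawn uniformly at random.

```javascript
// Hypothetical helper: collect `count` distinct values from `generate`,
// retrying on collisions and throwing if the value space looks exhausted.
function uniqueValues(generate, count, maxConsecutiveCollisions = 100) {
  const seen = new Set();
  let collisions = 0;
  while (seen.size < count) {
    const value = generate();
    if (seen.has(value)) {
      collisions += 1;
      if (collisions >= maxConsecutiveCollisions) {
        throw new Error(
          `Gave up after ${collisions} consecutive collisions ` +
          `with ${seen.size} of ${count} values generated`
        );
      }
      continue;
    }
    seen.add(value);
    collisions = 0; // reset after every successful draw
  }
  return [...seen];
}

// Birthday-problem approximation: P(collision) ~ 1 - exp(-n(n-1) / (2N)),
// for n rows drawn uniformly from N possible values.
function collisionProbability(rows, valueSpace) {
  return 1 - Math.exp(-(rows * (rows - 1)) / (2 * valueSpace));
}

console.log(collisionProbability(100, 1e6));    // about 0.5%
console.log(collisionProbability(10000, 1e6));  // effectively 1
```

With one million possible values, a collision is unlikely at 100 rows but all but certain at 10,000, which is why the retry loop needs an explicit give-up condition.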
Test with Multiple Locales
If your application supports international users, generate mock data in multiple locales. Names, addresses, phone numbers, and date formats vary significantly across cultures. A field that comfortably fits "John Smith" might overflow with a Thai or Arabic name.
Key areas where locale matters:
- Character encoding: Does your database handle multi-byte UTF-8 characters (CJK, Arabic, emoji)?
- String length: German compound words can be much longer than their English equivalents, and CJK text takes more bytes per character in UTF-8 even when the character count is small.
- Format validation: Postal codes, phone numbers, and national ID formats differ by country.
- Sorting and collation: Alphabetical sorting differs across locales (e.g., ä sorts differently in German vs. Swedish).
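To make the length and collation points concrete, here is a small sketch. The measure helper and the sample strings are hypothetical. It reports three different lengths for the same text, because a UI, an application, and a database may each enforce a different one, and it sorts the same letters with German and Swedish collators (this assumes the runtime ships ICU locale data, as default Node.js builds do).

```javascript
// Hypothetical helper: report the lengths that different layers may enforce.
function measure(text) {
  return {
    utf16Units: text.length,                          // what JavaScript's .length counts
    codePoints: [...text].length,                     // Unicode code points
    utf8Bytes: new TextEncoder().encode(text).length, // bytes when encoded as UTF-8
  };
}

const samples = ['John Smith', 'สมชาย', '👍'];
for (const s of samples) {
  console.log(s, measure(s));
}

// Collation: in German, ä sorts next to a; in Swedish, it sorts after z.
const letters = ['z', 'ä', 'a'];
const german = [...letters].sort(new Intl.Collator('de').compare);
const swedish = [...letters].sort(new Intl.Collator('sv').compare);
console.log({ german, swedish });
```

Which of the three lengths matters depends on where the limit is enforced, so it is worth checking each layer's documentation rather than assuming they agree.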
Design Relational Data Carefully
When generating data for multiple related tables, the order of generation matters. Parent tables must be populated before child tables so that foreign key references are valid. A well-structured relational mock dataset follows these rules:
- Generate parents first: Users before orders, products before order items.
- Reference real IDs: Foreign keys in child tables should point to actual IDs from the parent table, not random values.
- Maintain cardinality: If a typical user has 3-5 orders, generate that distribution rather than giving every user exactly one order.
- Preserve business rules: An order's total should match the sum of its line items. A subscription's end_date should be after its start_date.
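The rules above can be sketched for a users, orders, and order items hierarchy. Everything here is an illustrative assumption: the table and field names, the 1 to 4 items per order, the price range, and the choice to store money as integer cents so that totals add up exactly.

```javascript
// Hypothetical helper: random integer in [min, max], using an injectable rng.
function randInt(min, max, rng) {
  return min + Math.floor(rng() * (max - min + 1));
}

// Generates users, then orders, then order items, in dependency order.
function generateDataset(userCount, rng = Math.random) {
  const users = [];
  const orders = [];
  const orderItems = [];

  // Parents first, so every child can reference an ID that exists.
  for (let id = 1; id <= userCount; id++) {
    users.push({ id });
  }

  let nextOrderId = 1;
  for (const user of users) {
    const orderCount = randInt(3, 5, rng); // cardinality: 3-5 orders per user
    for (let o = 0; o < orderCount; o++) {
      const order = { id: nextOrderId++, user_id: user.id, total_cents: 0 };
      const itemCount = randInt(1, 4, rng);
      for (let i = 0; i < itemCount; i++) {
        const item = {
          order_id: order.id, // references a real parent ID
          quantity: randInt(1, 3, rng),
          unit_price_cents: randInt(100, 5000, rng),
        };
        // Business rule: the order total equals the sum of its line items.
        order.total_cents += item.quantity * item.unit_price_cents;
        orderItems.push(item);
      }
      orders.push(order);
    }
  }
  return { users, orders, orderItems };
}

const data = generateDataset(3);
console.log(data.orders.length, 'orders for', data.users.length, 'users');
```

Deriving the total from the generated items, rather than generating it independently, keeps the business rule true by construction.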
Seed for Reproducibility
Always use a seed value when generating mock data for tests. A seeded generator produces the same output every time, which means:
- Test failures are reproducible across different machines and CI runs.
- You can share a failing test case with just the seed value.
- Snapshot tests remain stable between runs.
- Code reviews can verify expected output against actual output.
// JavaScript example with Faker.js
import { faker } from '@faker-js/faker';
// Same seed = same data every time
faker.seed(42);
const user = {
  name: faker.person.fullName(), // Same name on every run (exact value depends on the Faker version)
  email: faker.internet.email(), // Always the same email
};
Size Your Datasets Appropriately
Different test scenarios need different dataset sizes:
- Unit tests (5-20 rows): Small datasets that cover specific logic paths. Fast to generate, easy to reason about.
- Integration tests (100-1,000 rows): Enough data to test pagination, filtering, sorting, and aggregation queries.
- Performance tests (10,000+ rows): Large datasets that reveal query performance issues, memory leaks, and rendering bottlenecks.
- Demo environments (50-500 rows): Enough to look realistic without overwhelming the UI.
Include Edge Cases Deliberately
Beyond typical data, your mock dataset should include known edge cases:
- Empty strings vs. null values
- Very long strings (e.g., a 500-character product description)
- Special characters: quotes, ampersands, angle brackets, backslashes
- Unicode edge cases: emoji, right-to-left text, zero-width characters
- Boundary dates: epoch 0, far-future dates, leap day (Feb 29)
- Numeric extremes: zero, negative numbers, very large numbers, floating point precision
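One way to keep these cases from being forgotten is to hold them in a fixed catalogue that is appended to every generated dataset. The sketch below is an assumption about how such a catalogue might look; the edgeCases name and the specific sample values are illustrative.

```javascript
// Hypothetical catalogue of deliberate edge cases, grouped by type.
const edgeCases = {
  strings: [
    '',                          // empty string (distinct from null)
    null,                        // missing value
    'x'.repeat(500),             // very long string
    `"quotes" & <tags> \\ '`,    // quotes, ampersand, angle brackets, backslash
    '😀',                        // emoji (outside the Basic Multilingual Plane)
    '\u05e9\u05dc\u05d5\u05dd',  // right-to-left text (Hebrew)
    'a\u200bb',                  // zero-width space hidden between a and b
  ],
  dates: [
    new Date(0),                      // Unix epoch
    new Date(Date.UTC(2024, 1, 29)),  // leap day (months are zero-based)
    new Date(Date.UTC(9999, 11, 31)), // far-future date
  ],
  numbers: [
    0,
    -1,
    Number.MAX_SAFE_INTEGER, // largest integer a double represents exactly
    0.1 + 0.2,               // floating point: not exactly 0.3
  ],
};

console.log(edgeCases.strings.length, 'string cases');
console.log(edgeCases.dates.map((d) => d.toISOString()));
```

Because the catalogue is static, the same edge cases appear in every run regardless of the seed used for the randomly generated rows.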
For inspecting how your application handles unusual Unicode characters, our Character Inspector can identify invisible or potentially dangerous characters in your test strings.
Try It Yourself
Ready to generate test data? Our Mock Data Generator supports all the best practices described above: nullable percentages, unique constraints, locale selection, relational schemas, and export to JSON, CSV, SQL, and more.