Mock Data Best Practices for Software Testing

Proven strategies for generating test data that catches real bugs, supports realistic demos, and scales with your application.

Match Your Schema to Production

A common mistake with mock data is generating fields that do not match the real schema. If your production users table has a created_at timestamp, a nullable middle_name, and a unique email constraint, your test data should reflect all of these properties.

Before generating mock data, export your production schema (without data) and use it as the template. This ensures your test data exercises the same constraints, data types, and relationships that exist in production.

# Export schema only (PostgreSQL); run this in a shell, not in psql
pg_dump --schema-only --no-owner mydb > schema.sql

# Use the schema to inform your mock data fields:
# users: id (UUID, unique), email (VARCHAR, unique, NOT NULL),
#        name (VARCHAR), middle_name (VARCHAR, nullable),
#        created_at (TIMESTAMP, NOT NULL)
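One way to keep generated rows aligned with the schema is to encode the field notes as a small descriptor and check each batch against it. The sketch below is illustrative only: `usersSchema` and `validateRows` are hypothetical names, `name` is treated as nullable because the notes do not mark it NOT NULL, and the checks cover only missing columns, nullability, and uniqueness, not data types.

```javascript
// Hypothetical descriptor mirroring the users table notes.
const usersSchema = {
  id: { nullable: false, unique: true },
  email: { nullable: false, unique: true },
  name: { nullable: true, unique: false }, // assumption: no NOT NULL in the notes
  middle_name: { nullable: true, unique: false },
  created_at: { nullable: false, unique: false },
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Returns a list of human-readable violations for a batch of rows.
function validateRows(schema, rows) {
  const problems = [];
  const seen = {}; // column name -> Set of values already used
  rows.forEach((row, i) => {
    for (const [column, rule] of Object.entries(schema)) {
      if (!has(row, column)) {
        problems.push(`row ${i}: missing column ${column}`);
        continue;
      }
      const value = row[column];
      if (value === null) {
        if (!rule.nullable) problems.push(`row ${i}: ${column} must not be null`);
        continue;
      }
      if (rule.unique) {
        seen[column] = seen[column] || new Set();
        if (seen[column].has(value)) {
          problems.push(`row ${i}: duplicate ${column} ${value}`);
        }
        seen[column].add(value);
      }
    }
    for (const column of Object.keys(row)) {
      if (!has(schema, column)) problems.push(`row ${i}: unknown column ${column}`);
    }
  });
  return problems;
}

// Usage: the second row repeats an email and has no created_at value.
const sampleRows = [
  { id: "u1", email: "a@example.com", name: "Ann", middle_name: null, created_at: "2024-01-01T00:00:00Z" },
  { id: "u2", email: "a@example.com", name: "Bob", middle_name: "Lee", created_at: null },
];
console.log(validateRows(usersSchema, sampleRows));
```

Running a check like this right after generation surfaces schema drift before the data reaches a database.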

Use Realistic Nullable Percentages

In production databases, optional fields are neither always empty nor always filled. A phone_number field might be null for 30% of users, while middle_name might be null for 60%. Setting realistic nullable percentages in your mock data reveals bugs in code that assumes fields are always present.

A common pattern is to check your production data for actual null rates and replicate them:

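-- Check null rates in production (one pair of expressions per optional column)
SELECT
  COUNT(*) AS total,
  COUNT(phone_number) AS phone_non_null,
  ROUND(100.0 * (COUNT(*) - COUNT(phone_number)) / NULLIF(COUNT(*), 0), 1) AS phone_null_pct,
  COUNT(middle_name) AS middle_name_non_null,
  ROUND(100.0 * (COUNT(*) - COUNT(middle_name)) / NULLIF(COUNT(*), 0), 1) AS middle_name_null_pct
FROM users;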
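Once you know the null rates, applying them during generation can be as simple as a per-column coin flip. The following is a minimal sketch rather than any particular generator's behavior: `nullRates` reuses the 30% and 60% figures from the example above, `applyNullRates` is a hypothetical helper, and the small `mulberry32` function is a hand-rolled seeded generator used only to make the run repeatable.

```javascript
// Small seeded PRNG (mulberry32) so the same seed yields the same rows.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed null rates, taken from the example figures in the text.
const nullRates = { phone_number: 0.3, middle_name: 0.6 };

// Returns a copy of the row with each listed column nulled at its rate.
function applyNullRates(row, rates, random) {
  const out = { ...row };
  for (const [column, rate] of Object.entries(rates)) {
    if (random() < rate) out[column] = null;
  }
  return out;
}

const random = mulberry32(42);
const rows = Array.from({ length: 10000 }, (_, i) =>
  applyNullRates({ id: i, phone_number: "555-0100", middle_name: "Lee" }, nullRates, random)
);

// Fraction of rows where the column ended up null.
const observed = (column) =>
  rows.filter((r) => r[column] === null).length / rows.length;

console.log(observed("phone_number").toFixed(2), observed("middle_name").toFixed(2));
```

With enough rows, the observed fractions should land close to the configured rates; small datasets will drift further from them.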

Enforce Unique Constraints

If a column has a unique constraint in production, your mock data must respect it. Generating duplicate email addresses or UUIDs will cause INSERT failures and mask real issues. Good mock data generators handle uniqueness automatically, retrying with different values when collisions occur.

Be especially careful with unique constraints when generating large datasets. The chance of a collision grows roughly with the square of the row count, so 10,000 rows are far more likely to collide in a short text field (like a username) than 100 rows. Always verify that your generator can scale to your target row count without exhausting the value space.
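To make the scaling concern concrete, the sketch below estimates collision risk with the standard birthday approximation and wraps a generator in a retry loop that gives up after a bounded number of attempts. Both `collisionProbability` and `uniqueValues` are hypothetical helpers written for this illustration, and the 3-letter username space is an assumed example.

```javascript
// Approximate probability of at least one duplicate when drawing `rows`
// values uniformly at random from `space` possible values.
function collisionProbability(rows, space) {
  return 1 - Math.exp(-(rows * (rows - 1)) / (2 * space));
}

// Collects `count` distinct values, retrying on duplicates and failing
// loudly if the value space appears to be exhausted.
function uniqueValues(generate, count, maxAttemptsPerValue = 100) {
  const seen = new Set();
  while (seen.size < count) {
    let attempts = 0;
    let value = generate();
    while (seen.has(value)) {
      attempts += 1;
      if (attempts >= maxAttemptsPerValue) {
        throw new Error(
          `could not find a new unique value after ${attempts} retries ` +
            `(${seen.size} of ${count} generated)`
        );
      }
      value = generate();
    }
    seen.add(value);
  }
  return [...seen];
}

// Usage: 3-letter lowercase usernames allow only 26^3 = 17,576 values.
// Math.random keeps the example short; a seeded generator would make it repeatable.
const randomUsername = () =>
  Array.from({ length: 3 }, () =>
    String.fromCharCode(97 + Math.floor(Math.random() * 26))
  ).join("");

console.log(collisionProbability(100, 26 ** 3)); // about 0.25
console.log(collisionProbability(10000, 26 ** 3)); // effectively 1
const usernames = uniqueValues(randomUsername, 1000);
console.log(usernames.length); // 1000
```

The bounded retry matters: without it, asking for more unique values than the space can hold would loop forever instead of reporting the problem.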

Test with Multiple Locales

If your application supports international users, generate mock data in multiple locales. Names, addresses, phone numbers, and date formats vary significantly across cultures. A field that comfortably fits "John Smith" might overflow with a Thai or Arabic name.

Key areas where locale matters:

Character encoding: Does your database handle multi-byte UTF-8 characters (CJK, Arabic, emoji)?
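String length: German compound words can be much longer than their English equivalents, and CJK characters typically take three bytes each in UTF-8, so byte-limited columns fill up sooner than the character count suggests.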
Format validation: Postal codes, phone numbers, and national ID formats differ by country.
Sorting and collation: Alphabetical sorting differs across locales (e.g., ä sorts differently in German vs. Swedish).
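The encoding and length points are easy to check directly. The sketch below compares three ways of measuring the same string; `measure` and `fitsInBytes` are hypothetical helpers, and whether a column limit counts bytes or characters depends on your database and column type, so treat the byte check as one possible scenario.

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8

// Reports the three "lengths" that most often disagree.
function measure(text) {
  return {
    codeUnits: text.length, // UTF-16 code units, what String.length reports
    codePoints: [...text].length, // Unicode code points
    utf8Bytes: encoder.encode(text).length, // storage size in UTF-8
  };
}

// Would the value fit a column limited to `maxBytes` bytes of UTF-8?
function fitsInBytes(text, maxBytes) {
  return encoder.encode(text).length <= maxBytes;
}

const samples = [
  "John Smith", // ASCII: all three measures agree
  "M\u00fcller", // Müller: one 2-byte character
  "\u5c71\u7530\u592a\u90ce", // 山田太郎: 4 characters, 3 bytes each
  "\u{1F600}", // emoji: 1 code point, 2 code units, 4 bytes
];

for (const s of samples) {
  console.log(s, measure(s), "fits in 10 bytes:", fitsInBytes(s, 10));
}
```

A four-character Japanese name fails the 10-byte check even though "John Smith" passes, which is the kind of mismatch locale-aware mock data is meant to expose.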

Design Relational Data Carefully

When generating data for multiple related tables, the order of generation matters. Parent tables must be populated before child tables so that foreign key references are valid. A well-structured relational mock dataset follows these rules:

Generate parents first: Users before orders, products before order items.
Reference real IDs: Foreign keys in child tables should point to actual IDs from the parent table, not random values.
Maintain cardinality: If a typical user has 3-5 orders, generate that distribution rather than giving every user exactly one order.
Preserve business rules: An order's total should match the sum of its line items. A subscription's end_date should be after its start_date.
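The four rules above can be sketched in a few lines. Everything here is an assumed toy model: the table shapes, the `price_cents` and `total_cents` fields, and the 1-3 line items per order are invented for illustration, and amounts are kept in integer cents so that totals can be compared exactly.

```javascript
// Small seeded PRNG (mulberry32) so the dataset is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const random = mulberry32(7);
// Inclusive integer range helper.
const randomInt = (min, max) => min + Math.floor(random() * (max - min + 1));

// 1. Parents first: users and products exist before anything references them.
const users = Array.from({ length: 5 }, (_, i) => ({ id: i + 1, name: `User ${i + 1}` }));
const products = [
  { id: 1, price_cents: 1999 },
  { id: 2, price_cents: 4950 },
  { id: 3, price_cents: 250 },
];

const orders = [];
const orderItems = [];

for (const user of users) {
  // 3. Cardinality: each user gets 3-5 orders rather than exactly one.
  const orderCount = randomInt(3, 5);
  for (let n = 0; n < orderCount; n++) {
    // 2. Real IDs: user_id comes from an actual parent row.
    const order = { id: orders.length + 1, user_id: user.id, total_cents: 0 };
    const itemCount = randomInt(1, 3);
    for (let k = 0; k < itemCount; k++) {
      const product = products[randomInt(0, products.length - 1)];
      const quantity = randomInt(1, 4);
      orderItems.push({
        order_id: order.id,
        product_id: product.id,
        quantity,
        unit_price_cents: product.price_cents,
      });
      // 4. Business rule: the order total is the sum of its line items.
      order.total_cents += quantity * product.price_cents;
    }
    orders.push(order);
  }
}

console.log(users.length, orders.length, orderItems.length);
```

Inserting in the same order (users, products, orders, order items) keeps foreign key constraints satisfied when the rows are loaded.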

Seed for Reproducibility

Always use a seed value when generating mock data for tests. A seeded generator produces the same output every time, which means:

Test failures are reproducible across different machines and CI runs.
You can share a failing test case with just the seed value.
Snapshot tests remain stable between runs.
Code reviews can verify expected output against actual output.

// JavaScript example with Faker (@faker-js/faker)
import { faker } from '@faker-js/faker';

// Same seed = same data every time (for a given Faker version)
faker.seed(42);
const user = {
  name: faker.person.fullName(),   // Same name on every run
  email: faker.internet.email(),   // Same email on every run
};

Size Your Datasets Appropriately

Different test scenarios need different dataset sizes:

Unit tests (5-20 rows): Small datasets that cover specific logic paths. Fast to generate, easy to reason about.
Integration tests (100-1,000 rows): Enough data to test pagination, filtering, sorting, and aggregation queries.
Performance tests (10,000+ rows): Large datasets that reveal query performance issues, memory leaks, and rendering bottlenecks.
Demo environments (50-500 rows): Enough to look realistic without overwhelming the UI.

Include Edge Cases Deliberately

Beyond typical data, your mock dataset should include known edge cases:

Empty strings vs. null values
Very long strings (e.g., a 500-character product description)
Special characters: quotes, ampersands, angle brackets, backslashes
Unicode edge cases: emoji, right-to-left text, zero-width characters
Boundary dates: epoch 0, far-future dates, leap day (Feb 29)
Numeric extremes: zero, negative numbers, very large numbers, floating point precision
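A fixed set of edge-case fixtures, appended to whatever the generator produces, makes sure these cases are always present. The list below is a starting sketch under assumed names (`edgeCases` is hypothetical); the specific values are examples of each category above, not an exhaustive set.

```javascript
const edgeCases = {
  strings: [
    "", // empty string, distinct from null
    null, // missing value
    "x".repeat(500), // very long string
    `O'Brien & "Sons" <tag> \\`, // quotes, ampersand, angle brackets, backslash
    "\u{1F600}", // emoji (outside the Basic Multilingual Plane)
    "\u0645\u0631\u062d\u0628\u0627", // right-to-left Arabic text
    "zero\u200bwidth", // contains an invisible zero-width space
  ],
  dates: [
    new Date(0), // Unix epoch
    new Date(Date.UTC(2024, 1, 29)), // leap day (months are 0-based)
    new Date(Date.UTC(9999, 11, 31)), // far-future date
  ],
  numbers: [
    0,
    -1,
    Number.MAX_SAFE_INTEGER, // largest exactly representable integer
    0.1 + 0.2, // floating point precision: not exactly 0.3
  ],
};

// Print each case in a form that makes invisible characters visible.
for (const s of edgeCases.strings) console.log(JSON.stringify(s));
for (const d of edgeCases.dates) console.log(d.toISOString());
for (const n of edgeCases.numbers) console.log(n);
```

Keeping the fixtures separate from randomly generated rows means they survive a change of seed or dataset size.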

For inspecting how your application handles unusual Unicode characters, our Character Inspector can identify invisible or potentially dangerous characters in your test strings.

Try It Yourself

Ready to generate test data? Our Mock Data Generator supports all the best practices described above: nullable percentages, unique constraints, locale selection, relational schemas, and export to JSON, CSV, SQL, and more.