PyPI Package v0.5.2 (Latest)

The Open-Source Engine

High-performance synthetic data generation. LLM-powered schema design. Vectorized NumPy execution. MIT licensed.

terminal
$ pip install misata
Quick Start

Three lines to realistic data

From zero to production-grade synthetic data in under 60 seconds.

01

Install

One dependency. No Docker, no config files.

pip install misata
02

Generate

Describe your schema in plain English.

misata generate \
  --story "E-commerce with users, products, and orders" \
  --use-llm
03

Use

Relational CSVs with referential integrity.

./generated_data/
├── users.csv (1,000)
├── products.csv (500)
├── orders.csv (2,500)
└── order_items.csv (5,000)
Capabilities

Not just random values.
Intelligent generation.

AI-Powered

LLM Schema Generation

Describe your data in plain English. Groq-powered AI designs the entire relational schema — tables, columns, types, foreign keys.

Performance

Vectorized NumPy Engine

385,000 rows/second. No row-by-row Python loops. Entire columns generated at once using NumPy vectorization.
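
To make "entire columns generated at once" concrete, here is a minimal sketch of vectorized column generation in plain NumPy. It is not Misata's internal code; the column names and distributions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_rows = 100_000

# Each call fills a whole column in one shot; there is no per-row Python loop.
user_id = np.arange(1, n_rows + 1)
age = rng.integers(18, 80, size=n_rows)              # uniform ages in [18, 80)
balance = rng.lognormal(mean=6.0, sigma=1.0, size=n_rows).round(2)
is_active = rng.random(n_rows) < 0.85                # roughly 85% active

# Hypothetical column set, ready to hand to a DataFrame or CSV writer.
columns = {
    "user_id": user_id,
    "age": age,
    "balance": balance,
    "is_active": is_active,
}
```

The cost of each column is a single call into compiled NumPy code, which is why throughput scales far better than an interpreted loop over rows.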

Integration

SQLAlchemy Seeding

Point at your existing database. Misata introspects models, respects foreign keys, and seeds production-grade data.

Realism

Business Constraints

Define rules like max_daily_hours=8 or cost < price. Misata enforces them mathematically, not by retry.
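
One way to enforce rules "mathematically, not by retry" is to construct values that cannot violate them. The sketch below illustrates that idea with NumPy only. It does not use Misata's API, and the margin range, the four-task layout, and the scaling strategy are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n = 1_000

# cost < price by construction: draw price first, then cost as a fraction of it.
price = rng.uniform(5.0, 500.0, size=n).round(2)
cost = (price * rng.uniform(0.30, 0.70, size=n)).round(2)

# Daily hours capped at 8: draw per-task hours for each worker-day,
# then scale down any day whose total exceeds the cap.
max_daily_hours = 8.0
hours = rng.uniform(0.5, 4.0, size=(n, 4))           # assumed 4 tasks per day
daily_total = hours.sum(axis=1, keepdims=True)
scale = np.minimum(1.0, max_daily_hours / daily_total)
hours = hours * scale
```

Because no row is ever rejected and redrawn, the running time is the same whether the constraint is loose or tight.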

Intelligence

Smart Row Proportions

15 categories, 100 users, 250 orders, 500 line items. Misata analyzes your FK graph to size tables realistically.
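
Sizing tables from the FK graph can be pictured as multiplying each parent's row count by an expected number of children per parent. The sketch below is a hypothetical illustration, not Misata's algorithm; `size_tables` and the `fan_out` ratios are assumed names and values chosen to reproduce the counts above.

```python
# Root tables get explicit sizes; child tables are sized from their parent.
root_sizes = {"categories": 15, "users": 100}

# child table -> (parent table, assumed average child rows per parent row)
fan_out = {
    "orders": ("users", 2.5),
    "order_items": ("orders", 2.0),
}

def size_tables(root_sizes, fan_out):
    sizes = dict(root_sizes)
    pending = dict(fan_out)
    # Repeatedly resolve children whose parent is already sized.
    while pending:
        ready = [c for c, (parent, _) in pending.items() if parent in sizes]
        if not ready:
            raise ValueError("unresolvable parents: " + ", ".join(pending))
        for child in ready:
            parent, ratio = pending.pop(child)
            sizes[child] = round(sizes[parent] * ratio)
    return sizes

sizes = size_tables(root_sizes, fan_out)
print(sizes)
# {'categories': 15, 'users': 100, 'orders': 250, 'order_items': 500}
```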

Core

Realism Engine v0.5

Every column is aware of every other column. Emails match names. Totals = subtotal + tax + shipping. Always.
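
Column awareness means dependent columns are derived from other columns rather than drawn independently. A minimal NumPy sketch of the two examples above, assuming a small name pool, an 8% tax rate, and three shipping tiers (all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 5

first = rng.choice(["emma", "luke", "priya", "omar"], size=n)
last = rng.choice(["chen", "rivera", "patel", "haddad"], size=n)

# Email is built from the name columns instead of being generated separately.
email = np.char.add(np.char.add(first, "."), np.char.add(last, "@example.com"))

# Work in integer cents so the total is exactly the sum of its parts.
subtotal = rng.integers(1_000, 100_000, size=n)
tax = (subtotal * 8) // 100                  # assumed 8% tax rate
shipping = rng.choice([0, 499, 999], size=n) # assumed shipping tiers, in cents
total = subtotal + tax + shipping
```

Using integer cents avoids the floating-point drift that would otherwise make a displayed total disagree with its parts by a cent.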

Side-by-Side

Faker generates random noise.
Misata generates truth.

Field | Faker / Random | Misata v0.5.2
order.total | $847.23 (random) | $847.23 = $798.50 + $29.99 + $18.74
product.cost | $96.00 (> price!) | $41.20 (43% of price $95.81)
line_total | $3,291.00 (random) | $3,291.00 = 5 × $662.00 − $19.00
user.email | luke.ri@wanadoo.co.uk | emma.chen@gmail.com (from name)
rating | 137 (wat?) | 4 ★ (J-curve weighted)
delivered_at | 2021-01-03 (before order!) | 2024-03-15 (+7 days after order)
row counts | 100 × every table | 15 cat, 100 users, 500 items
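
Two rows of the comparison, the rating and the delivery date, come down to sampling from a bounded, weighted distribution and adding a positive offset to a base date. The weights and the 1 to 10 day delivery window below are assumptions for illustration, not values taken from Misata.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n = 10_000

# J-curve: most ratings are 5 stars, with a small bump at 1 star.
stars = np.array([1, 2, 3, 4, 5])
weights = np.array([0.10, 0.05, 0.10, 0.25, 0.50])   # assumed weights, sum to 1
rating = rng.choice(stars, size=n, p=weights)

# Temporal ordering: delivery is the order date plus a strictly positive offset.
start = np.datetime64("2024-01-01")
ordered_at = start + rng.integers(0, 365, size=n).astype("timedelta64[D]")
delivered_at = ordered_at + rng.integers(1, 11, size=n).astype("timedelta64[D]")
```

A rating of 137 or a delivery before the order cannot occur here, because neither value is inside the space being sampled.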
Python API

First-class Python integration

Drop into any test suite, CI/CD pipeline, or data workflow.

LLM-Powered Generation

generate.py
from misata import DataSimulator
from misata.llm_parser import LLMSchemaGenerator

# 1. Design schema with AI
llm = LLMSchemaGenerator(provider="groq")
config = llm.generate_from_story(
    "Healthcare app with patients, "
    "doctors, and appointments"
)

# 2. Generate data
sim = DataSimulator(config)
for name, df in sim.generate_all():
    print(f"✓ {name}: {len(df)} rows")
    df.to_csv(f"{name}.csv", index=False)

SQLAlchemy Seeding

seed.py
from misata import seed_from_sqlalchemy_models
from myapp.models import Base, engine

# Automatically analyzes your models
# and foreign keys
report = seed_from_sqlalchemy_models(
    engine, Base,
    default_rows=10_000,
    create=True,
    smart_mode=True
)

print(f"Seeded {report.total_rows} rows")
print(f"Duration: {report.duration_seconds}s")
print(f"Tables: {report.table_count}")
Changelog

Built in public, shipped fast

7 releases in 6 months. From genesis to realism engine.

v0.5.2 (latest), Feb 2026

Business Constraints & Performance

  • Custom constraint engine (sum_limit, redistribute)
  • Performance optimizations across all generators
  • Improved CLI output formatting
v0.5.1, Feb 2026

SQLAlchemy Seeding & Smart Mode

  • Direct SQLAlchemy model introspection
  • Smart row proportions from FK graph analysis
  • Database truncate + seed in one command
v0.5.0, Jan 2026

The Realism Engine

  • Column-aware generation (every column knows about every other)
  • Mathematical consistency: total = subtotal + tax + shipping
  • Email-from-name, cost < price, temporal ordering
  • J-curve rating distributions
v0.3.1 β, Dec 2025

Stability & Bug Fixes

  • Fixed edge cases in FK resolution
  • Improved error messages for invalid schemas
  • Test coverage improvements
v0.3.0 β, Nov 2025

Foreign Key Graph

  • Topological sort for table generation order
  • FK-aware value referencing
  • Multi-level relationship support
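
Ordering tables so that parents are generated before their children is a standard topological sort. The sketch below uses Python's standard-library `graphlib` on a hypothetical e-commerce FK graph; it illustrates the technique and is not Misata's implementation.

```python
from graphlib import TopologicalSorter

# Map each table to the tables it references through foreign keys.
fk_graph = {
    "users": set(),
    "products": set(),
    "orders": {"users"},
    "order_items": {"orders", "products"},
}

# static_order() yields each table only after every table it depends on.
generation_order = list(TopologicalSorter(fk_graph).static_order())
print(generation_order)
```

`TopologicalSorter` raises `CycleError` on circular references, which is a convenient place to report schemas whose foreign keys cannot be satisfied in a single pass.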
v0.2.0 β, Oct 2025

LLM Schema Generation

  • Groq integration for natural language → schema
  • Multiple LLM provider support (Groq, OpenAI)
  • Story-based generation from CLI
v0.1.0 β, Sep 2025

Genesis — Core Engine

  • NumPy-vectorized column generation
  • CLI with --story flag
  • CSV export
  • Basic column type inference
Radical Transparency

Where we are vs. where we're going

We believe in building in public. Here's an honest look at the gap between our current state and our vision. This is why we need collaborators.

CLI-only interface → Visual Schema Studio (Figma-like drag & drop) [shipped]
Single-machine generation → Distributed workers for billion-row datasets [building]
Basic column-name heuristics → ML-powered column classification + value pools [building]
Manual constraint definition → Auto-inferred constraints from sample data [planned]
No privacy controls → PII detection, differential privacy, k-anonymity [planned]
Solo developer workflow → Team workspaces, schema versioning, sharing [planned]
CSV/JSON export → Direct DB seeding + API endpoints + data marketplace [planned]

Start generating in 60 seconds

MIT licensed. No signups. No credit card. Just pip install and go.