Misata: Outcome-Conformant Synthetic Data Generation

Muhammed Rasin

Quick Start

Install

pip install misata

Optional extras:

pip install "misata[mcp]"        # MCP server — use Misata from Claude / Cursor
pip install "misata[llm]"        # Groq / OpenAI / Claude / Gemini / Ollama schema generation
pip install "misata[documents]"  # PDF output via weasyprint
pip install "misata[advanced]"   # SDV/CTGAN statistical synthesis

Your first dataset

Write one sentence. Get back a dict[str, pd.DataFrame] with referential integrity, realistic distributions, and locale-accurate values.

import misata

tables = misata.generate(
    "A SaaS company with 5k users, monthly subscriptions, and 20% churn",
    seed=42,
)

for name, df in tables.items():
    print(f"{name}: {len(df):,} rows")
    print(df.head(3))
    print()
# users:          5,000 rows
# subscriptions:  5,000 rows
# invoices:      20,000 rows

Preview before generating

Use preview() to confirm what Misata understood before committing to a large generation run. It calls no generators and produces no rows.

import misata

report = misata.preview(
    "A fintech startup with 10k customers, 3% fraud rate, and IBAN accounts"
)

print(report.domain)             # "fintech"
print(report.domain_confidence)  # "high"
print(report.matched_keywords)   # ["fintech", "fraud"]
print(report.scale_params)       # {"users": 10000}
print(report.locale)             # None

print(report.table_preview)
# [{"name": "customers", "rows": 10000, "columns": 9},
#  {"name": "accounts",  "rows": 10000, "columns": 6},
#  {"name": "transactions", "rows": 80000, "columns": 8}]

print(report.warnings)           # [] — clean detection

print(report.summary())
# ✓ Domain: fintech  [high]  matched: fintech, fraud
# ✓ Scale: users=10,000
# ✓ Events: 0 detected
#
#   Will generate 3 table(s), 100,000 total rows:
#     customers      10,000 rows  (9 columns)
#     accounts       10,000 rows  (6 columns)
#     transactions   80,000 rows  (8 columns)

DetectionReport fields

Field	Type	Description
`domain`	`str | None`	Detected domain code or `None`
`domain_confidence`	`str`	`"high"` (≥2 keywords), `"low"` (1 keyword), or `"none"` (no keywords matched)
`matched_keywords`	`list[str]`	Keywords that fired for the winning domain
`near_misses`	`dict[str, list[str]]`	Other domains whose keywords also appeared
`scale_params`	`dict[str, int]`	Parsed numeric scale signals
`temporal_events`	`list[dict]`	Growth, churn, crash events detected
`locale`	`str | None`	Auto-detected locale code (e.g. `"de_DE"`)
`table_preview`	`list[dict]`	`[{name, rows, columns}]` for each table
`total_rows`	`int`	Sum of all table row counts
`warnings`	`list[str]`	Fallback and ambiguity warnings
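One way to use these fields is as a gate before a large run. The sketch below is a hypothetical stand-in, not Misata's own class: `ReportSketch`, `safe_to_generate`, and the `max_rows` budget are names invented here for illustration. It only mirrors the documented fields and the documented meaning of `total_rows`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReportSketch:
    """Hypothetical stand-in for a few of the fields above; not Misata's class."""

    domain: Optional[str]
    domain_confidence: str
    table_preview: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def total_rows(self) -> int:
        # Documented as the sum of all table row counts.
        return sum(table["rows"] for table in self.table_preview)


def safe_to_generate(report: ReportSketch, max_rows: int = 1_000_000) -> bool:
    """Proceed only on a confident, warning-free detection within a row budget."""
    return (
        report.domain_confidence == "high"
        and not report.warnings
        and report.total_rows <= max_rows
    )


report = ReportSketch(
    domain="fintech",
    domain_confidence="high",
    table_preview=[
        {"name": "customers", "rows": 10_000, "columns": 9},
        {"name": "accounts", "rows": 10_000, "columns": 6},
        {"name": "transactions", "rows": 80_000, "columns": 8},
    ],
)
print(report.total_rows)         # 100000
print(safe_to_generate(report))  # True
```

With a real report, the same check would presumably read the attributes directly from the object returned by `preview()`.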

Narrative growth curves

Describe a growth trajectory in plain English. Misata builds exact per-month targets and shapes the generated data to match.

import misata

# Ecommerce with seasonal story
tables = misata.generate(
    "Ecommerce store with 10k customers — "
    "revenue from $200k in Jan to $350k in Sep, "
    "Black Friday spike, Christmas peak, Q1 slump after holidays",
    rows=10_000, seed=42
)

# SaaS hockey-stick
tables = misata.generate(
    "SaaS startup with 2k users — MRR $5k in January, 10x growth over the year, "
    "strong Q4 push",
    rows=2000, seed=42
)

# B2B with summer slump
tables = misata.generate(
    "B2B SaaS platform with 1k enterprise customers — "
    "ARR $500k in Jan, doubled by December, summer slump",
    rows=1000, seed=42
)

All five pattern types compose freely:

Pattern type	Example phrase	What it does
Monthly anchor	`"MRR $50k in January"`	Pins exact value for that month
Quarter modifier	`"Q4 spike"`	Boosts Oct/Nov/Dec by 1.3×
Named event	`"Black Friday spike"`	November +1.55×
Multiplier	`"doubled by December"`	End value = 2× start
From–to	`"from $50k to $200k"`	Linear interpolation across the year

Full narrative patterns reference →
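To make the table concrete, here is a minimal sketch of how a from-to phrase and a quarter modifier could combine into twelve monthly targets. It is an illustration of the arithmetic only: `linear_targets` and `apply_boosts` are hypothetical helpers, and how Misata itself resolves overlapping patterns (for example a pinned month that also falls in a boosted quarter) is not specified here.

```python
def linear_targets(start: float, end: float, months: int = 12) -> list:
    """From-to pattern: interpolate linearly from the first to the last month."""
    step = (end - start) / (months - 1)
    return [start + step * i for i in range(months)]


def apply_boosts(targets: list, boosts: dict) -> list:
    """Multiply selected months (index 0 = January) by their boost factor."""
    return [value * boosts.get(i, 1.0) for i, value in enumerate(targets)]


# "from $50k to $200k" combined with a "Q4 spike" (Oct/Nov/Dec at 1.3x).
base = linear_targets(50_000, 200_000)
q4_spike = {9: 1.3, 10: 1.3, 11: 1.3}
shaped = apply_boosts(base, q4_spike)

for month, (before, after) in enumerate(zip(base, shaped), start=1):
    print(f"month {month:2d}: {before:>10,.0f} -> {after:>10,.0f}")
```

January stays at the $50k start value, months before October are untouched, and December ends at 1.3 times the interpolated $200k.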

All 18 domains

Misata ships with schemas for 18 industry verticals:

Domain	Trigger keywords	Tables
`saas`	saas, mrr, arr, churn	users, subscriptions, invoices
`ecommerce`	ecommerce, orders, retail	customers, products, orders, order_items
`fintech`	fintech, payments, fraud	customers, accounts, transactions
`healthcare`	healthcare, patients, clinic	doctors, patients, appointments
`marketplace`	marketplace, sellers, listings	sellers, buyers, listings, orders
`logistics`	logistics, shipping, fleet	drivers, vehicles, routes, shipments
`hr`	hr, employees, payroll	departments, employees, payroll
`social`	social media, followers, feed	users, posts, follows, reactions, comments
`realestate`	real estate, mortgage	agents, properties, transactions
`pharma`	pharma, clinical, trials	researchers, projects, trials, timesheets
`fooddelivery`	food delivery, restaurants	restaurants, customers, couriers, orders
`edtech`	edtech, courses, students	instructors, courses, students, enrollments
`gaming`	gaming, players, leaderboard	players, matches, sessions, achievements
`crm`	crm, contacts, deals	companies, contacts, deals, activities
`crypto`	crypto, blockchain, defi	wallets, tokens, transactions, token_prices
`insurance`	insurance, policy, claims	customers, policies, claims, payments
`travel`	travel, hotel, flights	users, hotels, flights, bookings, reviews
`streaming`	streaming, netflix, subscribers	subscribers, content, watch_history, ratings

Full domain reference with column listings →
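The trigger keywords and the confidence levels described under DetectionReport fields suggest a simple keyword-count detector. The following is a rough sketch under that assumption, not Misata's actual matcher: `detect_domain` is a hypothetical function, only three rows of the table are included, and multi-word triggers such as "real estate" would need phrase matching that this version omits.

```python
import re

# Trigger keywords copied from three rows of the table above.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "mrr", "arr", "churn"],
    "fintech": ["fintech", "payments", "fraud"],
    "ecommerce": ["ecommerce", "orders", "retail"],
}


def detect_domain(story: str):
    """Return (domain, confidence, matched, near_misses) by counting keyword hits."""
    words = set(re.findall(r"[a-z0-9]+", story.lower()))
    hits = {}
    for domain, keywords in DOMAIN_KEYWORDS.items():
        matched = [kw for kw in keywords if kw in words]
        if matched:
            hits[domain] = matched
    if not hits:
        return None, "none", [], {}
    # The domain with the most matched keywords wins; ties go to the first listed.
    winner = max(hits, key=lambda d: len(hits[d]))
    confidence = "high" if len(hits[winner]) >= 2 else "low"
    near_misses = {d: m for d, m in hits.items() if d != winner}
    return winner, confidence, hits[winner], near_misses


print(detect_domain("A fintech startup with 10k customers and a 3% fraud rate"))
print(detect_domain("A SaaS company with high churn and card payments"))
```

The second story matches two `saas` keywords and one `fintech` keyword, so `fintech` lands in the near misses rather than winning.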

Inspect the schema first

import misata

schema = misata.parse("A fintech company with 10k customers")
print(schema.summary())
# Tables: customers, accounts, transactions
# Relationships: customers → accounts → transactions
# Rows: 10,000 / 25,000 / 80,000

tables = misata.generate_from_schema(schema, seed=42)
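The relationship chain in the summary hints at why referential integrity holds: parent tables have to exist before the children that reference them. As a sketch of that idea (an assumption about how such generators are usually ordered, not a description of Misata's internals), the standard library can derive a parent-first order from a hypothetical `parents` mapping:

```python
from graphlib import TopologicalSorter

# Each table maps to the parent tables it references (its foreign-key targets).
# This mapping is written by hand here to match the summary above.
parents = {
    "customers": set(),
    "accounts": {"customers"},
    "transactions": {"accounts"},
}

# static_order() yields every table after all of its parents.
generation_order = list(TopologicalSorter(parents).static_order())
print(generation_order)  # ['customers', 'accounts', 'transactions']
```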

Locale support

Add a country to get locale-accurate names, addresses, national IDs, phone prefixes, and currency-appropriate values:

# German names, IBAN accounts, European address format
tables = misata.generate("A German fintech with 5k customers", seed=42)

# Brazilian locale — CPF national IDs, BRL salaries
tables = misata.generate("A Brazilian HR system with 200 employees", seed=42)

# UK healthcare — NHS numbers, British names
tables = misata.generate("A UK healthcare provider with 1k patients", seed=42)

Localisation reference →
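If you want to spot-check locale-specific values such as IBANs, the standard mod-97 check-digit rule is easy to apply. The helper below is a generic sketch, independent of Misata: `iban_is_valid` is a name chosen here, and the column holding IBANs in the generated tables is not documented in this section, so the example checks literal strings instead.

```python
def iban_is_valid(iban: str) -> bool:
    """Check an IBAN's check digits with the mod-97 rule (country length not checked)."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Letters map to two-digit numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(iban_is_valid("DE89 3704 0044 0532 0130 00"))  # True
print(iban_is_valid("DE88 3704 0044 0532 0130 00"))  # False
```

A fuller validator would also check the per-country length and structure, which this sketch leaves out.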

Export

misata.to_csv(tables, "data/")
misata.to_parquet(tables, "data/")
misata.to_duckdb(tables, "data/dataset.duckdb")
misata.to_jsonl(tables, "data/")
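After exporting, you may want to confirm that foreign keys still resolve in the files on disk. The sketch below uses only the standard library and assumes one CSV per table; the file names, the `user_id` column, and the `orphan_keys` helper are hypothetical and stand in for whatever your export actually contains.

```python
import csv
import tempfile
from pathlib import Path


def orphan_keys(child_csv: Path, fk_column: str, parent_csv: Path, pk_column: str) -> set:
    """Return foreign-key values in the child file that have no matching parent row."""
    with parent_csv.open(newline="") as handle:
        parent_ids = {row[pk_column] for row in csv.DictReader(handle)}
    with child_csv.open(newline="") as handle:
        child_refs = {row[fk_column] for row in csv.DictReader(handle)}
    return child_refs - parent_ids


# Tiny hand-written files stand in for an exported dataset.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    (data / "users.csv").write_text("user_id,name\n1,Ada\n2,Grace\n")
    (data / "subscriptions.csv").write_text("subscription_id,user_id\n10,1\n11,2\n12,3\n")
    missing = orphan_keys(data / "subscriptions.csv", "user_id", data / "users.csv", "user_id")
    print(missing)  # {'3'}
```

An empty set means every child row points at an existing parent.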

CLI

# Generate from a story
misata generate --story "A marketplace with 10k sellers" --rows 10000

# Generate from YAML schema
misata init        # scaffold misata.yaml
misata generate    # reads misata.yaml automatically

# Profile a CSV file
misata validate customers.csv

# Preview what would be generated (no rows)
misata preview --story "A SaaS company with 5k users"

Use from an AI assistant

If you have misata[mcp] installed, wire it into Claude Desktop, Cursor, or Windsurf and describe datasets in plain English:

"Generate a fintech fraud dataset with 10k customers."

"Build me SaaS subscription data with MRR growing from $50k to $200k over the year."

MCP server setup →