Plain-English Generation

The fastest path to multi-table synthetic data: write one sentence, get back a dict of DataFrames with referential integrity, realistic distributions, and locale-accurate values.

import misata

tables = misata.generate(
    "A fintech startup with 10k customers, 3% fraud rate, and IBAN accounts",
    rows=10_000,
    seed=42,
)
# Returns: {"customers": DataFrame, "accounts": DataFrame, "transactions": DataFrame}

What the parser extracts

Misata's StoryParser reads the story and infers four things before a single row is generated:

Signal         Example phrase                          What happens
Domain         "fintech", "saas", "ecommerce"          Selects the domain schema (tables, columns, FK relationships)
Scale          "10k customers", "500 employees"        Sets row counts; child tables scale proportionally
Locale         "German company", "Brazilian fintech"   Applies country-accurate names, salaries, national IDs, phone prefixes
Growth curves  "MRR from $50k in Jan to $200k in Dec"  Shapes numeric distributions to match exact monthly targets

Preview before generating

Use preview() to confirm what Misata understood before committing to a large generation:

import misata

report = misata.preview(
    "A SaaS company with 5k users, MRR from $50k in Jan to $200k in Dec"
)

print(report.domain)            # "saas"
print(report.domain_confidence) # "high"
print(report.matched_keywords)  # ["saas", "mrr"]
print(report.scale_params)      # {"users": 5000}
print(report.locale)            # None (no locale detected)
print(report.table_preview)
# [{"name": "users", "rows": 5000, "columns": 12},
#  {"name": "subscriptions", "rows": 5000, "columns": 8}]
print(report.warnings)          # [] — clean detection

print(report.summary())
# ✓ Domain: saas  [high]  matched: saas, mrr
# ✓ Scale: users=5,000
# ✓ Events: 2 detected
#
#   Will generate 2 table(s), 10,000 total rows:
#     users          5,000 rows  (12 columns)
#     subscriptions  5,000 rows  (8 columns)

preview() calls no generators and produces no data — it is pure inspection.

DetectionReport fields

Field              Type                  Description
domain             str | None            Detected domain code or None
domain_confidence  str                   "high" (≥2 keywords), "low" (1 keyword), "none"
matched_keywords   list[str]             Keywords from the winning domain that appeared in the story
near_misses        dict[str, list[str]]  Other domains whose keywords also appeared
scale_params       dict[str, int]        Parsed numeric scale signals
temporal_events    list[dict]            Growth, churn, crash events detected
locale             str | None            Auto-detected locale code (e.g. "de_DE")
table_preview      list[dict]            [{name, rows, columns}] for every table
total_rows         int                   Sum of all table row counts
warnings           list[str]             Fallback / ambiguity warnings
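
The confidence tiers and the total_rows sum described in the table above come down to simple arithmetic. The following is a minimal sketch of that arithmetic only; confidence_from_keywords and total_rows are hypothetical helper names, not part of Misata, and nothing here reflects how the library computes these fields internally.

```python
def confidence_from_keywords(matched_keywords):
    """Map a keyword count to the documented tiers: >=2 high, 1 low, 0 none."""
    count = len(matched_keywords)
    if count >= 2:
        return "high"
    if count == 1:
        return "low"
    return "none"


def total_rows(table_preview):
    """Sum the per-table row counts from a table_preview-style list of dicts."""
    return sum(table["rows"] for table in table_preview)


# Values copied from the preview() example earlier on this page.
preview_tables = [
    {"name": "users", "rows": 5000, "columns": 12},
    {"name": "subscriptions", "rows": 5000, "columns": 8},
]
print(confidence_from_keywords(["saas", "mrr"]))  # high
print(total_rows(preview_tables))                 # 10000
```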

Domain detection — how it scores

Detection is scored, not first-match. For each domain:

  • +5 if the literal domain name appears in the story (e.g. "fintech" → fintech domain gets +5)
  • +1 per matched keyword

The highest-scoring domain wins. This means "a fintech company with churn" correctly detects as fintech even though "churn" is a SaaS keyword — "fintech" earns +5 and beats the single SaaS keyword hit.

If a story is ambiguous, the near_misses field tells you which other domains also matched.

report = misata.preview("A fintech company with crypto wallets and 5k users")
print(report.domain)        # "fintech"  (+5 for "fintech" literal)
print(report.near_misses)   # {"crypto": ["crypto", "wallet"]}
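
The scoring rule above (+5 for the literal domain name, +1 per matched keyword) can be sketched in a few lines. This is an illustration under stated assumptions, not Misata's implementation: it uses only three of the documented keyword lists, matches whole words with an optional plural "s", and counts the domain name both as the literal bonus and as a keyword. The function names are hypothetical.

```python
import re

# Subset of the keyword lists from the domain table on this page.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "subscription", "mrr", "arr", "churn"],
    "fintech": ["fintech", "payments", "banking", "fraud", "wallet"],
    "crypto": ["crypto", "blockchain", "ethereum", "defi", "wallet"],
}


def mentions(word, text):
    """True if `word` (optionally pluralised with 's') appears as a whole word."""
    return re.search(rf"\b{re.escape(word)}s?\b", text) is not None


def score_domains(story):
    """Return {domain: score} for every domain with a non-zero score."""
    text = story.lower()
    scores = {}
    for domain, keywords in DOMAIN_KEYWORDS.items():
        score = sum(1 for kw in keywords if mentions(kw, text))
        if mentions(domain, text):
            score += 5  # literal domain-name bonus
        if score:
            scores[domain] = score
    return scores


scores = score_domains("a fintech company with churn")
winner = max(scores, key=scores.get)
print(scores)  # {'saas': 1, 'fintech': 6}
print(winner)  # fintech
```

In this sketch, every non-winning entry in scores plays the role that near_misses plays in the report.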

Disambiguation tip

Name the domain explicitly and it always wins:

# Ambiguous
misata.generate("A platform with subscription payments and crypto wallets")

# Unambiguous — fintech wins because the word "fintech" scores +5
misata.generate("A fintech platform with subscription payments and crypto wallets")

Scale extraction

All of these forms are recognised:

1000 users       → users: 1000
5k users         → users: 5000
1.5M customers   → users: 1500000
200 employees    → users: 200
500 doctors      → users: 500
10k orders       → orders: 10000
50k transactions → transactions: 50000

Child tables scale proportionally based on the domain's FK cardinality ratios. A SaaS company with 5k users automatically produces ~5k subscriptions and ~20k invoices (4× ratio).
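
The number forms listed above and the proportional child-table scaling can be approximated as follows. This is a rough sketch, not the StoryParser's actual logic: the regular expression, the noun-to-key mapping (copied from the examples above), and the ratio dictionary (derived from the SaaS sentence above) are all assumptions made for illustration.

```python
import re

SUFFIX_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000}

# Noun -> scale key, mirroring the examples listed above.
NOUN_TO_KEY = {
    "users": "users", "customers": "users", "employees": "users",
    "doctors": "users", "orders": "orders", "transactions": "transactions",
}

SCALE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*([km]?)\s+(" + "|".join(NOUN_TO_KEY) + r")\b",
    re.IGNORECASE,
)


def extract_scale(story):
    """Parse phrases such as '5k users' or '1.5M customers' into row counts."""
    params = {}
    for number, suffix, noun in SCALE_PATTERN.findall(story):
        factor = SUFFIX_MULTIPLIER[suffix.lower()]
        params[NOUN_TO_KEY[noun.lower()]] = round(float(number) * factor)
    return params


# Hypothetical ratios: 1 subscription and 4 invoices per user.
SAAS_CHILD_RATIOS = {"subscriptions": 1, "invoices": 4}


def scale_children(parent_rows, ratios):
    """Derive child-table row counts from the parent count and per-table ratios."""
    return {table: parent_rows * ratio for table, ratio in ratios.items()}


print(extract_scale("1.5M customers and 10k orders"))
# {'users': 1500000, 'orders': 10000}
print(scale_children(5000, SAAS_CHILD_RATIOS))
# {'subscriptions': 5000, 'invoices': 20000}
```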


Narrative growth curves

This is Misata's core differentiator: natural language maps to exact per-month targets that shape the generated data. Specify them in any order; Misata interpolates between control points.

Monthly anchors

# From–to with interpolation
misata.generate("SaaS company — MRR from $50k in January to $200k in December")

# Multiple control points
misata.generate("SaaS mrr $50k in Jan, $90k in June, $200k in December")

# Mixed: anchors + qualitative modifiers
misata.generate("SaaS mrr $50k in Jan, peak in November, $200k in Dec")
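
To make "interpolates between control points" concrete, here is a small sketch that turns monthly anchors into twelve targets. It assumes straight-line interpolation, which this page does not confirm; Misata may use a different curve. The function name interpolate_monthly is hypothetical.

```python
def interpolate_monthly(anchors):
    """Linearly interpolate {month: value} anchors into 12 monthly targets.

    Months before the first anchor or after the last one reuse the nearest
    anchor value (an arbitrary choice for this sketch).
    """
    months = sorted(anchors)
    targets = []
    for month in range(1, 13):
        earlier = [m for m in months if m <= month]
        later = [m for m in months if m >= month]
        if not earlier:
            targets.append(float(anchors[later[0]]))
        elif not later:
            targets.append(float(anchors[earlier[-1]]))
        else:
            lo, hi = earlier[-1], later[0]
            if lo == hi:
                # The month is itself an anchor.
                targets.append(float(anchors[lo]))
            else:
                t = (month - lo) / (hi - lo)
                targets.append(anchors[lo] + t * (anchors[hi] - anchors[lo]))
    return targets


# "$50k in Jan, $90k in June, $200k in December"
targets = interpolate_monthly({1: 50_000, 6: 90_000, 12: 200_000})
print([round(value) for value in targets])
```

The anchor months keep their exact values; only the months in between are derived.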

Quarterly patterns

Quarter keywords expand to all three constituent months:

# "Q4 spike" → months 10, 11, 12 all boosted by 1.3×
misata.generate("Ecommerce orders — Q4 spike, Q1 slump")

# "strong Q4" → months 10, 11, 12 lifted by 1.15×
misata.generate("SaaS revenue — strong Q4, flat Q2")

# Quarter-level anchors
misata.generate("SaaS mrr — $100k in Q1, $150k in Q2, $200k in Q3, $250k in Q4")

Pattern           Months affected  Factor
Q1 dip / slump    Jan, Feb, Mar    0.7×
Q2 flat           Apr, May, Jun    1.0×
Q3 peak / spike   Jul, Aug, Sep    1.25–1.3×
Q4 push / strong  Oct, Nov, Dec    1.15–1.2×
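
The expansion of a quarter keyword into its three months might look like the sketch below. The factor values are taken from this page (0.7 for slump, 1.0 for flat, 1.15 for strong, 1.3 for spike); the data structures and the function name quarter_factors are hypothetical.

```python
QUARTER_MONTHS = {
    "q1": (1, 2, 3),
    "q2": (4, 5, 6),
    "q3": (7, 8, 9),
    "q4": (10, 11, 12),
}

# Factors as documented on this page.
KEYWORD_FACTORS = {"slump": 0.7, "flat": 1.0, "strong": 1.15, "spike": 1.3}


def quarter_factors(patterns):
    """Expand (quarter, keyword) pairs into a {month: factor} mapping."""
    factors = {month: 1.0 for month in range(1, 13)}
    for quarter, keyword in patterns:
        for month in QUARTER_MONTHS[quarter]:
            factors[month] *= KEYWORD_FACTORS[keyword]
    return factors


# "Q4 spike, Q1 slump"
factors = quarter_factors([("q4", "spike"), ("q1", "slump")])
print(factors[1], factors[6], factors[11])  # 0.7 1.0 1.3
```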

Named seasonal events

misata.generate("Ecommerce orders — Black Friday spike, Christmas peak")
misata.generate("EdTech enrollments — back to school surge")
misata.generate("SaaS signups — New Year spike, summer slump")

Event phrase                     Month          Factor
Black Friday                     November       1.55×
Cyber Monday / Cyber Week        November       1.4–1.45×
Christmas / Xmas                 December       1.4×
Holiday season / Festive season  December       1.3–1.35×
New Year                         January        1.25×
Valentine                        February       1.2×
Tax season                       April          1.2×
Back to school                   August         1.2×
Summer slump / Slow summer       July + August  0.75× each
Relative multipliers

When you know the end-state but not the absolute numbers, use a multiplier:

# Pure multiplier — Misata derives a sensible baseline and scales it
misata.generate("SaaS startup — MRR 10x growth over the year")
misata.generate("Fintech transaction volume doubled over the year")
misata.generate("Ecommerce GMV tripled in one year")

# Multiplier + one anchor — uses the anchor as the pivot
# Jan is pinned at $50k; Dec is derived as $100k (2× Jan)
misata.generate("SaaS mrr $50k in January, doubled by December")

# Halved (decline story)
misata.generate("SaaS revenue halved after the pivot")

Word form        Factor
halved           0.5×
doubled / 2x     2×
tripled / 3x     3×
quadrupled / 4x  4×
5x / 10x         5× / 10×
grew 300%        4× (1 + 3.0)
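
The word forms above reduce to a single growth factor. The sketch below shows one way to derive it, including the percentage case where "grew 300%" becomes 1 + 3.0 = 4×. The regular expressions and the function name growth_multiplier are assumptions for illustration, not Misata's parser.

```python
import re

WORD_FACTORS = {"halved": 0.5, "doubled": 2.0, "tripled": 3.0, "quadrupled": 4.0}


def growth_multiplier(story):
    """Return the growth factor implied by the story, or None if none is found."""
    text = story.lower()
    for word, factor in WORD_FACTORS.items():
        if word in text:
            return factor
    match = re.search(r"\b(\d+(?:\.\d+)?)x\b", text)  # "10x", "2.5x"
    if match:
        return float(match.group(1))
    match = re.search(r"grew\s+(\d+(?:\.\d+)?)\s*%", text)  # "grew 300%"
    if match:
        return 1 + float(match.group(1)) / 100
    return None


# One anchor plus a multiplier: January is pinned, December is derived.
january = 50_000
december = january * growth_multiplier("SaaS mrr $50k in January, doubled by December")
print(december)  # 100000.0
```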

Qualitative month modifiers

misata.generate("SaaS mrr — dip in March, peak in November")
misata.generate("Ecommerce orders — slump in January, boom in December")

Keyword               Factor
crash                 0.5×
dip / drop / slump    0.7–0.72×
decline               0.75×
slow / low            0.8×
flat                  1.0×
strong / push         1.15–1.2×
high                  1.2×
peak                  1.25×
boom / spike / surge  1.3×
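
A sketch of how "keyword in month" phrases could map to per-month factors. Where the table above gives a range, this sketch picks the lower bound (an arbitrary choice); the pattern and the function name month_modifiers are hypothetical.

```python
import re

# Factors from the table above; lower bound used where a range is given.
KEYWORD_FACTORS = {
    "crash": 0.5, "dip": 0.7, "drop": 0.7, "slump": 0.7, "decline": 0.75,
    "slow": 0.8, "low": 0.8, "flat": 1.0, "strong": 1.15, "push": 1.15,
    "high": 1.2, "peak": 1.25, "boom": 1.3, "spike": 1.3, "surge": 1.3,
}

MONTH_ABBREVIATIONS = [
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec",
]
MONTHS = {name: number for number, name in enumerate(MONTH_ABBREVIATIONS, start=1)}

MODIFIER_PATTERN = re.compile(
    r"\b(" + "|".join(KEYWORD_FACTORS) + r")\s+in\s+([a-z]+)"
)


def month_modifiers(story):
    """Map phrases like 'dip in March' to {month number: factor}."""
    result = {}
    for keyword, month_word in MODIFIER_PATTERN.findall(story.lower()):
        month = MONTHS.get(month_word[:3])  # accepts "mar" and "march"
        if month is not None:
            result[month] = KEYWORD_FACTORS[keyword]
    return result


print(month_modifiers("SaaS mrr: dip in March, peak in November"))
# {3: 0.7, 11: 1.25}
```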

Trigger tokens

A curve is only built when the story contains at least one of these signal words:

revenue, sales, mrr, arr, gmv, amount, orders, bookings, transactions, volume, churn, growth, peak, dip, spike, surge, drop, decline, slump, boom, doubled, tripled, halved, black friday, christmas, summer slump, q1, q2, q3, q4
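
To check in advance whether a story contains any of these signal words, a helper along the following lines could be used. It assumes whole-word, case-insensitive matching, which this page does not specify; has_curve_trigger is a hypothetical name.

```python
import re

TRIGGER_TOKENS = [
    "revenue", "sales", "mrr", "arr", "gmv", "amount", "orders", "bookings",
    "transactions", "volume", "churn", "growth", "peak", "dip", "spike",
    "surge", "drop", "decline", "slump", "boom", "doubled", "tripled",
    "halved", "black friday", "christmas", "summer slump",
    "q1", "q2", "q3", "q4",
]

TRIGGER_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(token) for token in TRIGGER_TOKENS) + r")\b"
)


def has_curve_trigger(story):
    """True if the story contains at least one trigger token as a whole word."""
    return TRIGGER_PATTERN.search(story.lower()) is not None


print(has_curve_trigger("Ecommerce orders, Q4 spike"))            # True
print(has_curve_trigger("A fintech startup with 10k customers"))  # False
```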


All 18 domains

Domain        Trigger keywords                                   Tables
saas          saas, subscription, mrr, arr, churn                users, subscriptions, invoices
ecommerce     ecommerce, orders, store, retail, cart             customers, products, orders, order_items
fintech       fintech, payments, banking, fraud, wallet          customers, accounts, transactions
healthcare    healthcare, patients, doctors, clinic, hospital    doctors, patients, appointments
marketplace   marketplace, sellers, buyers, listings, freelance  sellers, buyers, listings, orders
logistics     logistics, shipping, drivers, fleet, routes        drivers, vehicles, routes, shipments
hr            hr, employees, payroll, workforce, headcount       departments, employees, payroll
social        social media, instagram, tiktok, followers, feed   users, posts, follows, reactions, comments
realestate    real estate, housing, mortgage, listings           agents, properties, transactions
pharma        pharma, clinical, trials, research                 researchers, projects, trials, timesheets
fooddelivery  food delivery, restaurants, takeout, doordash      restaurants, customers, couriers, orders, order_items
edtech        edtech, courses, students, enrollments, lms        instructors, courses, students, enrollments, quiz_attempts
gaming        gaming, players, leaderboard, esports, matches     players, matches, sessions, achievements
crm           crm, contacts, deals, pipeline, salesforce         companies, contacts, deals, activities
crypto        crypto, blockchain, ethereum, defi, wallet         wallets, tokens, transactions, token_prices
insurance     insurance, policy, claims, premium                 customers, policies, claims, payments
travel        travel, hotel, flights, bookings, airbnb           users, hotels, flights, bookings, reviews
streaming     streaming, netflix, subscribers, watch history     subscribers, content, watch_history, ratings

Detailed domain reference with column listings →


Step-by-step: inspect then generate

import misata

# Step 1 — preview (zero rows generated)
report = misata.preview("A fintech with 5k customers, Black Friday spike", rows=5000)
if report.domain_confidence == "none":
    print("⚠ No domain detected")
    print(report.warnings)

# Step 2 — inspect full schema
schema = misata.parse("A fintech with 5k customers, Black Friday spike", rows=5000)
print(schema.summary())
# Tables: customers, accounts, transactions
# Outcome curves: 1 (transactions.amount, monthly)

# Step 3 — generate
tables = misata.generate_from_schema(schema, seed=42)
print(tables["transactions"].head())
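
The inspect-then-generate flow above can be wrapped in a small gate that decides whether a report looks safe to act on. The sketch below relies only on the DetectionReport field names documented on this page and uses a stand-in object so it runs without Misata; ready_to_generate and the max_rows threshold are hypothetical.

```python
from types import SimpleNamespace


def ready_to_generate(report, max_rows=1_000_000):
    """Return (ok, reasons) for a DetectionReport-like object."""
    reasons = []
    if report.domain_confidence == "none":
        reasons.append("no domain detected")
    reasons.extend(report.warnings)
    if report.total_rows > max_rows:
        reasons.append(f"total_rows {report.total_rows} exceeds {max_rows}")
    return (not reasons, reasons)


# Stand-in with the documented field names; a real report would come
# from misata.preview(). The values here are made up for the demo.
fake_report = SimpleNamespace(
    domain="fintech",
    domain_confidence="high",
    warnings=[],
    total_rows=15_000,
)
ok, reasons = ready_to_generate(fake_report)
print(ok, reasons)  # True []
```

Returning the reasons alongside the boolean makes it easy to log why a story was rejected before any rows are generated.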

Tips

Be explicit about scale: "5k users" is always clearer than "a medium-sized company".

Name the domain: "A fintech company with..." always wins over a story that only uses secondary keywords.

Combine anchors freely: Monthly anchors, quarter patterns, named events, and multipliers can all appear in the same story. Named events and quarter patterns stack multiplicatively.

Use seed for reproducibility: Same seed + same story = byte-identical output every time.

Switch to LLM for open-ended stories: If your story doesn't fit any of the 18 domains, LLMSchemaGenerator can interpret it using a large language model:

import misata
from misata import LLMSchemaGenerator

gen = LLMSchemaGenerator(provider="groq")
schema = gen.generate_from_story("A B2B API platform with rate limits and invoicing")
tables = misata.generate_from_schema(schema)