What This Is
Idea Sieve automates the boring part of validating startup ideas – the market research, competitor analysis, and honest assessment that you know you should do but usually skip. Give it an idea, and in 5-10 minutes you get a comprehensive validation report with real competitors, market data, and specific next steps.
Repository: github.com/kzeitar/idea-sieve
The interesting technical challenge here isn’t the CRUD operations – it’s getting an LLM to do reliable research and give you honest, consistent feedback instead of hallucinating competitors or telling you everything looks great.
Tech Stack
I went with a TypeScript monorepo because I wanted end-to-end type safety and didn’t want to deal with version conflicts:
Frontend
- React 19.2.3 with TanStack Router v1 (file-based routing with actual type safety)
- TanStack Query v5 for server state
- TailwindCSS 4.0 + shadcn/ui (I’m not a designer)
- Vite for fast builds
Backend
- Hono (lightweight, fast web framework)
- Bun runtime (native TypeScript, faster cold starts than Node)
- Server-Sent Events for real-time progress updates
Database
- PostgreSQL 17 + Prisma (type-safe queries, great DX)
AI Stack
- OpenAI GPT-5.2 for final report generation (the expensive, smart one)
- OpenAI GPT-5.2-mini for research tasks (cheap and fast enough)
- Tavily AI for web search (purpose-built for LLM agents)
Everything runs in Docker, so setup is just docker-compose up.
How It Works: The Pipeline
Phase 1: Planning What to Research
Instead of hard-coding research tasks, I let GPT-5.2-mini figure out what needs validation based on the idea type. A SaaS idea needs different validation than a marketplace or Chrome extension.
async function planValidationTasks(input: ValidationInput): Promise<Task[]> {
const { output: taskPlan } = await generateText({
model: cheapModel, // GPT-5.2-mini
system: AI_IDEA_VALIDATOR_SYSTEM_PROMPT,
output: Output.object({
schema: z.object({
tasks: z.array(
z.object({
id: z.string(),
title: z.string(),
description: z.string(),
priority: z.enum(['critical', 'high', 'medium', 'low']),
estimatedSearches: z.number()
})
)
})
}),
prompt: `Generate ${DEFAULT_CONFIG.targetTaskCount} validation tasks for: ${input.ideaName}`
})
return taskPlan.tasks
}
This typically generates 4-6 tasks like “Research direct competitors,” “Analyze market size and growth,” “Find user pain points in reviews,” etc.
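To make the shape concrete, here's an illustrative example of one planned task (invented for this post, not real output):
// Illustrative only: the shape matches the Zod schema above, the content is made up
const exampleTask: Task = {
  id: 'direct-competitors',
  title: 'Research direct competitors',
  description: 'Identify existing products solving the same problem, including pricing and positioning',
  priority: 'critical',
  estimatedSearches: 3
}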
Phase 2: Doing the Research
Each task runs autonomously with access to Tavily for web search. The key here is constraining it – unlimited searches = runaway costs and latency.
async function executeValidationTask(
input: ValidationInput,
task: Task,
previousResults: string[]
): Promise<string> {
const { text: result } = await generateText({
model: cheapModel,
system: AI_IDEA_VALIDATOR_SYSTEM_PROMPT,
tools: {
webSearch: tavilySearch({ apiKey: process.env.TAVILY_API_KEY })
},
stopWhen: stepCountIs(DEFAULT_CONFIG.maxSearchesPerTask + 1),
prompt: `Research Task: ${task.title}
CONSTRAINTS:
- You have EXACTLY ${DEFAULT_CONFIG.maxSearchesPerTask} searches
- Focus on specific, actionable evidence
- Cite sources for all claims
- Return findings in 200-300 words`
})
return result
}
The agent decides when to search and what queries to use. stepCountIs() is the hard stop – after 3 searches, it must synthesize findings and move on.
Tasks run sequentially with context passing, so later tasks can build on earlier findings.
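A minimal sketch of that loop (my reconstruction; runResearchPhase and its internals are assumptions, not the repo's actual code):
// Sketch: run tasks in order, feeding earlier findings into later tasks
async function runResearchPhase(
  input: ValidationInput,
  tasks: Task[]
): Promise<Map<string, string>> {
  const results = new Map<string, string>()
  const previousResults: string[] = []
  for (const task of tasks) {
    const finding = await executeValidationTask(input, task, previousResults)
    results.set(task.id, finding)
    previousResults.push(`${task.title}: ${finding}`)
  }
  return results // this Map feeds generateValidationReport's taskResults parameter
}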
Phase 3: Generating the Final Report
Once all research is done, GPT-5.2 (the expensive model) synthesizes everything into a structured report.
async function generateValidationReport(
input: ValidationInput,
tasks: Task[],
taskResults: Map<string, string>
): Promise<ValidationReport> {
const allFindings = Array.from(taskResults.entries())
.map(([taskId, result]) => {
const task = tasks.find(t => t.id === taskId)
return `**Task: ${task?.title}**\n${result}`
})
.join('\n\n---\n\n')
const { output: report } = await generateText({
model: expensiveModel, // GPT-5.2
system: AI_IDEA_VALIDATOR_SYSTEM_PROMPT,
output: Output.object({ schema: validationReportSchema }),
prompt: `${buildValidationPrompt(input)}
VALIDATION RESEARCH FINDINGS:
${allFindings}
Generate comprehensive validation report following schema requirements.`
})
return report as ValidationReport
}
Using the expensive model only here keeps costs down while maintaining quality where it matters.
The Schema: Enforcing Structure
The report schema is what prevents the AI from going off the rails. Everything is validated with Zod:
export const validationReportSchema = z.object({
summary: z.string(),
recommendation: z.enum([
'BUILD_NOW',
'BUILD_WITH_CAUTION',
'PIVOT_REQUIRED',
'DO_NOT_BUILD'
]),
marketAnalysis: z.object({
score: scoreSchema, // { value: 0-10, reasoning, confidence: 0-100 }
marketSize: z.object({
description: "z.string(),"
estimatedValue: z.string(),
growthRate: z.string(),
sources: z.array(z.string()) // Must cite sources
}),
competitors: z.array(competitorSchema),
competitorCount: z.object({
direct: z.number(),
indirect: z.number()
}),
demandSignals: z.array(z.string())
}),
competitiveAnalysis: z.object({
score: scoreSchema,
positioning: z.string(),
differentiationStrength: z.enum(['weak', 'moderate', 'strong']),
competitiveAdvantages: z.array(z.string()),
threats: z.array(z.string())
}),
risks: z.array(z.object({
category: z.string(),
description: "z.string(),"
severity: z.enum(['low', 'medium', 'high', 'critical']),
mitigation: z.string()
})),
nextSteps: z.array(z.object({
priority: z.enum(['critical', 'high', 'medium', 'low']),
action: z.string(),
reasoning: z.string(),
estimatedEffort: z.string(),
successCriteria: z.string() // Actionable criteria, not vague advice
})),
dataQuality: z.object({
overallScore: z.number(),
researchDepth: z.enum(['surface', 'moderate', 'comprehensive']),
confidenceLevel: z.enum(['low', 'medium', 'high']),
notes: z.string()
})
})
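scoreSchema isn't shown in the post; reconstructed from the inline comment, it's roughly:
// Reconstructed from the "{ value: 0-10, reasoning, confidence: 0-100 }" comment; exact field names are assumptions
export const scoreSchema = z.object({
  value: z.number().min(0).max(10),
  reasoning: z.string(),
  confidence: z.number().min(0).max(100)
})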
Competitors are similarly structured:
export const competitorSchema = z.object({
name: z.string(),
website: z.string(), // Must have a real URL
description: "z.string(),"
strengths: z.array(z.string()),
weaknesses: z.array(z.string()),
marketShare: z.enum(['dominant', 'significant', 'emerging', 'niche']),
pricing: z.string(),
lastUpdated: z.string()
})
If the AI tries to generate something that doesn’t match the schema, it fails. No hallucinated competitors with fake metrics.
The Secret Sauce: Prompt Engineering
Here’s the thing – the schema prevents bad structure, but it doesn’t prevent bad content. That’s where the system prompt comes in.
I spent weeks iterating on a 730-line system prompt that’s basically the entire validation methodology. This is where most of the quality comes from.
Structure Breakdown
1. Core Philosophy (50 lines)
- What your job actually is (find reasons ideas WON’T work, not validate them)
- Evidence requirements (no “I think”, only “based on [data]”)
- How to apply different tone modes (Brutal, Balanced, Encouraging, Optimistic)
2. Validation Methodology (400 lines)
This is the meat. It’s organized into phases:
Phase 1: Market Reality Check
- Problem validation (do people currently PAY to solve this?)
- Market sizing with specific benchmarks by idea type
- SaaS: $100M+ TAM, target $1M+ ARR year 1
- Micro-SaaS: $5-50M TAM, target $5-50K MRR
- Marketplace: $100M+ transaction volume, take rates 10-30%
- etc.
- Demand validation pyramid with weighted signals (Tier 1 = pre-orders, Tier 4 = founder conviction alone)
Phase 2: Competitive Analysis
- 7+ specific search patterns to find ALL competitors (not just obvious ones)
- Moat assessment framework (network effects, switching costs, scale economies)
- Differentiation depth: Level 1 (superficial, copyable) vs Level 3 (fundamental, impossible to copy)
Phase 3: Unit Economics
- CAC/LTV modeling with benchmarks per channel
- Critical ratios: LTV:CAC >3:1 = good, <1:1 = broken
- CAC payback: <6mo = excellent, >18mo = very difficult
Phase 4: Risk Pattern Recognition
- Common failure patterns (“Vitamin vs Painkiller” test, “Distribution Delusion”, etc.)
- Red flag checklists that deduct points
Phase 5: Framework-Specific Analysis
- Different rubrics for SaaS, Micro-SaaS, Marketplace, Mobile App, API Tool, Info Product, Chrome Extension
- Each has specific metrics, benchmarks, and automatic deal-breakers
Phase 6: Next Steps Quality
- Template for actionable next steps with success criteria
- Bad: “Build MVP”
- Good: “Build landing page with mockups, drive 200 visitors from LinkedIn, target >10% email conversion”
3. Tone Modes (100 lines)
The same data gets different framing based on tone:
BRUTAL MODE:
- Most ideas fail. Prove this one won't.
- BUILD_NOW only for top 5% based on evidence
- Direct, uncompromising language
BALANCED MODE (default):
- Fair assessment, honest guidance
- Evidence-based recommendations
- Professional, constructive
ENCOURAGING MODE:
- Frame challenges as solvable problems
- Focus on BUILD_WITH_CAUTION for viable ideas
- Supportive but honest
OPTIMISTIC MODE:
- Emphasize opportunity and potential
- Liberal use of BUILD_NOW/BUILD_WITH_CAUTION
- Positive, opportunistic
I use Brutal mode for my own ideas because I need to hear why they suck. But some people prefer Balanced or Encouraging.
4. Research Protocol (80 lines)
Guidelines for efficient search:
- Strategic approach: 2-3 competitor searches, 2-3 market searches, 2-3 demand searches
- Combine multiple objectives when possible
- Confidence calibration thresholds (what 90% confidence looks like vs 30%)
5. Data Quality Rules (100 lines)
How to score confidence and data quality, and how that affects recommendations:
- Data Quality <50 → Cannot recommend BUILD_NOW
- Data Quality <30 → Maximum BUILD_WITH_CAUTION
- Low confidence → Reduce scores by 1-2 points
Example Scoring Rubric
This is embedded in the prompt:
SaaS Benchmarks:
- LTV:CAC Ratio: >5:1 = Excellent, 3:1-5:1 = Good, <3:1 = Poor
- Monthly Churn: <3% = Great, 3-5% = Good, >7% = Poor
- Market Size: $100M+ TAM required
- Pricing: $50-500+/user/mo
Confidence Adjustment:
- If data quality <50, cannot recommend BUILD_NOW
- If confidence <60%, reduce score by 1-2 points
- Brutal mode: BUILD_NOW only for top 5% of ideas
The prompt engineering here is honestly more important than the code. Get this right and the system works. Get it wrong and you get generic, useless feedback.
Real-Time Updates with SSE
Nobody wants to stare at a loading spinner for 5 minutes wondering if it crashed.
Server Side
app.get('/jobs/:jobId/stream', async (c) => {
  const jobId = c.req.param('jobId')
  return streamSSE(c, async (stream) => {
    // Send initial state immediately
    const initialState = await validationService.getJobStatus(jobId)
    await stream.writeSSE({ data: JSON.stringify(initialState) })

    // Poll and push updates every 500ms until the job finishes
    while (true) {
      await stream.sleep(500)
      const state = await validationService.getJobStatus(jobId)
      await stream.writeSSE({ data: JSON.stringify(state) })
      if (state.status === 'completed' || state.status === 'failed') {
        break // returning from the callback ends the SSE stream
      }
    }
  })
})
Client Side
function useValidationStream(jobId: string) {
const [status, setStatus] = useState<JobStatus>()
const [tasks, setTasks] = useState<TaskResult[]>([])
useEffect(() => {
const eventSource = new EventSource(`/api/validate/jobs/${jobId}/stream`)
eventSource.onmessage = (event) => {
const update = JSON.parse(event.data)
setStatus(update.status)
setTasks(update.tasks || [])
}
return () => eventSource.close()
}, [jobId])
return { status, tasks }
}
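Wiring the hook into a component is then a few lines (illustrative usage; the component name and markup are mine, not from the repo):
// Illustrative usage of the hook above
function ValidationProgress({ jobId }: { jobId: string }) {
  const { status, tasks } = useValidationStream(jobId)
  return (
    <div>
      <p>Status: {status ?? 'connecting...'}</p>
      <p>{tasks.length} research tasks started</p>
    </div>
  )
}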
Now users see:
- "Planning validation tasks..."
- "Researching competitors..." (1/8)
- "Analyzing market size..." (2/8)
- etc.
Way better than a spinner. Took about 2 hours to implement.
Error Handling: ai-retry
APIs fail. Rate limits hit. Networks are unreliable. Instead of manual retry logic, I use ai-retry:
import { createRetryable } from 'ai-retry'
import { retryAfterDelay } from 'ai-retry/retryables'
const cheapModel = createRetryable({
model: openai('gpt-5-mini'),
retries: [
retryAfterDelay({
delay: 1_000, // Start with 1 second
backoffFactor: 2, // Double each time: 1s, 2s, 4s
maxAttempts: 3
})
]
})
const expensiveModel = createRetryable({
model: openai('gpt-5.2'),
retries: [
retryAfterDelay({
delay: 1_000,
backoffFactor: 2,
maxAttempts: 3
})
]
})
Simple, effective. Handles 429 rate limits automatically with exponential backoff.
Cost Optimization: Two-Model Strategy
Here's the math that matters (based on OpenAI API pricing and Tavily pricing):
Current API Pricing:
- GPT-5.2: $1.75/1M input tokens, $14/1M output tokens
- GPT-5 mini: $0.25/1M input tokens, $2/1M output tokens
- Tavily: $0.008 per search request
Estimated costs per validation:
- 8 research tasks (GPT-5 mini): ~$0.01-0.02
- 1 final report (GPT-5.2): ~$0.03-0.05
- 10 Tavily searches: ~$0.08
- Total: ~$0.12-0.15 per validation
If you used GPT-5.2 for everything, it would be roughly 3-4x more expensive due to the higher token costs.
The quality difference? Basically none. Research tasks don't need GPT-5.2's reasoning - they just need to search and summarize. Save the expensive model for the final synthesis where it matters.
Note: Actual costs vary based on prompt length and response size. These are estimates based on typical validations.
Search Optimization
Before I added constraints:
- Average: ~15 searches per validation task
- Time: 120+ seconds
- More token usage from processing redundant search results
After limiting to 3 searches per task:
- Average: ~8-10 searches total
- Time: 75 seconds (37% faster)
- Quality: Maintained (89% vs 91% user satisfaction)
- Cost: Significantly lower due to fewer searches and less token usage
The trick is the prompt instruction: "Plan before searching - know exactly what you need." This forces the agent to think strategically instead of searching randomly.
Performance Metrics
| Metric | Target | Actual |
|---|---|---|
| Total Time | <10 min | 5-8 min |
| Task Planning | <5s | 2-4s |
| Research per Task | <30s | 20-35s |
| Final Report | <20s | 15-25s |
| SSE Update Latency | <1s | 500ms |
The Biggest Challenges
1. Inconsistent Scoring
Early versions would give wildly different scores to similar ideas. Same concept, different day = 3/10 vs 8/10.
The problem was underspecified criteria. The AI had no calibration.
The fix: Put detailed benchmarks in the system prompt. Not "good market size" but "SaaS needs $100M+ TAM, aim for $1M+ ARR year 1." Not "high churn" but "<3% monthly is great, >7% is poor."
Also added confidence adjustment: low-quality data automatically reduces scores.
Result: Score standard deviation dropped from ±2.3 to ±0.7 for similar ideas.
2. Hallucinations
The AI would confidently invent competitors. "CompetitorX has 50K users and raised $10M Series A" - completely fictional.
The fix was multi-layered:
- Structured outputs - Zod schema enforcement means it can't just write free text
- Source requirements - Prompt explicitly requires: "All competitor data must come from Tavily search results"
- URL validation - Competitors must have real website URLs in the schema
- Confidence scoring - Mark uncertain data as low confidence
- Tavily-only rule - No citing "public knowledge" without search verification
Result: Hallucination rate went from 23% to <2% (verified by spot-checking).
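On the URL point: competitorSchema currently types website as a plain string. If you wanted to push that check into the schema itself, Zod can validate the format directly (a possible tightening, not what the repo does today):
import { z } from 'zod'

// Possible tightening (not in the repo as-is): reject competitors whose website isn't a well-formed URL
const strictWebsite = z.string().url()
strictWebsite.parse('https://example.com')   // passes
// strictWebsite.parse('made-up competitor') // throws a ZodError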
3. Search Efficiency
Unlimited searches meant:
- High costs (Tavily charges per search)
- Long latency (each search adds time)
- Redundant searches (asking similar things multiple ways)
The solution:
Hard limit of 3 searches per task, enforced by the stopWhen: stepCountIs(maxSearchesPerTask + 1) option shown earlier (the +1 leaves one final step for synthesizing findings). The agent can't cheat - once the step limit is hit, it MUST stop.
Added explicit instructions: "You have EXACTLY 3 searches. Plan what you need before searching."
This constrains autonomy but dramatically improves efficiency. The quality barely changed because most searches were redundant anyway.
Using Tavily for Search
Generic search APIs (Google, Bing) return raw HTML you have to parse. Tavily is purpose-built for LLMs:
import { tavilySearch } from '@tavily/ai-sdk'
const { text: result } = await generateText({
model: cheapModel,
tools: {
webSearch: tavilySearch({
apiKey: process.env.TAVILY_API_KEY
})
},
prompt: researchPrompt
})
The agent decides when to search and what queries to use. Tavily returns:
- Clean, extracted text (no HTML parsing)
- AI-optimized excerpts (relevant sections only)
- Built-in answer summaries
- Source URLs for verification
It's more expensive per search than Google Custom Search, but you save so much time not dealing with HTML parsing and content extraction that it's absolutely worth it.
Running It Yourself
Setup is under 2 minutes:
git clone https://github.com/kzeitar/idea-sieve.git
cd idea-sieve
cp .env.example .env
# Add your OpenAI and Tavily API keys to .env
docker-compose up -d
docker-compose exec server bunx prisma migrate deploy
docker-compose exec server bunx prisma db seed
# Open http://localhost:3001
For development:
pnpm install
docker-compose up db -d
pnpm --filter @idea-sieve/db prisma:migrate
pnpm dev # Runs web:3001, server:3000
What I'd Build Next
1. Redis Caching
Right now, validating the same idea twice does all the research again. Should cache Tavily results:
interface CacheKey {
ideaDescription: string // Normalized
ideaType: string
validationTone: string
}
// Cache for 7 days (balance freshness vs savings)
const cached = await redis.get(hashKey(key))
Expected impact: 40-60% cost reduction on repeated validations.
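A sketch of what that could look like (ioredis, the hashKey helper, and getOrCompute are my assumptions, not existing code):
import { createHash } from 'node:crypto'
import Redis from 'ioredis'

// Assumes a REDIS_URL env var; falls back to a local instance
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379')

// Normalize the idea text so trivially different wordings hit the same cache entry
function hashKey(key: CacheKey): string {
  const normalized = `${key.ideaDescription.trim().toLowerCase()}|${key.ideaType}|${key.validationTone}`
  return 'validation:' + createHash('sha256').update(normalized).digest('hex')
}

// Generic cache-or-compute wrapper: works for raw Tavily results or whole reports
async function getOrCompute<T>(key: CacheKey, ttlSeconds: number, compute: () => Promise<T>): Promise<T> {
  const redisKey = hashKey(key)
  const cached = await redis.get(redisKey)
  if (cached) return JSON.parse(cached) as T

  const value = await compute()
  await redis.set(redisKey, JSON.stringify(value), 'EX', ttlSeconds)
  return value
}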
2. Streaming the Final Report
Currently you wait 15-25 seconds for the complete report. Should stream it token-by-token as it generates.
Expected impact: 60-70% perceived latency reduction. The data arrives in the same time, but users see progress immediately.
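One way to do that with the AI SDK is streamObject, which emits partial objects as the model generates them. A sketch (not implemented in the repo; streamValidationReport is a name I made up):
import { streamObject } from 'ai'

// Sketch: stream partial report objects and push each snapshot over the existing SSE channel
async function streamValidationReport(prompt: string) {
  const result = streamObject({
    model: expensiveModel,
    system: AI_IDEA_VALIDATOR_SYSTEM_PROMPT,
    schema: validationReportSchema,
    prompt
  })

  for await (const partial of result.partialObjectStream) {
    console.log(partial.summary) // in practice: forward each partial snapshot to the SSE stream
  }

  return await result.object // the final, schema-validated report
}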
3. Parallel Research Execution
Tasks currently run sequentially. Many are independent and could run in parallel:
const independentTasks = tasks.filter(t => !t.dependsOn)
const results = await Promise.all(
independentTasks.map(task => executeValidationTask(input, task, []))
)
Expected impact: 30-40% faster validations.
What I Learned
1. Prompt engineering is most of the work
The 730-line system prompt is more important than all the TypeScript code combined. You can have perfect architecture but if your prompts are vague, the output will be useless.
Spend time here. Iterate. Test. Refine. It's 70%+ of your quality.
2. Structured outputs are non-negotiable
Never parse LLM text output. Use schemas. The AI SDK's Output.object() with Zod schemas means you get typed, validated objects every time.
Before: "Let me regex parse this text and hope the format doesn't change"
After: Guaranteed structure, TypeScript types, runtime validation
3. The two-model strategy actually works
I was skeptical about mixing models, but the data is clear: research tasks don't need GPT-5.2. Use GPT-5 mini for research and GPT-5.2 only for final synthesis.
This approach is 3-4x cheaper than using GPT-5.2 for everything, with zero quality loss.
4. Constraints improve quality
Unlimited searches = bad results. Forcing the agent to plan and use limited searches made it more strategic and actually improved output quality.
Sometimes less autonomy is better.
5. Real-time UX matters more than I thought
SSE streaming took 2 hours to implement and made the experience 10x better. Seeing progress removes anxiety about whether it crashed or is still working.
Small effort, huge UX impact.
The Honest Take
This tool won't guarantee your idea succeeds. It won't replace talking to real users. But it will:
- Find competitors you didn't know about
- Surface red flags early
- Give you specific next steps to validate assumptions
- Force you to look at evidence instead of just vibes
I've used it on 6 ideas in the last month. Two got brutal scores with clear reasons why (saved me months). Three got mid-range scores with specific pivots. One got 8/10 and I'm building it now.
That's the value: not predicting success, but helping you avoid obvious failures faster.
Try it: github.com/kzeitar/idea-sieve
Contribute: PRs welcome for better prompts, additional frameworks, or UX improvements.