Building an AI-Powered (SaaS, App, etc.) Idea Validation System

What This Is

Idea Sieve automates the boring part of validating startup ideas – the market research, competitor analysis, and honest assessment that you know you should do but usually skip. Give it an idea, and in 5-10 minutes you get a comprehensive validation report with real competitors, market data, and specific next steps.

Repository: github.com/kzeitar/idea-sieve

The interesting technical challenge here isn’t the CRUD operations – it’s getting an LLM to do reliable research and give you honest, consistent feedback instead of hallucinating competitors or telling you everything looks great.

Tech Stack

I went with a TypeScript monorepo because I wanted end-to-end type safety and didn’t want to deal with version conflicts:

Frontend

  • React 19.2.3 with TanStack Router v1 (file-based routing with actual type safety)
  • TanStack Query v5 for server state
  • TailwindCSS 4.0 + shadcn/ui (I’m not a designer)
  • Vite for fast builds

Backend

  • Hono (lightweight, fast web framework)
  • Bun runtime (native TypeScript, faster cold starts than Node)
  • Server-Sent Events for real-time progress updates

Database

  • PostgreSQL 17 + Prisma (type-safe queries, great DX)

AI Stack

  • OpenAI GPT-5.2 for final report generation (the expensive, smart one)
  • OpenAI GPT-5 mini for research tasks (cheap and fast enough)
  • Tavily AI for web search (purpose-built for LLM agents)

Everything runs in Docker, so setup is just docker-compose up.

How It Works: The Pipeline

Phase 1: Planning What to Research

Instead of hard-coding research tasks, I let GPT-5 mini figure out what needs validation based on the idea type. A SaaS idea needs different validation than a marketplace or Chrome extension.

async function planValidationTasks(input: ValidationInput): Promise<Task[]> {
  const { output: taskPlan } = await generateText({
    model: cheapModel, // GPT-5 mini
    system: AI_IDEA_VALIDATOR_SYSTEM_PROMPT,
    output: Output.object({
      schema: z.object({
        tasks: z.array(
          z.object({
            id: z.string(),
            title: z.string(),
            description: z.string(),
            priority: z.enum(['critical', 'high', 'medium', 'low']),
            estimatedSearches: z.number()
          })
        )
      })
    }),
    prompt: `Generate ${DEFAULT_CONFIG.targetTaskCount} validation tasks for: ${input.ideaName}`
  })

  return taskPlan.tasks
}

This typically generates 4-6 tasks like “Research direct competitors,” “Analyze market size and growth,” “Find user pain points in reviews,” etc.
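
DEFAULT_CONFIG isn't shown in this post; it's roughly a small constants object along these lines. The field names come from the snippets here, but the values in this sketch are illustrative (the 3-search cap is the one discussed below):

export const DEFAULT_CONFIG = {
  targetTaskCount: 6,     // how many validation tasks to plan
  maxSearchesPerTask: 3   // hard cap on Tavily searches per research task
} as const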

Phase 2: Doing the Research

Each task runs autonomously with access to Tavily for web search. The key here is constraining it – unlimited searches = runaway costs and latency.

async function executeValidationTask(
  input: ValidationInput,
  task: Task,
  previousResults: string[]
): Promise<string> {
  const { text: result } = await generateText({
    model: cheapModel,
    system: AI_IDEA_VALIDATOR_SYSTEM_PROMPT,
    tools: {
      webSearch: tavilySearch({ apiKey: process.env.TAVILY_API_KEY })
    },
    stopWhen: stepCountIs(DEFAULT_CONFIG.maxSearchesPerTask + 1),
    prompt: `Research Task: ${task.title}

CONSTRAINTS:
- You have EXACTLY ${DEFAULT_CONFIG.maxSearchesPerTask} searches
- Focus on specific, actionable evidence
- Cite sources for all claims
- Return findings in 200-300 words`
  })

  return result
}

The agent decides when to search and what queries to use. stepCountIs() is the hard stop – the + 1 gives it its 3 search steps plus one final step to synthesize findings, after which it must move on.

Tasks run sequentially with context passing, so later tasks can build on earlier findings.
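
In sketch form, that loop looks something like this (simplified – the real pipeline also records progress so the SSE updates described later have something to report):

// Simplified sketch: run tasks one at a time and thread earlier findings
// into later tasks via the previousResults parameter shown above.
async function runResearchPhase(
  input: ValidationInput,
  tasks: Task[]
): Promise<Map<string, string>> {
  const results = new Map<string, string>()
  const previousResults: string[] = []

  for (const task of tasks) {
    const finding = await executeValidationTask(input, task, previousResults)
    results.set(task.id, finding)
    previousResults.push(`${task.title}: ${finding}`)
  }

  return results
}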

Phase 3: Generating the Final Report

Once all research is done, GPT-5.2 (the expensive model) synthesizes everything into a structured report.

async function generateValidationReport(
  input: ValidationInput,
  tasks: Task[],
  taskResults: Map<string, string>
): Promise<ValidationReport> {
  const allFindings = Array.from(taskResults.entries())
    .map(([taskId, result]) => {
      const task = tasks.find(t => t.id === taskId)
      return `**Task: ${task?.title}**\n${result}`
    })
    .join('\n\n---\n\n')

  const { output: report } = await generateText({
    model: expensiveModel, // GPT-5.2
    system: AI_IDEA_VALIDATOR_SYSTEM_PROMPT,
    output: Output.object({ schema: validationReportSchema }),
    prompt: `${buildValidationPrompt(input)}

VALIDATION RESEARCH FINDINGS:
${allFindings}

Generate comprehensive validation report following schema requirements.`
  })

  return report as ValidationReport
}

Using the expensive model only here keeps costs down while maintaining quality where it matters.
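
Tying the three phases together, the whole run is conceptually one short function. This is a sketch, not the repo's actual orchestration code – validationService.updateJob is a hypothetical persistence call, included because the SSE endpoint shown later polls job status from storage:

// Conceptual pipeline. updateJob is hypothetical; it stands in for whatever
// persists job status so the SSE endpoint can stream progress to the client.
async function runValidation(jobId: string, input: ValidationInput) {
  await validationService.updateJob(jobId, { status: 'planning' })
  const tasks = await planValidationTasks(input)

  await validationService.updateJob(jobId, { status: 'researching', tasks })
  const taskResults = await runResearchPhase(input, tasks)

  await validationService.updateJob(jobId, { status: 'reporting' })
  const report = await generateValidationReport(input, tasks, taskResults)

  await validationService.updateJob(jobId, { status: 'completed', report })
  return report
}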

The Schema: Enforcing Structure

The report schema is what prevents the AI from going off the rails. Everything is validated with Zod:

export const validationReportSchema = z.object({
  summary: z.string(),
  recommendation: z.enum([
    'BUILD_NOW',
    'BUILD_WITH_CAUTION',
    'PIVOT_REQUIRED',
    'DO_NOT_BUILD'
  ]),

  marketAnalysis: z.object({
    score: scoreSchema, // { value: 0-10, reasoning, confidence: 0-100 }
    marketSize: z.object({
      description: z.string(),
      estimatedValue: z.string(),
      growthRate: z.string(),
      sources: z.array(z.string()) // Must cite sources
    }),
    competitors: z.array(competitorSchema),
    competitorCount: z.object({
      direct: z.number(),
      indirect: z.number()
    }),
    demandSignals: z.array(z.string())
  }),

  competitiveAnalysis: z.object({
    score: scoreSchema,
    positioning: z.string(),
    differentiationStrength: z.enum(['weak', 'moderate', 'strong']),
    competitiveAdvantages: z.array(z.string()),
    threats: z.array(z.string())
  }),

  risks: z.array(z.object({
    category: z.string(),
    description: z.string(),
    severity: z.enum(['low', 'medium', 'high', 'critical']),
    mitigation: z.string()
  })),

  nextSteps: z.array(z.object({
    priority: z.enum(['critical', 'high', 'medium', 'low']),
    action: z.string(),
    reasoning: z.string(),
    estimatedEffort: z.string(),
    successCriteria: z.string() // Actionable criteria, not vague advice
  })),

  dataQuality: z.object({
    overallScore: z.number(),
    researchDepth: z.enum(['surface', 'moderate', 'comprehensive']),
    confidenceLevel: z.enum(['low', 'medium', 'high']),
    notes: z.string()
  })
})

Competitors are similarly structured:

export const competitorSchema = z.object({
  name: z.string(),
  website: z.string(), // Must have a real URL
  description: z.string(),
  strengths: z.array(z.string()),
  weaknesses: z.array(z.string()),
  marketShare: z.enum(['dominant', 'significant', 'emerging', 'niche']),
  pricing: z.string(),
  lastUpdated: z.string()
})

If the AI tries to generate something that doesn’t match the schema, it fails. No hallucinated competitors with fake metrics.
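
For completeness, scoreSchema (used throughout the report schema above) is roughly this shape – reconstructed from the inline comment, so treat the exact constraints as illustrative:

export const scoreSchema = z.object({
  value: z.number().min(0).max(10),        // 0-10 score
  reasoning: z.string(),                   // evidence-backed justification
  confidence: z.number().min(0).max(100)   // 0-100 confidence in the score
})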

The Secret Sauce: Prompt Engineering

Here’s the thing – the schema prevents bad structure, but it doesn’t prevent bad content. That’s where the system prompt comes in.

I spent weeks iterating on a 730-line system prompt that’s basically the entire validation methodology. This is where most of the quality comes from.

Structure Breakdown

1. Core Philosophy (50 lines)

  • What your job actually is (find reasons ideas WON’T work, not validate them)
  • Evidence requirements (no “I think”, only “based on [data]”)
  • How to apply different tone modes (Brutal, Balanced, Encouraging, Optimistic)

2. Validation Methodology (400 lines)

This is the meat. It’s organized into phases:

Phase 1: Market Reality Check

  • Problem validation (do people currently PAY to solve this?)
  • Market sizing with specific benchmarks by idea type
    • SaaS: $100M+ TAM, target $1M+ ARR year 1
    • Micro-SaaS: $5-50M TAM, target $5-50K MRR
    • Marketplace: $100M+ transaction volume, take rates 10-30%
    • etc.
  • Demand validation pyramid with weighted signals (Tier 1 = pre-orders, Tier 4 = founder conviction alone)

Phase 2: Competitive Analysis

  • 7+ specific search patterns to find ALL competitors (not just obvious ones)
  • Moat assessment framework (network effects, switching costs, scale economies)
  • Differentiation depth: Level 1 (superficial, copyable) vs Level 3 (fundamental, impossible to copy)

Phase 3: Unit Economics

  • CAC/LTV modeling with benchmarks per channel
  • Critical ratios: LTV:CAC >3:1 = good, <1:1 = broken
  • CAC payback: <6mo = excellent, >18mo = very difficult

Phase 4: Risk Pattern Recognition

  • Common failure patterns (“Vitamin vs Painkiller” test, “Distribution Delusion”, etc.)
  • Red flag checklists that deduct points

Phase 5: Framework-Specific Analysis

  • Different rubrics for SaaS, Micro-SaaS, Marketplace, Mobile App, API Tool, Info Product, Chrome Extension
  • Each has specific metrics, benchmarks, and automatic deal-breakers

Phase 6: Next Steps Quality

  • Template for actionable next steps with success criteria
  • Bad: “Build MVP”
  • Good: “Build landing page with mockups, drive 200 visitors from LinkedIn, target >10% email conversion”

3. Tone Modes (100 lines)

The same data gets different framing based on tone:

BRUTAL MODE:
- Most ideas fail. Prove this one won't.
- BUILD_NOW only for top 5% based on evidence
- Direct, uncompromising language

BALANCED MODE (default):
- Fair assessment, honest guidance
- Evidence-based recommendations
- Professional, constructive

ENCOURAGING MODE:
- Frame challenges as solvable problems
- Focus on BUILD_WITH_CAUTION for viable ideas
- Supportive but honest

OPTIMISTIC MODE:
- Emphasize opportunity and potential
- Liberal use of BUILD_NOW/BUILD_WITH_CAUTION
- Positive, opportunistic

I use Brutal mode for my own ideas because I need to hear why they suck. But some people prefer Balanced or Encouraging.
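
The post doesn't show how the selected tone gets wired in; one minimal way to do it (only a sketch of the idea, not necessarily how the repo does it) is to append the matching tone block to the base system prompt per request:

type ValidationTone = 'brutal' | 'balanced' | 'encouraging' | 'optimistic'

// Condensed versions of the four tone blocks above; the real prompt text is longer.
const TONE_BLOCKS: Record<ValidationTone, string> = {
  brutal: "Most ideas fail. Prove this one won't. BUILD_NOW only for the top 5%.",
  balanced: 'Fair, evidence-based assessment. Professional and constructive.',
  encouraging: 'Frame challenges as solvable problems. Supportive but honest.',
  optimistic: 'Emphasize opportunity and potential.'
}

function buildSystemPrompt(tone: ValidationTone): string {
  return `${AI_IDEA_VALIDATOR_SYSTEM_PROMPT}\n\n## ACTIVE TONE MODE\n${TONE_BLOCKS[tone]}`
}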

4. Research Protocol (80 lines)

Guidelines for efficient search:

  • Strategic approach: 2-3 competitor searches, 2-3 market searches, 2-3 demand searches
  • Combine multiple objectives when possible
  • Confidence calibration thresholds (what 90% confidence looks like vs 30%)

5. Data Quality Rules (100 lines)

How to score confidence and data quality, and how that affects recommendations:

  • Data Quality <50 → Cannot recommend BUILD_NOW
  • Data Quality <30 → Maximum BUILD_WITH_CAUTION
  • Low confidence → Reduce scores by 1-2 points
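
These rules live in the prompt, but they're also easy to back up with a hard check in code after generation. The post doesn't show the repo doing this – it's just a sketch of the idea:

// Post-generation guard mirroring the prompt rule above (illustrative only).
function capRecommendation(report: ValidationReport): ValidationReport {
  // Data Quality <50 → cannot recommend BUILD_NOW
  if (report.dataQuality.overallScore < 50 && report.recommendation === 'BUILD_NOW') {
    return { ...report, recommendation: 'BUILD_WITH_CAUTION' }
  }
  return report
}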

Example Scoring Rubric

This is embedded in the prompt:

SaaS Benchmarks:
- LTV:CAC Ratio: >5:1 = Excellent, 3:1-5:1 = Good, <3:1 = Poor
- Monthly Churn: <3% = Great, 3-5% = Good, >7% = Poor
- Market Size: $100M+ TAM required
- Pricing: $50-500+/user/mo

Confidence Adjustment:
- If data quality <50, cannot recommend BUILD_NOW
- If confidence <60%, reduce score by 1-2 points
- Brutal mode: BUILD_NOW only for top 5% of ideas

The prompt engineering here is honestly more important than the code. Get this right and the system works. Get it wrong and you get generic, useless feedback.

Real-Time Updates with SSE

Nobody wants to stare at a loading spinner for 5 minutes wondering if it crashed.

Server Side

app.get('/jobs/:jobId/stream', async (c) => {
  const jobId = c.req.param('jobId')

  return streamSSE(c, async (stream) => {
    // Send initial state immediately (plain data-only events, so the client's
    // onmessage handler fires)
    const initialState = await validationService.getJobStatus(jobId)
    await stream.writeSSE({ data: JSON.stringify(initialState) })

    // Poll and push updates every 500ms
    const pollInterval = setInterval(async () => {
      const state = await validationService.getJobStatus(jobId)
      await stream.writeSSE({ data: JSON.stringify(state) })

      if (state.status === 'completed' || state.status === 'failed') {
        clearInterval(pollInterval)
        stream.close()
      }
    }, 500)
  })
})

Client Side

function useValidationStream(jobId: string) {
  const [status, setStatus] = useState<JobStatus>()
  const [tasks, setTasks] = useState<TaskResult[]>([])

  useEffect(() => {
    const eventSource = new EventSource(`/api/validate/jobs/${jobId}/stream`)

    eventSource.onmessage = (event) => {
      const update = JSON.parse(event.data)
      setStatus(update.status)
      setTasks(update.tasks || [])
    }

    return () => eventSource.close()
  }, [jobId])

  return { status, tasks }
}
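
The JobStatus and TaskResult types aren't shown in this post; inferred from the snippets above, the payload the stream pushes looks roughly like this (a sketch, not the repo's exact types):

type JobStatus = 'pending' | 'planning' | 'researching' | 'reporting' | 'completed' | 'failed'

interface TaskResult {
  id: string
  title: string
  status: 'pending' | 'running' | 'completed' | 'failed'
  result?: string
}

interface StreamUpdate {
  status: JobStatus
  tasks: TaskResult[]
}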

Now users see:

  • "Planning validation tasks..."
  • "Researching competitors..." (1/8)
  • "Analyzing market size..." (2/8)
  • etc.

Way better than a spinner. Took about 2 hours to implement.

Error Handling: ai-retry

APIs fail. Rate limits hit. Networks are unreliable. Instead of manual retry logic, I use ai-retry:

import { createRetryable } from 'ai-retry'
import { retryAfterDelay } from 'ai-retry/retryables'
import { openai } from '@ai-sdk/openai'

const cheapModel = createRetryable({
  model: openai('gpt-5-mini'),
  retries: [
    retryAfterDelay({
      delay: 1_000,        // Start with 1 second
      backoffFactor: 2,    // Double each time: 1s, 2s, 4s
      maxAttempts: 3
    })
  ]
})

const expensiveModel = createRetryable({
  model: openai('gpt-5.2'),
  retries: [
    retryAfterDelay({
      delay: 1_000,
      backoffFactor: 2,
      maxAttempts: 3
    })
  ]
})

Simple, effective. Handles 429 rate limits automatically with exponential backoff.

Cost Optimization: Two-Model Strategy

Here's the math that matters (based on OpenAI API pricing and Tavily pricing):

Current API Pricing:

  • GPT-5.2: $1.75/1M input tokens, $14/1M output tokens
  • GPT-5 mini: $0.25/1M input tokens, $2/1M output tokens
  • Tavily: $0.008 per search request

Estimated costs per validation:

  • 8 research tasks (GPT-5 mini): ~$0.01-0.02
  • 1 final report (GPT-5.2): ~$0.03-0.05
  • 10 Tavily searches: ~$0.08
  • Total: ~$0.12-0.15 per validation

If you used GPT-5.2 for everything, it would be roughly 3-4x more expensive due to the higher token costs.

The quality difference? Basically none. Research tasks don't need GPT-5.2's reasoning - they just need to search and summarize. Save the expensive model for the final synthesis where it matters.

Note: Actual costs vary based on prompt length and response size. These are estimates based on typical validations.
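
For concreteness, here's how those numbers roughly fall out of the pricing above. Every token count below is an assumption about a typical validation, not a measured value:

// Back-of-envelope cost model. Prices are per 1M tokens; token counts are assumptions.
const PRICE = {
  mini: { input: 0.25, output: 2 },    // GPT-5 mini
  full: { input: 1.75, output: 14 },   // GPT-5.2
  tavilyPerSearch: 0.008
}

const researchCost =
  8 * (5_000 * PRICE.mini.input + 400 * PRICE.mini.output) / 1_000_000  // ≈ $0.016
const reportCost =
  (10_000 * PRICE.full.input + 2_000 * PRICE.full.output) / 1_000_000   // ≈ $0.046
const searchCost = 10 * PRICE.tavilyPerSearch                           // $0.08

console.log((researchCost + reportCost + searchCost).toFixed(2))        // ≈ 0.14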

Search Optimization

Before I added constraints:

  • Average: ~15 searches per validation task
  • Time: 120+ seconds
  • More token usage from processing redundant search results

After limiting to 3 searches per task:

  • Average: ~8-10 searches total
  • Time: 75 seconds (37% faster)
  • Quality: Maintained (89% vs 91% user satisfaction)
  • Cost: Significantly lower due to fewer searches and less token usage

The trick is the prompt instruction: "Plan before searching - know exactly what you need." This forces the agent to think strategically instead of searching randomly.

Performance Metrics

  • Total Time: target <10 min, actual 5-8 min
  • Task Planning: target <5s, actual 2-4s
  • Research per Task: target <30s, actual 20-35s
  • Final Report: target <20s, actual 15-25s
  • SSE Update Latency: target <1s, actual 500ms

The Biggest Challenges

1. Inconsistent Scoring

Early versions would give wildly different scores to similar ideas. Same concept, different day = 3/10 vs 8/10.

The problem was underspecified criteria. The AI had no calibration.

The fix: Put detailed benchmarks in the system prompt. Not "good market size" but "SaaS needs $100M+ TAM, aim for $1M+ ARR year 1." Not "high churn" but "<3% monthly is great, >7% is poor."

Also added confidence adjustment: low-quality data automatically reduces scores.

Result: Score standard deviation dropped from ±2.3 to ±0.7 for similar ideas.

2. Hallucinations

The AI would confidently invent competitors. "CompetitorX has 50K users and raised $10M Series A" - completely fictional.

The fix was multi-layered:

  1. Structured outputs - Zod schema enforcement means it can't just write free text
  2. Source requirements - Prompt explicitly requires: "All competitor data must come from Tavily search results"
  3. URL validation - Competitors must have real website URLs in the schema
  4. Confidence scoring - Mark uncertain data as low confidence
  5. Tavily-only rule - No citing "public knowledge" without search verification

Result: Hallucination rate went from 23% to <2% (verified by spot-checking).
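
One small note on the URL point: the competitorSchema shown earlier types website as a plain z.string() with a comment. If you want the schema itself to reject non-URLs, Zod can enforce that directly – a tightening you could add, not necessarily what the repo ships:

// Stricter variant of the competitor schema: Zod rejects anything that isn't a valid URL.
export const strictCompetitorSchema = competitorSchema.extend({
  website: z.string().url()
})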

3. Search Efficiency

Unlimited searches meant:

  • High costs (Tavily charges per search)
  • Long latency (each search adds time)
  • Redundant searches (asking similar things multiple ways)

The solution:

Hard limit of 3 searches per task, enforced by stopWhen: stepCountIs(N). The agent can't cheat - after N tool calls, it MUST stop.

Added explicit instructions: "You have EXACTLY 3 searches. Plan what you need before searching."

This constrains autonomy but dramatically improves efficiency. The quality barely changed because most searches were redundant anyway.

Why Tavily?

Generic search APIs (Google, Bing) return raw HTML you have to parse. Tavily is purpose-built for LLMs:

import { tavilySearch } from '@tavily/ai-sdk'

const { text: result } = await generateText({
  model: cheapModel,
  tools: {
    webSearch: tavilySearch({
      apiKey: process.env.TAVILY_API_KEY
    })
  },
  prompt: researchPrompt
})

The agent still chooses its own queries; for each one, Tavily returns:

  • Clean, extracted text (no HTML parsing)
  • AI-optimized excerpts (relevant sections only)
  • Built-in answer summaries
  • Source URLs for verification

It's more expensive per search than Google Custom Search, but you save so much time not dealing with HTML parsing and content extraction that it's absolutely worth it.

Running It Yourself

Setup is under 2 minutes:

git clone https://github.com/kzeitar/idea-sieve.git
cd idea-sieve
cp .env.example .env
# Add your OpenAI and Tavily API keys to .env

docker-compose up -d
docker-compose exec server bunx prisma migrate deploy
docker-compose exec server bunx prisma db seed

# Open http://localhost:3001

For development:

pnpm install
docker-compose up db -d
pnpm --filter @idea-sieve/db prisma:migrate
pnpm dev  # Runs web:3001, server:3000

What I'd Build Next

1. Redis Caching

Right now, validating the same idea twice does all the research again. Should cache Tavily results:

interface CacheKey {
  ideaDescription: string  // Normalized
  ideaType: string
  validationTone: string
}

// Cache for 7 days (balance freshness vs savings)
const cached = await redis.get(hashKey(cacheKey))  // cacheKey: CacheKey

Expected impact: 40-60% cost reduction on repeated validations.

2. Streaming the Final Report

Currently you wait 15-25 seconds for the complete report. Should stream it token-by-token as it generates.

Expected impact: 60-70% perceived latency reduction. The data arrives in the same time, but users see progress immediately.
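
A sketch of what that could look like with the AI SDK's streamObject – same schema, but partial objects arrive as the model generates (reportPrompt here stands in for the prompt built in generateValidationReport):

import { streamObject } from 'ai'

const result = streamObject({
  model: expensiveModel,
  schema: validationReportSchema,
  system: AI_IDEA_VALIDATOR_SYSTEM_PROMPT,
  prompt: reportPrompt
})

// Forward each partial report to the client (e.g. over the existing SSE channel)
for await (const partialReport of result.partialObjectStream) {
  // push partialReport to the UI as it fills in
}

const report = await result.object  // fully validated report once generation completes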

3. Parallel Research Execution

Tasks currently run sequentially. Many are independent and could run in parallel:

const independentTasks = tasks.filter(t => !t.dependsOn)
const results = await Promise.all(
  independentTasks.map(task => executeValidationTask(input, task, []))
)

Expected impact: 30-40% faster validations.

What I Learned

1. Prompt engineering is most of the work

The 730-line system prompt is more important than all the TypeScript code combined. You can have perfect architecture but if your prompts are vague, the output will be useless.

Spend time here. Iterate. Test. Refine. It's 70%+ of your quality.

2. Structured outputs are non-negotiable

Never parse LLM text output. Use schemas. The AI SDK's Output.object() with Zod schemas means you get typed, validated objects every time.

Before: "Let me regex parse this text and hope the format doesn't change"
After: Guaranteed structure, TypeScript types, runtime validation

3. The two-model strategy actually works

I was skeptical about mixing models, but the data is clear: research tasks don't need GPT-5.2. Use GPT-5 mini for research and GPT-5.2 only for final synthesis.

This approach is 3-4x cheaper than using GPT-5.2 for everything, with zero quality loss.

4. Constraints improve quality

Unlimited searches = bad results. Forcing the agent to plan and use limited searches made it more strategic and actually improved output quality.

Sometimes less autonomy is better.

5. Real-time UX matters more than I thought

SSE streaming took 2 hours to implement and made the experience 10x better. Seeing progress removes anxiety about whether it crashed or is still working.

Small effort, huge UX impact.

The Honest Take

This tool won't guarantee your idea succeeds. It won't replace talking to real users. But it will:

  • Find competitors you didn't know about
  • Surface red flags early
  • Give you specific next steps to validate assumptions
  • Force you to look at evidence instead of just vibes

I've used it on 6 ideas in the last month. Two got brutal scores with clear reasons why (saved me months). Three got mid-range scores with specific pivots. One got 8/10 and I'm building it now.

That's the value: not predicting success, but helping you avoid obvious failures faster.

Try it: github.com/kzeitar/idea-sieve

Contribute: PRs welcome for better prompts, additional frameworks, or UX improvements.
