Case Study:

Saga — Building Observability for an AI Travel Extraction System

Impact Summary

Built an internal tool to see what our AI extraction system was actually doing (instead of guessing)
Created a real dataset + ground truth so we could measure prompt performance properly
Found major issues early (e.g. flights had 0% perfect extraction)
Improved key areas like flight parsing by ~40%+ through targeted iterations
Helped shift the team from “tweak and hope” → structured testing and iteration
Pushed the system from a single prompt → a more reliable multi-step workflow

Context

Saga is an AI-powered product by DTravel that connects to a user’s email and photo library to automatically reconstruct travel history. It parses unstructured inputs—flight confirmations, hotel bookings, tickets, and receipts—and converts them into structured trips visualized on a map.

The underlying system relied on LLM-based prompt extraction to convert raw emails into structured travel activities.

Early on, it became clear that accuracy at this layer was existential to the product. Trips are auto-generated—there’s no manual fallback—so if parsing fails, the experience breaks completely. Users don’t debug—they churn.

Problem

Despite the sophistication of the extraction system, it functioned as a black box.

No visibility into what data was being extracted from individual emails
No way to systematically evaluate prompt performance
No prompt versioning or controlled testing
Debugging required manual database tracing (Gmail ID → extraction ID → metadata)
Conflicting, anecdotal user feedback with no system-level signal

At the same time, the system itself was fragile:

A single prompt handled classification + extraction
Failures were hard to isolate (was it classification? parsing? formatting?)
Improvements in one area often caused regressions elsewhere

This wasn’t just a tooling gap—it was a systems problem.

My Role

This was a self-initiated project.

I identified the gap while working with the extraction system and proposed a solution to the team via a detailed strategy doc and Slack proposal—covering dataset design, evaluation methodology, and system architecture.

Execution velocity

MVP built solo in 4–5 days (including weekend/out-of-hours)
Deployed to dev and accessible to the team within ~1 week after buy-in
First measurable results within 2 weeks, with continuous iteration thereafter

I then:

Defined the evaluation model (field accuracy vs record accuracy)
Designed the workflows (dataset creation, labeling, prompt testing)
Built the internal tool end-to-end using Next.js
Drove alignment across product and engineering to adopt the system

This wasn’t just a UI layer—I was effectively defining how the team would reason about and improve an AI system.

Approach

The key insight was:

We weren’t blocked by model capability—we were blocked by lack of visibility.

Instead of continuing to tweak prompts blindly, I reframed the problem as building an evaluation and observability system around the model.

The goal was to create a tight feedback loop between:

Raw input (emails)
Expected output (ground truth)
Model output (LLM extraction)
Measurable evaluation

1. Creating a Training Dataset Pipeline

Screenshot of the email queue interface, showing a list of forwarded emails with metadata like sender, subject, and date. — Email Queue / Dataset Inbox

We introduced a shared inbox where team members could forward real travel emails.

This allowed us to:

Build a centralized dataset of real-world inputs
Continuously expand coverage across providers (Airbnb, Booking.com, airlines, etc.)
Avoid synthetic or overly clean test data

2. Ground Truth Labeling Interface

I designed a UI that allowed team members to manually label each email:

Travel relevance (yes/no)
Structured attributes (dates, location, provider, cost, etc.)

This produced a normalized JSON representation of truth, which became the benchmark for evaluation.

Ground truth labeling interface showing an email with structured fields to fill in such as travel relevance, dates, location, and provider. — Ground Truth Labeling UI

3. Prompt Testing Environment

I built a prompt iteration interface where:

Prompts could be edited and saved
Specific datasets could be selected for testing
Each email was processed through the LLM

The system compared LLM output vs ground truth JSON.

Prompt editing interface where prompts can be written, saved, and run against selected email datasets. — Prompt Playground

Dataset selector showing available email samples grouped by provider for targeted prompt testing. — Dataset Selector

4. Evaluation & Metrics Layer

Instead of binary success/failure, the tool exposed granular metrics:

Field-level accuracy (e.g., date correctness, location accuracy)
Detection accuracy (is this a travel email or not)
Provider extraction correctness

This enabled:

Identifying partial improvements
Detecting regressions across specific fields
Understanding trade-offs between prompt changes

Reports tab showing overall evaluation metrics including field accuracy, missing fields, and incorrect extraction rates across all emails. — Reports Dashboard

Detailed report view showing per-field accuracy breakdown with pass, fail, and missing counts for each extraction field. — Metrics Detail View

Key Decisions & Tradeoffs

Real Data over Synthetic Data

Using forwarded emails introduced noise and inconsistency, but ensured the system was optimized for real-world variability rather than ideal cases.

Manual Labeling Investment

Ground truth labeling required upfront effort, but created a durable evaluation asset that improved over time.

Granular Metrics vs Simplicity

Rather than a single accuracy score, we exposed multi-dimensional metrics to reflect the true complexity of extraction tasks.

System over Prompt Tweaks

Instead of jumping to a more complex multi-step prompting strategy (classification → extraction → validation), I prioritized building visibility first—ensuring we could measure impact before increasing system complexity.

Outcome

Main dashboard of the internal observability tool showing email processing status, dataset stats, and high-level extraction metrics. — Internal Observability Tool — Dashboard

The tool transformed prompt tuning from guesswork into a measurable system:

Established a baseline evaluation across 100+ real emails
Shifted the team from anecdotal debugging to metric-driven iteration
Made regressions immediately visible at a field level
Enabled targeted improvements by isolating failure categories (e.g. flights vs stays)
Introduced a repeatable workflow the team could use continuously (dataset → test → compare → iterate)

Baseline Findings

Email-level observability view showing extracted fields alongside the raw email content for side-by-side comparison. — Email-level Extraction Observability

The first evaluation exposed how far the system was from production readiness:

Perfect record rate: 30.3% overall, but <4% for actual travel emails
Field accuracy: 63.4%
Missing fields: 25.8%
Incorrect fields: 26.2%

Errors were evenly split between:

Recall issues (missing fields)
Precision issues (incorrect extraction)

This indicated systemic weaknesses in both field detection and value extraction—not just a single failure point.

Flights emerged as the critical failure category:

0% perfect extraction
High incorrect values for structured fields (dates, airports, flight numbers)

Iteration Example — Flights

Using the tool, I focused on improving flight extraction through:

Better few-shot examples
Stronger field-level instructions
More explicit schema constraints

Results after one iteration:

Field accuracy improved from 53.3% → 76.1% (+42.8%)
Incorrect fields reduced from 29.8% → 6.1% (-79.5%)
Perfect records increased from 0% → 8.3%

This demonstrated the value of targeted, measurable improvements over broad prompt changes.

System-Level Learnings

The evaluation also revealed deeper architectural gaps:

LLMs struggled with timezone accuracy (DST issues) → required deterministic post-processing
Critical data in PDF attachments was completely missed → required pre-processing pipeline
One-shot prompting was insufficient → needed multi-step workflows (classification → extraction → validation)

This led to a shift from a single prompt system to a composable extraction pipeline.

Reflection

This project reinforced that designing AI products is not just about model output—it’s about building the systems around the model.

By introducing structure, evaluation, and visibility, we were able to treat prompt engineering as a product surface rather than a hidden backend concern.

It also highlighted the value of working across disciplines—combining UX thinking, engineering execution, and product strategy to solve a fundamentally ambiguous problem.