
Voice Tutor — Spanish

Voice-first AI agent for learning Spanish through natural conversation — real-time STT, LLM orchestration, TTS, 5 teaching modes, and persistent sessions.

  • Voice-first: no typing, just talk
  • 5 modes: lesson, quiz, roleplay, chat, doubt resolution
  • Real-time STT → LLM → TTS pipeline
  • Admin CRUD UI for lessons and quizzes
  • Per-turn observability (STT, LLM, TTS, e2e latency)

Problem

Most language apps rely on tapping buttons and filling in blanks. Learners rarely speak out loud, so they don't build real conversation skills.

Approach

Built a voice-first agent with Deepgram STT → GPT-4o tool dispatch → Cartesia TTS, running over LiveKit WebRTC. The agent switches between 5 modes (casual chat, structured lessons, quizzes, roleplay conversation, doubt resolution), semantically grades answers via LLM, persists sessions to Neon Postgres, and logs per-turn latency metrics.
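
The mode switching described above can be pictured as a small dispatcher: the LLM picks a tool name and arguments, and the agent runs the matching handler against the session. This is a minimal sketch, not the project's actual code. `Session`, `start_lesson`, `start_quiz`, `TOOLS`, and `dispatch` are hypothetical names, and only two of the tools are shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Session:
    """Minimal per-learner state (assumed shape; the real schema is not shown here)."""
    mode: str = "chat"
    history: list = field(default_factory=list)


def start_lesson(session: Session, args: dict) -> str:
    session.mode = "lesson"
    return f"Starting lesson: {args.get('topic', 'basics')}"


def start_quiz(session: Session, args: dict) -> str:
    session.mode = "quiz"
    return f"Starting quiz: {args.get('topic', 'basics')}"


# Hypothetical registry: tool name chosen by the LLM -> handler function.
TOOLS: Dict[str, Callable[[Session, dict], str]] = {
    "start_lesson": start_lesson,
    "start_quiz": start_quiz,
}


def dispatch(session: Session, tool_name: str, args: dict) -> str:
    """Run the selected tool and record the call in the session history."""
    handler = TOOLS.get(tool_name)
    if handler is None:
        # Unknown tool: keep the current mode instead of failing the turn.
        return "Sorry, I can't do that yet."
    reply = handler(session, args)
    session.history.append((tool_name, args))
    return reply
```

In a setup like this, the string returned by `dispatch` would be what gets handed to TTS, and an unrecognized tool name leaves the session mode untouched.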

Value

Learners practice real spoken conversation instead of tapping a screen. Each turn is meant to feel like talking to a tutor, and session summaries carry each learner's progress into the next session.

Snapshot

User speaks → Deepgram STT → GPT-4o picks a tool (lesson, quiz, roleplay, etc.) → Cartesia TTS talks back — all over LiveKit WebRTC with barge-in support, persistent sessions, and an admin UI for CRUD on lessons, quizzes, and questions.
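
Barge-in is the least obvious part of that flow, so here is a rough sketch of the idea using plain asyncio: playback runs as a task and is cancelled the moment a "user started speaking" event fires. This is not the LiveKit API. `speak`, `turn_with_barge_in`, and `demo` are hypothetical names, and the sleeps stand in for real audio playback.

```python
import asyncio


async def speak(chunks, played):
    """Stand-in for TTS playback: 'plays' one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.05)  # simulated playback time per chunk
        played.append(chunk)


async def turn_with_barge_in(chunks, user_started_speaking, played):
    """Play until playback finishes or the user speaks, whichever comes first.

    Returns True if the user interrupted the agent.
    """
    playback = asyncio.create_task(speak(chunks, played))
    interrupt = asyncio.create_task(user_started_speaking.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = interrupt in done and not playback.done()
    # Cancel whichever task is still pending, then wait for the cleanup.
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return interrupted


async def demo():
    played = []
    event = asyncio.Event()
    # Simulate the learner cutting in partway through the reply.
    asyncio.get_running_loop().call_later(0.12, event.set)
    interrupted = await turn_with_barge_in(
        ["Hola,", "¿cómo", "estás", "hoy?"], event, played
    )
    return interrupted, played
```

Running `asyncio.run(demo())` should report an interruption with only the first part of the reply played. In the real pipeline the event would come from voice activity detection rather than a timer.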

Stack

  • Python (FastAPI)
  • Next.js + TypeScript
  • LiveKit Cloud (WebRTC)
  • Deepgram (STT)
  • Cartesia (TTS)
  • OpenAI GPT-4o
  • Neon Postgres

Role

  • Agentic system design
  • Voice pipeline (STT → LLM → TTS)
  • Full-stack build
  • Admin CRUD UI
  • Per-turn observability

Outcomes

  • Hands-free spoken conversation practice
  • 5 teaching modes with semantic LLM grading
  • Persistent sessions with resume capability
  • Admin CRUD interface for content management
  • Per-turn latency metrics (STT, LLM TTFT, TTS TTFB, e2e)
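
To illustrate the LLM grading listed above, here is a minimal sketch of grading by meaning instead of exact wording. The prompt text, the JSON verdict shape, and the `grade_answer` function are assumptions for illustration. The `llm` argument is any callable that takes a prompt and returns text, so the sketch does not depend on a specific client library.

```python
import json
from typing import Callable

# Assumed prompt; the project's real grading prompt is not shown in this write-up.
GRADING_PROMPT = """You are grading a Spanish learner's spoken answer.
Question: {question}
Reference answer: {reference}
Learner's answer (from speech-to-text): {answer}
Judge meaning, not exact wording. Reply with JSON only:
{{"correct": true or false, "feedback": "<one short sentence>"}}"""


def grade_answer(question: str, reference: str, answer: str,
                 llm: Callable[[str], str]) -> dict:
    """Ask the LLM for a verdict and parse it defensively."""
    prompt = GRADING_PROMPT.format(
        question=question, reference=reference, answer=answer
    )
    raw = llm(prompt)
    try:
        verdict = json.loads(raw)
        return {
            "correct": bool(verdict["correct"]),
            "feedback": str(verdict.get("feedback", "")),
        }
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Malformed model output: fall back to a safe, non-crashing result.
        return {
            "correct": False,
            "feedback": "Could not grade this answer; please try again.",
        }
```

Passing the model as a callable keeps the grading logic testable with a stub, which matters because model output is not guaranteed to be valid JSON.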

Build notes

  • Deepgram nova-3 for STT, Cartesia aura-2 for TTS — both chosen for speed.
  • GPT-4o orchestrates 9 tools: start lesson/quiz, submit answer, track mistakes, score progress, etc.
  • Answers semantically graded via LLM — not keyword-matched.
  • Session summaries feed personalization; pick up where you left off.
  • Barge-in cuts off TTS, and the agent starts listening again immediately.
  • Every turn logs STT text, LLM TTFT, TTS TTFB, and e2e latency with P50/P95 at session end.
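
The per-turn logging and end-of-session P50/P95 summary could be sketched as below. The class and field names (`TurnMetrics`, `SessionMetrics`, `llm_ttft_ms`, and so on) are hypothetical, and the nearest-rank percentile is one common definition chosen for simplicity, not necessarily the one the project uses.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TurnMetrics:
    stt_text: str        # transcript returned by STT for this turn
    llm_ttft_ms: float   # LLM time to first token
    tts_ttfb_ms: float   # TTS time to first byte
    e2e_ms: float        # end of user speech to start of agent audio


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class SessionMetrics:
    turns: list = field(default_factory=list)

    def log_turn(self, turn: TurnMetrics) -> None:
        self.turns.append(turn)

    def summary(self) -> dict:
        """P50/P95 per latency metric across all logged turns."""
        out = {}
        for name in ("llm_ttft_ms", "tts_ttfb_ms", "e2e_ms"):
            samples = [getattr(t, name) for t in self.turns]
            out[name] = {
                "p50": percentile(samples, 50),
                "p95": percentile(samples, 95),
            }
        return out
```

Calling `summary()` once at session end gives one P50/P95 pair per metric, which is usually enough to spot whether the LLM or the TTS stage dominates the end-to-end latency.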

Roadmap

  • Multi-language support (beyond Spanish).
  • Pronunciation scoring and feedback.
  • Adaptive difficulty based on learner performance.
