Case study · 2025 to present
Praxium: source-grounded, narrated courses from any PDF
A SaaS that ingests a PDF and produces a structured, narrated course: personas, ABCD learning objectives, failure-mode analysis, a hierarchical outline, source-grounded per-subsection content with verbatim reference snippets, inline micro-checks (matching, ordering, fill-blank), KaTeX equations, end-of-lesson MCQs with confetti, narrated Synthesia videos, and a retrieval-grounded chat over the source. Driven by a resumable, multi-state instructional-design state machine, with a multi-dimension eval harness that scores coverage, grounding, coherence, persona alignment, and MCQ quality.
Lead engineer · Toptal engagement
Stack
React 18 · TypeScript · Vite · Tailwind · Radix / shadcn · TanStack Query · react-markdown + remark-math + rehype-katex · KaTeX · canvas-confetti · @xyflow/react + dagre · PostHog · FastAPI · Pydantic v2 · SQLAlchemy 2 async · Alembic · PostgreSQL · pgvector · Redis · Celery · Anthropic Claude (instructional pipeline) · OpenAI SDK · Cohere Embed v4 via AWS Bedrock · Landing AI ADE Parse Jobs · Synthesia · ElevenLabs · ffmpeg, LibreOffice, Playwright (slide rendering) · AWS S3 (presigned URLs) · Stripe (Checkout, Billing Portal, webhooks) · Clerk (JWT + Svix webhook) · Langfuse + OpenTelemetry (Anthropic instrumentation) · Sentry
Outcomes
- Live product at getpraxium.ai covering ingestion, parsing, instructional-design generation, source-grounded narrated lessons, payments, and retrieval-grounded explore mode.
- Resumable workflow state machine with explicit wait states for persona pick, instructional-goal approval, objectives review, and failure-mode analysis; checkpoints on disk plus DB-backed resume paths so 30-minute generations never replay Claude from scratch.
- Multi-dimension eval harness scoring coverage (objective keyword hit rate), grounding (verbatim snippet fuzzy-match against chunk text), coherence (LLM judge), persona alignment (LLM judge), and MCQ quality (programmatic trap-option detection, longest-option bias guard, duplicate detection).
- Source-truth grounding via Landing AI ADE chunking with grounding boxes, BM25 plus cosine hybrid retrieval over Cohere Embed v4 embeddings on AWS Bedrock, and Claude prompts that require verbatim reference snippets keyed to a chunk id on every key point.
- Learner engagement surface: inline micro-checks in three variants (matching, ordering, fill-blank), end-of-lesson MCQ pass with dual-corner canvas-confetti, KaTeX equations with $...$ and $$...$$ delimiters and code-fence carve-outs, persisted quiz scores, and lesson progress streaks.
- Async Synthesia video generation with polling, per-key-point clips, a structured per-course videos lifecycle, and a full delete-and-regenerate flow; ElevenLabs TTS plus ffmpeg slide movies as a multimedia worker path.
- Production observability with Langfuse plus OpenTelemetry's Anthropic instrumentor, Sentry on both Python and the frontend, and Stripe event durability via a dedicated event service.
What I own
End-to-end product engineering: the agentic instructional-design pipeline, the workflow state machine, the eval harness, the source-truth grounding stack, the inline learner-engagement components, the video generation lifecycle, the multi-tenant data model on async SQLAlchemy and Postgres, and the API contract the frontend consumes. Stripe, Clerk, S3, and the observability wiring (Langfuse, OpenTelemetry, Sentry) are mine too.
Workflow state machine
The README narrates a short ladder of steps; the orchestrator behind it is materially richer. The workflow state machine exposes a multi-step DAG with explicit wait states (personas, persona-pick wait, duration wait, instructional-goal approval, objectives review, failure modes, outline, content generation, GENERATED, failures) so a 30-minute course generation never replays Claude from scratch on a transient failure. Checkpoints persist to disk next to the per-course S3 metadata, with DB-backed resume paths so the frontend can poll progress while the backend keeps working.
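The checkpoint/resume shape can be sketched as follows. This is a minimal illustration, not the production code: state names, the `checkpoint.json` filename, and the `Checkpoint` class are assumptions chosen to mirror the states described above.

```python
import json
import pathlib
from dataclasses import dataclass, field
from enum import Enum

class State(str, Enum):
    PERSONAS = "personas"
    PERSONA_PICK_WAIT = "persona_pick_wait"          # wait state: learner input
    GOAL_APPROVAL_WAIT = "goal_approval_wait"        # wait state: human approval
    OBJECTIVES_REVIEW_WAIT = "objectives_review_wait"
    FAILURE_MODES = "failure_modes"
    OUTLINE = "outline"
    CONTENT = "content"
    GENERATED = "generated"

# States where the machine parks until the frontend posts a decision.
WAIT_STATES = {State.PERSONA_PICK_WAIT, State.GOAL_APPROVAL_WAIT,
               State.OBJECTIVES_REVIEW_WAIT}

@dataclass
class Checkpoint:
    course_id: str
    state: State = State.PERSONAS
    artifacts: dict = field(default_factory=dict)  # step outputs so far

    def save(self, root: pathlib.Path) -> None:
        """Persist state + artifacts so a crash never replays earlier steps."""
        path = root / self.course_id / "checkpoint.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(
            {"state": self.state.value, "artifacts": self.artifacts}))

    @classmethod
    def resume(cls, root: pathlib.Path, course_id: str) -> "Checkpoint":
        """Pick up where the last run stopped; fresh runs start at PERSONAS."""
        path = root / course_id / "checkpoint.json"
        if not path.exists():
            return cls(course_id)
        data = json.loads(path.read_text())
        return cls(course_id, State(data["state"]), data["artifacts"])
```

The orchestrator advances the DAG, saving after every completed step, and simply returns when it reaches a state in `WAIT_STATES`; the frontend later resumes the same course by id.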
Eval harness
The course-quality evaluator pulls finished courses from a per-course S3 prefix and scores five dimensions. Coverage is the fraction of objective stem words appearing in aggregated subsection text. Grounding is the share of key-point reference snippets that fuzzy-match the cited chunk text at a 0.8 SequenceMatcher threshold, with a substring fast path. Coherence and persona alignment use cached Claude judges driven by rubric YAML prompts. MCQ quality runs programmatic checks (trap options like “all/none of the above,” longest-option bias guard, duplicate detection, a minimum of three options). Judge calls are cached to disk between runs. A separate retrieval study compares keyword, BM25, semantic cosine, hybrid RRF, and optional Cohere rerank against the ground-truth chunk set. Both write their reports into the repo as markdown.
Source-truth grounding
PDFs are parsed by Landing AI ADE Parse Jobs into chunk-id-keyed markdown chunks with grounding boxes (page, bbox). At authoring time, a hybrid retriever (BM25 fused with cosine over Cohere Embed v4 embeddings on AWS Bedrock) selects the chunks Claude sees, and the prompt contract requires every key point to attach a verbatim reference snippet keyed to a chunk id. The eval harness then re-verifies those snippets against chunk text, so an ungrounded sentence loses its grounding score and shows up in the generated quality report. Explore mode runs the same retriever at top-k 4 and a 0.3 similarity threshold to ground answers on neighboring chunks.
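The fusion step of the hybrid retriever can be sketched as reciprocal-rank fusion over the BM25 and cosine rankings. This is the standard RRF form; the constant k=60 is the conventional default, not a value confirmed from the codebase.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion of ranked chunk-id lists:
    score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; chunks found by both retrievers rise.
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

A chunk ranked well by both the lexical and the semantic retriever outranks one that only a single retriever surfaces, which is the property that makes hybrid retrieval robust to both paraphrase and exact-term queries.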
Equation rendering
The subsection-authoring prompt fixes notation: $...$ for inline, $$...$$ for display, backticks for code, and an explicit ban on bare currency symbols to avoid runaway math regions. The frontend imports KaTeX styles globally; the Markdown component wires remark-math and rehype-katex; a formatted-text helper segments inline text, gates $...$ bodies with a math-like guard, and renders display blocks with \displaystyle. Micro-check leaf components inherit KaTeX typography classes so math renders cleanly inside fill-blank and matching answers.
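The production guard lives in the TypeScript frontend; the gating idea can be shown in a short Python sketch (regexes and function names are illustrative): a `$...$` body is promoted to math only if it looks math-like, so bare currency stays plain text.

```python
import re

# Heuristic: a $...$ body counts as math only if it contains math-looking
# characters or common LaTeX commands; "$5 and $10" stays plain text.
MATH_LIKE = re.compile(r"[=^_\\{}]|\b(frac|sum|int|sqrt|alpha|beta)\b")
INLINE_DOLLAR = re.compile(r"\$([^$\n]+)\$")

def segment_inline(text: str) -> list[tuple[str, str]]:
    """Split text into ('text'|'math', body) segments, gating $...$ spans."""
    out, pos = [], 0
    for m in INLINE_DOLLAR.finditer(text):
        body = m.group(1)
        if not MATH_LIKE.search(body):
            continue  # not math-like: leave the dollars in the text run
        if m.start() > pos:
            out.append(("text", text[pos:m.start()]))
        out.append(("math", body))
        pos = m.end()
    if pos < len(text):
        out.append(("text", text[pos:]))
    return out
```

Gating on the span body rather than the delimiter alone is what prevents a sentence like "raise $5M, spend $2M" from opening a runaway math region.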
Learner engagement
Inline micro-checks come in three variants (matching, ordering, fill-blank), gated to zero, one, or two per lesson by prompt contract. End-of-lesson quizzes trigger a dual-corner canvas-confetti burst on a passing score; a progress hook PATCHes the quiz score to the course progress API, and a local streak counter persists viewed lesson IDs. The diagrams surface uses @xyflow/react with a dagre layout pass. Progress reads and writes are designed so the engagement layer fails independently of the LLM layer.
Production posture
Clerk JWT middleware plus a Svix-verified webhook syncs users. Stripe checkout, portal, and webhooks sit behind a thin client and a dedicated event service, with durable event handling. Synthesia uses two template IDs and persists a videos.json under the course metadata. S3 is laid out as courses/{course_id}/{source|metadata|content|grounding}/. Observability initialises Langfuse alongside OpenTelemetry’s Anthropic instrumentor and flushes on shutdown; Sentry is wired on both Python and the frontend. The Dockerfile bakes LibreOffice, ffmpeg, and Playwright Chromium for slide and video paths.
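Given the stated layout, a key helper might look like the following (the helper name and validation are assumptions; the prefix scheme is the one described above):

```python
# The four per-course areas under the courses/ prefix.
VALID_AREAS = {"source", "metadata", "content", "grounding"}

def course_key(course_id: str, area: str, filename: str) -> str:
    """Build an S3 object key under the per-course prefix."""
    if area not in VALID_AREAS:
        raise ValueError(f"unknown area: {area!r}")
    return f"courses/{course_id}/{area}/{filename}"
```

Centralising key construction keeps the parser, the generator, the video worker, and the eval harness agreeing on where artifacts like `videos.json` live.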
Lessons
Breaking generation into checkpoints with explicit wait states buys you retries and product surfaces (persona pick, approvals) without replaying Claude from scratch, something a single-shot script cannot match when a run lasts tens of minutes. Faithfulness turns out to be orthogonal to readability: lexical grounding checks are inexpensive and objective, coherence and persona judgments need slow, noisy judges, and embedding drift metrics only help once you tolerate false alarms on legitimate paraphrase. Engagement work here is disproportionately about React affordances: motion budgets, Markdown math correctness, micro-check UX, celebration timing, and persistence of quiz attempts; almost none of it is “another paragraph from the LM.” Keeping those layers separate avoids the trap of asking the tutor model to compensate for unclear progress UI.
This engagement is ongoing; this page will be updated as the product evolves.