Human-centered evaluation for AI agents
Name pronunciation: “Juh Soon”
I work at the boundary of academic AI research and product execution: turning models, evaluations, and human workflows into systems people can trust, inspect, and use.
Research threads I want the site to make easy to scan, cite, and discuss.
Evaluation
Benchmarks, rubric design, and product telemetry for measuring whether agentic systems behave well outside curated demos.
Interaction
Design patterns that expose uncertainty, provenance, and revision control without making advanced AI tools feel heavy.
Systems
Translating papers into resilient product architectures: retrieval, feedback loops, eval harnesses, and monitoring.
Selected product directions for translating research capability into useful workflows.
Research tooling
A product concept for collecting papers, extracting claims, comparing evidence, and producing citable research notes.
Enterprise AI
Dashboards, datasets, and release gates for teams deciding when an AI feature is good enough to ship.
Workflow design
Review queues, editable drafts, confidence thresholds, and intervention models for practical AI operations.
Knowledge systems
Structuring internal knowledge so retrieval and generation systems can stay grounded, current, and auditable.
“The work I want this site to foreground is research that survives contact with messy product reality: measurable, inspectable, and useful to the people making decisions.”
Reusable building blocks for the academic/product profile.
Concise positioning for publications, preprints, talks, and ongoing questions.
View research
Case-study slots for shipped tools, prototypes, evaluations, and strategy work.
View products
Essays that connect model capability, evaluation culture, and product judgment.
Read writing
Clear entry points for collaboration with labs, founders, and product teams.
Start a conversation
A practical note on choosing eval metrics before the prototype starts shaping the research question.
Why legibility, interruption, and repair often matter more than an impressive autonomous run.
Patterns for making model confidence useful without burying users in instrumentation.
How to keep academic rigor alive when the work also has to ship.