Workflows and Evals
A workflow makes the same change across many files. You define the transformation once in your task instructions, specify where to apply it with a query, then run. Tern handles the repetition.
Anatomy of a Workflow
Two pieces:
Instructions define what should happen. This is a markdown-structured document the AI agent reads: steps to follow, context to gather, validations to run. See Task Instructions for the full syntax.
Query defines where to apply it. This finds matching files across your codebase. See Search Syntax for how to write precise queries.
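As a purely illustrative sketch (the real structure and syntax are documented in Task Instructions and Search Syntax), a workflow's instructions might read like the hypothetical markdown below:

```markdown
<!-- Hypothetical instructions file; the exact structure is defined in Task Instructions. -->
## Context
The codebase is migrating from callback-style APIs to async/await.

## Steps
1. Find every callback-style call in this file.
2. Rewrite each call to use async/await.
3. Update the surrounding error handling to use try/catch.

## Validation
- The file still type-checks.
- No callback-style calls remain.
```

The query then narrows the run to the files this applies to, for example everything under a given directory that still contains a callback-style call.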
Running a Workflow
When you run a workflow, each matched file gets its own agent session. The agent reads your instructions, executes the steps, and applies changes. You can configure multiple agent sessions per file, including adversarial LLM checks that validate results before accepting them.
The result is a grid: every matching file × every step. You can see which cells passed or failed, inspect timing, and read the raw agent conversations. Changes land on disk for you to review and commit.
Testing Workflows with Evals
The problem: you want to iterate on your instructions, but you don’t want to break what was already working.
Golden files solve this. These are files from previous runs where the transformation succeeded. They’re locked to the git SHA they came from, preserving the codebase exactly as it was when that run happened.
An eval run takes your current instructions and runs them against the golden files at their original SHA. Prompts are versioned implicitly on every change, so you’re comparing: does this version of the instructions still produce correct results?
Eval runs are different from normal runs. Normal runs transform code and leave changes for you to commit. Eval runs are ephemeral, and nothing changes on disk. You’re accumulating results to score the prompt, not modifying code.
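Conceptually, an eval run boils down to something like the following sketch. This is not Tern's API; `runInstructions` and the golden-file shape are hypothetical stand-ins used only to show what gets compared.

```js
// Conceptual sketch only -- not Tern's API. Each golden file records its path,
// the git SHA it was captured at, and the result that counted as correct.
async function scoreEval(instructions, goldenFiles, runInstructions) {
  const results = [];
  for (const golden of goldenFiles) {
    // Re-run the current instructions against the codebase pinned to the
    // golden file's original SHA; nothing is written to disk.
    const output = await runInstructions(instructions, golden.path, golden.sha);
    results.push({ file: golden.path, passed: output === golden.expected });
  }
  return results;
}
```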
What You Can Measure
Success: Did the transformation apply correctly? Eval runs compare against your golden results to catch regressions.
Timing: How long did each step take? Consistency matters. If a step takes 10 seconds on one file and 2 minutes on another, something’s different about those files.
Cost: Tokens and dollars spent on the run.
Codemod Steps
Some transformations are mechanical—the same AST change on every file. For these, Tern can generate a codemod instead of running an LLM per file.
When you mark a step as a codemod, Tern generates a JavaScript transform from your instructions, validates it, then runs it directly. Faster and cheaper than per-file LLM calls. Codemods auto-invalidate if you change the step instructions.
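Tern generates the transform for you, but as a rough illustration of what a mechanical AST change looks like, here is a hand-written sketch in the style of a jscodeshift transform. The function rename is hypothetical, and Tern's generated output may be structured differently.

```js
// Hypothetical example: rename every call to `fetchUserSync` to `fetchUser`.
// Written against the jscodeshift API purely as an illustration of a
// mechanical, per-file AST transform.
module.exports = function transformer(file, api) {
  const j = api.jscodeshift;
  return j(file.source)
    .find(j.CallExpression, { callee: { name: 'fetchUserSync' } })
    .forEach((path) => {
      path.node.callee.name = 'fetchUser';
    })
    .toSource();
};
```

Because the transform is plain code, it runs in milliseconds per file and costs nothing in tokens once generated.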
See Task Instructions for syntax.
