Evaluation of LLM Outputs

Rubrics, LLM-as-Judge, and Iterative Improvement Loops

Week 10 of Phase 3: Advanced Patterns (Weeks 9-12)

Contents

Lecture, Practice, and Discussion for Week 10

📖 1. Lecture
  • The evaluation gap — we build, but how do we measure quality?
  • LLM-as-a-Judge — using LLMs to score LLM outputs
  • Three strategies: naive, refine, aware; then iterate
💻 2. Practice
  • Build a Hometown Introduction Generator
  • Compare 3 generation strategies + multi-model evaluation + iterative loop
🗣️ 3. Discussion
  • Week 9 — Ambiguity vs Clarity: why agents fail at vague tasks
  • Connection: rubrics as disambiguation mechanism

Part 1: Lecture

How to Measure Whether an LLM Output is Good

The Evaluation Gap — Where We Are

📄 What We've Built So Far
  • Weeks 5-7: chat, metadata, debate
  • Week 9: agents that call tools
  • All produce outputs — but how good are they?
The Uncomfortable Question
  • "Looks good to me" — but is that enough?
  • Different LLMs give different answers — which is best?
  • You changed the prompt — did it actually improve?
  • Without measurement, you're flying blind
🎯 Today's Goal
  • Turn "good" from a feeling into a score
  • Build a system that measures itself, then improves itself

Why Evaluating LLMs is Hard

🤔 "Good" is Subjective
  • "Write me a report on Seoul" → infinite valid answers
  • No single ground truth like math problems
  • Reasonable people disagree on what's "best"
💡 The Trick — Operationalize "Good"
  • Don't try to define "good" in one word
  • Break it into measurable criteria: completeness, accuracy, structure, engagement
  • Each criterion gets a score → total = quality estimate
📋 This is a Rubric
  • Same idea as a grading rubric in school
  • Subjective overall, but each criterion is concrete enough to score
  • Once you have a rubric, you can measure ANYTHING
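A minimal sketch of the arithmetic (criterion names and scores here are illustrative):

# sketch: a rubric turns "good" into numbers
scores = {"completeness": 7, "accuracy": 9, "structure": 6, "engagement": 8}
quality = sum(scores.values()) / len(scores)  # 7.5, a single quality estimate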

LLM-as-a-Judge — The Core Idea

🧑‍⚖️ The Setup
  • You have an output from LLM A
  • You ask LLM B (the judge): "Score this on these criteria"
  • LLM B returns numerical scores per criterion
  • Average = quality estimate
Why It Works
  • LLMs are good at applying explicit criteria to text
  • Cheap and fast (vs hiring human raters)
  • Reproducible (same input → similar score)
⚠️ Limits to Know
  • Same model judging itself = biased (likes its own style)
  • Use a different model as judge when possible
  • Judge can be wrong — sanity check on a few examples

Three Strategies — Where Does the Rubric Go?

1️⃣ Naive — Just Ask
  • "Write an introduction to my hometown"
  • No criteria mentioned, LLM guesses what's wanted
  • Baseline — simplest possible approach
2️⃣ Refine — Ask Then Improve
  • Step 1: get a draft (naive)
  • Step 2: "Here's a draft, here's the rubric — rewrite to score better"
  • Two LLM calls, but the rubric is applied AFTER
3️⃣ Aware — Include Rubric Upfront
  • "Write an introduction. It will be scored on these criteria: [rubric]"
  • Single LLM call, rubric embedded in the request
  • The model knows the test before it writes
"Same task, three ways to use the rubric. Which gives the highest score?"

The Iterative Loop — Generate, Score, Improve

graph LR
    A[Initial Output] --> B[LLM Judge Scores]
    B --> C{Stopping Condition?}
    C -->|No| D[Find Weak Criteria]
    D --> E[Targeted Refinement]
    E --> B
    C -->|Yes| F[✅ Final Output]
    style A fill:#fff3cd
    style B fill:#cce5ff
    style D fill:#ffe5cc
    style E fill:#ffe5cc
    style F fill:#d4edda
💡 Two Stopping Strategies
  • Threshold mode — stop when all criteria ≥ target (e.g., all ≥ 8)
  • Self-evolving mode — keep going until best score hasn't improved in N iterations (patience)
  • Threshold = "good enough"; Self-evolving = "push to the ceiling"
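In code, the two stopping conditions are each a single expression (a sketch; scores, threshold, no_improve, and patience match the practice code in Steps 5a and 6a):

# sketch: the two stopping conditions
done_threshold = all(v >= threshold for v in scores.values())  # "good enough"
done_self_evolving = no_improve >= patience  # no new best for `patience` rounds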

Self-Evolving Mode — Push Beyond "Good Enough"

Track the best, refine from the best, stop on no improvement

📈 Track the Best
  • Don't just look at the latest iteration
  • Keep a record of the highest-scoring output so far
  • The latest iteration may have regressed — that's OK, we keep the best
🔄 Refine FROM the Best, Not the Latest
  • At each iteration, refine starting from the best text, not the last one
  • This prevents the loop from drifting downward after a bad refinement
  • Conceptually: hill-climbing with memory
⏱️ Patience-Based Stopping
  • Set patience = 3 — wait 3 iterations for a new best
  • If 3 consecutive iterations don't beat the best → stop
  • You've likely hit the model's ceiling for this task
"Threshold says 'I'm done when it's good enough.' Self-evolving says 'I'm done when I can't make it any better.'"

Multi-Model Comparison — Same Task, Different Models

🔬 Why Compare Models?
  • GPT-4, Claude, Gemini, Llama — all give different answers
  • Some are better at facts, others at structure, others at style
  • Without measurement, you'd just pick one and hope
📊 The Setup
  • Run the same prompt on N models
  • Score each output with the same rubric (using the same judge)
  • Plot: model on x-axis, score per criterion on y-axis
  • Now you can pick the right model for the right task
⚠️ Important — Use One Judge
  • Different judges give different absolute scores
  • For comparison, judge MUST be consistent across all candidates
  • Best: use a strong model as judge that didn't generate any candidate

Lecture Summary — Evaluation

📋 Rubrics
  • Operationalize "good" into measurable criteria
  • Same idea as grading rubrics, applied to AI outputs
  • This is the bridge from subjective to scorable
🧑‍⚖️ LLM-as-a-Judge
  • Use one LLM to score another's output
  • Use a different model as judge to avoid self-bias
  • Cheap, fast, reproducible — but verify on samples
🔁 Three Strategies + Two Loop Modes
  • Naive / Refine / Aware — three ways to use a rubric
  • Threshold mode: stop when good enough; Self-evolving mode: refine FROM best, stop on patience
  • Multi-model comparison picks the right tool for the job

Part 2: Practice

Hometown Introduction — Generate, Score, Improve

Practice Overview — What We'll Build

🎯 The Task
  • Generate an introduction to your hometown (1 paragraph)
  • Score it with a 5-criterion rubric
  • Compare 3 strategies + multiple models + iterative loop
📁 New File
  • evaluator.py — rubric, generation strategies, judge, loop
  • app.py — a standalone Evaluation Lab UI (single page)
Why Hometown?
  • Concrete (you know the ground truth — your own town)
  • Quality is genuinely subjective (good for rubric practice)
  • Different aspects (history / food / geography) → multi-criterion natural

Step 1 — Define the Rubric (evaluator.py)

Five criteria, each with an explicit definition

# evaluator.py
RUBRIC = {
    "completeness": (
        "Covers multiple aspects: history, culture, food, geography, and people. "
        "Score 0 if only one aspect; 10 if all five are present."
    ),
    "specificity": (
        "Uses concrete details (place names, dishes, festivals) rather than generic claims. "
        "Score 0 for vague platitudes; 10 for vivid specifics."
    ),
    "structure": (
        "Has logical flow with a clear opening, body, and closing. "
        "Score 0 for disorganized text; 10 for well-paragraphed prose."
    ),
    "engagement": (
        "Reads as something someone would WANT to read, not a list of facts. "
        "Score 0 for dry encyclopedia tone; 10 for vivid storytelling."
    ),
    "accuracy_caution": (
        "Avoids confident claims that could be wrong (specific dates, statistics). "
        "Score 0 if it invents specific facts; 10 if it stays within safe knowledge."
    ),
}
💡 Each Criterion is Self-Contained
  • Has a name, a definition, and a scoring rule
  • The judge doesn't need to guess what "engagement" means
  • This is the disambiguation work

Step 2 — Three Generation Strategies

One small llm_call helper, three different prompts

# evaluator.py — using google-genai SDK
def llm_call(client, model, prompt):
    """Single-turn call to a Gemini model."""
    resp = client.models.generate_content(
        model=model,
        contents=prompt,
    )
    return resp.text


def generate_naive(client, model, hometown):
    """Strategy 1: just ask, no rubric."""
    return llm_call(client, model, f"Write a 1-paragraph introduction to {hometown}.")


def generate_aware(client, model, hometown):
    """Strategy 3: include rubric in the prompt upfront."""
    rubric_text = "\n".join(f"- {k}: {v}" for k, v in RUBRIC.items())
    prompt = f"""Write a 1-paragraph introduction to {hometown}.

Your output will be scored on these criteria:
{rubric_text}

Address each criterion in your writing."""
    return llm_call(client, model, prompt)


def generate_refine(client, model, hometown):
    """Strategy 2: generate first, then refine with rubric."""
    draft = generate_naive(client, model, hometown)
    rubric_text = "\n".join(f"- {k}: {v}" for k, v in RUBRIC.items())
    prompt = f"""Original draft:
{draft}

Rewrite this to score better on these criteria:
{rubric_text}"""
    return llm_call(client, model, prompt)
💡 SDK Note
  • Uses google-genai (native Gemini SDK), not OpenAI-compatible
  • client.models.generate_content(model=..., contents=...) — single string in, .text out
  • All later functions reuse this single llm_call helper
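Before wiring up a UI, a one-off smoke test confirms the helper and your API key work (a sketch mirroring the client setup shown in Step 4a below; the filename is arbitrary):

# smoke_test.py — one-off check of the llm_call helper
import os
from dotenv import load_dotenv
from google import genai
from evaluator import generate_naive

load_dotenv()
client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))
print(generate_naive(client, "gemini-2.5-flash", "Daejeon, South Korea"))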

Step 3 — The LLM Judge (evaluator.py)

Same rubric, structured JSON output

import json, re

def evaluate(client, judge_model, text):
    """Score text on each rubric criterion (0-10)."""
    rubric_text = "\n".join(f"- {k}: {v}" for k, v in RUBRIC.items())
    prompt = f"""Score the following text on each criterion (0-10 integer).

Text:
{text}

Criteria:
{rubric_text}

Return ONLY a JSON object like:
{{"completeness": 7, "specificity": 8, "structure": 6, "engagement": 7, "accuracy_caution": 9}}"""

    raw = llm_call(client, judge_model, prompt)
    # Extract JSON from response (LLMs sometimes wrap it in code blocks)
    match = re.search(r"\{[^}]+\}", raw, re.DOTALL)
    if not match:
        return {k: 0 for k in RUBRIC}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {k: 0 for k in RUBRIC}
💡 Robust Parsing
  • LLMs may add commentary around JSON — extract with regex
  • Fall back to zero scores on parse failure (visible in UI)
  • For production: add retries with stricter formatting prompts
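A retry wrapper along those lines might look like this (a sketch, not part of the practice code; evaluate_with_retry is a hypothetical helper, and treating an all-zero result as a parse failure is a simplification):

def evaluate_with_retry(client, judge_model, text, retries=2):
    """Re-ask the judge when parsing fails; give up after `retries` extra attempts."""
    for _ in range(retries + 1):
        scores = evaluate(client, judge_model, text)
        if any(scores.values()):  # all-zero is our parse-failure sentinel
            return scores
    return {k: 0 for k in RUBRIC}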

Step 4a — App Setup (app.py — top of file)

Standalone Streamlit app — client, model, imports

# app.py
import os
from dotenv import load_dotenv
from google import genai

load_dotenv()

client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))
model = os.environ.get("LLM_MODEL", "gemini-2.5-flash")

import streamlit as st
import pandas as pd, altair as alt
from evaluator import (
    RUBRIC, generate_naive, generate_aware, generate_refine, evaluate,
    iterative_threshold, iterative_self_evolving
)
💡 Why a Standalone App?
  • This week's evaluator is independent of Weeks 5-7 (no shared tabs)
  • Run with streamlit run app.py — fresh single-page UI
  • LLM_MODEL env var lets you swap the generator without editing code
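For reference, a matching .env file (the key value is a placeholder):

# .env
GOOGLE_API_KEY=your-api-key-here
LLM_MODEL=gemini-2.5-flash

The fallback default is also gemini-2.5-flash, so the second line is optional.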

Step 4b — UI: Compare 3 Strategies

Header, hometown input, judge selector, then run all three

st.header("📊 Evaluation Lab — Hometown Introductions")
hometown = st.text_input("Your hometown", "Daejeon, South Korea")
judge = st.selectbox(
    "Judge model",
    ["gemini-3-flash-preview", "gemini-2.0-flash"],
)

if st.button("🚀 Run All 3 Strategies", disabled=not hometown):
    rows = []
    for name, fn in [
        ("Naive", generate_naive),
        ("Refine", generate_refine),
        ("Aware", generate_aware),
    ]:
        with st.spinner(f"{name}..."):
            text = fn(client, model, hometown)
            scores = evaluate(client, judge, text)
        st.subheader(f"{name} Strategy")
        st.write(text)
        st.json(scores)
        for k, v in scores.items():
            rows.append({"strategy": name, "criterion": k, "score": v})

    df = pd.DataFrame(rows)
    chart = alt.Chart(df).mark_bar().encode(
        x="score:Q", y="criterion:N", color="strategy:N", row="strategy:N",
    )
    st.altair_chart(chart, use_container_width=True)
💡 Run This First
  • Confirms rubric + judge + 3 strategies all work
  • Building block for the iterative loops next
  • Different model for generation (model) and judging (judge) — avoids self-bias

Step 5a — Threshold Loop (function in evaluator.py)

The simplest improvement loop — stop when good enough

def iterative_threshold(client, gen_model, judge_model, hometown,
                        max_iters=4, threshold=8):
    """Generate → score → refine weak criteria. Stop when all ≥ threshold."""
    text = generate_aware(client, gen_model, hometown)
    history = []

    for i in range(max_iters):
        scores = evaluate(client, judge_model, text)
        avg = sum(scores.values()) / len(scores)
        history.append({"iter": i, "text": text, "scores": scores, "avg": avg})

        weak = [k for k, v in scores.items() if v < threshold]
        if not weak:
            break  # all criteria ≥ threshold → done

        weak_defs = "\n".join(f"- {k}: {RUBRIC[k]}" for k in weak)
        prompt = f"""Current text:
{text}

It scored low on these criteria:
{weak_defs}

Rewrite the text to specifically improve these aspects. Keep what works."""
        text = llm_call(client, gen_model, prompt)

    return history
💡 The Logic
  • At each iteration, identify weak criteria (below threshold)
  • Refine targeting only those criteria
  • Exit as soon as every criterion is at or above the bar — "good enough"
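You can sanity-check the loop outside Streamlit before building the UI (a sketch; assumes the client from Step 4a and the generator/judge split used elsewhere this week):

# scratch check: watch the per-iteration averages climb
history = iterative_threshold(client, "gemini-2.5-flash", "gemini-2.0-flash",
                              "Daejeon, South Korea", max_iters=3, threshold=8)
for h in history:
    print(h["iter"], round(h["avg"], 1), h["scores"])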

Step 5b — UI: Threshold Loop (append to app.py)

Run this section by itself — independent button

# Append below the Compare-Strategies block
st.divider()
st.subheader("🎯 Threshold Loop")

th_max_iters = st.slider("Max iterations", 1, 10, 4, key="th_iters")
th_threshold = st.slider("Threshold per criterion", 5, 10, 8, key="th_th")

if st.button("🔁 Run Threshold Loop"):
    with st.spinner("Iterating..."):
        history = iterative_threshold(
            client, model, judge, hometown,
            max_iters=th_max_iters, threshold=th_threshold,
        )

    # Score trajectory per criterion
    traj = pd.DataFrame([
        {"iter": h["iter"], "criterion": k, "score": v}
        for h in history for k, v in h["scores"].items()
    ])
    line = alt.Chart(traj).mark_line(point=True).encode(
        x="iter:O", y="score:Q", color="criterion:N",
    )
    st.altair_chart(line, use_container_width=True)

    final = history[-1]
    st.subheader(f"Final — iter {final['iter']}, avg {final['avg']:.1f}")
    st.write(final["text"])
    st.json(final["scores"])
💡 What to Observe
  • Does the loop stop early (all ≥ threshold) or run to max?
  • Which criteria are the last to clear?
  • Try lowering threshold to 7 — does it stop faster?

One Step Further — Why Threshold Isn't Always Enough

From "good enough" to "best possible"

⚠️ Threshold Mode Limits
  • Stops at the FIRST output where all scores clear the bar
  • Doesn't try to keep improving past that point
  • The latest iteration may also have regressed in some criteria
💡 What If We Want the BEST?
  • Track the highest-scoring output seen so far (best)
  • Each refinement starts from best, not from latest
  • Stop only when we can't beat best for N rounds (patience)
  • This is self-evolving — pushing to the ceiling
🎯 When to Use Each
  • Threshold: production tasks where "passing the bar" is the goal
  • Self-evolving: research/exploration when you want maximum quality

Step 6a — Self-Evolving Loop (function in evaluator.py)

Track best, refine from best, patience-based stopping

def iterative_self_evolving(client, gen_model, judge_model, hometown,
                            max_iters=10, patience=3):
    """Track best, refine FROM best, stop when best hasn't improved in `patience` iters."""
    text = generate_aware(client, gen_model, hometown)
    history = []
    best = {"avg": -1, "text": "", "scores": {}, "iter": -1}
    no_improve = 0

    for i in range(max_iters):
        scores = evaluate(client, judge_model, text)
        avg = sum(scores.values()) / len(scores)
        history.append({"iter": i, "text": text, "scores": scores, "avg": avg})

        if avg > best["avg"]:
            best = {"avg": avg, "text": text, "scores": scores, "iter": i}
            no_improve = 0
        else:
            no_improve += 1

        if no_improve >= patience:
            break  # no new best for `patience` rounds → ceiling reached

        # KEY: refine FROM best['text'], not from latest text
        min_score = min(best["scores"].values())
        weak = [k for k, v in best["scores"].items() if v <= min_score + 1]
        weak_defs = "\n".join(f"- {k}: {RUBRIC[k]}" for k in weak)
        prompt = f"""Current best text (avg {best['avg']:.1f}):
{best['text']}

These criteria are the weakest in the current best:
{weak_defs}

Rewrite to push these specific criteria higher. Preserve the strengths."""
        text = llm_call(client, gen_model, prompt)

    return history, best
💡 The Three Key Differences from Threshold
  • Tracks best dictionary separately from text
  • Refines from best['text'] — not from latest (which might have regressed)
  • Stops on patience — N rounds without a new best
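To see the contrast concretely, run both loops on the same input and compare what each hands back (a sketch; client, model, judge, and hometown as in the app code):

# threshold returns its last iteration; self-evolving returns the best it ever saw
th_history = iterative_threshold(client, model, judge, hometown)
se_history, se_best = iterative_self_evolving(client, model, judge, hometown)
print("threshold final:", round(th_history[-1]["avg"], 1))
print("self-evolving best:", round(se_best["avg"], 1))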

Step 6b — UI: Self-Evolving Loop (append to app.py)

iterative_self_evolving is already included in the Step 4a import line, so no import changes are needed

# Append below the Threshold Loop block
st.divider()
st.subheader("🚀 Self-Evolving Loop")

se_max_iters = st.slider("Max iterations", 3, 20, 10, key="se_iters")
se_patience = st.slider("Patience (no-improve count)", 1, 5, 3, key="se_pat")

if st.button("🔁 Run Self-Evolving Loop"):
    with st.spinner("Iterating..."):
        history, best = iterative_self_evolving(
            client, model, judge, hometown,
            max_iters=se_max_iters, patience=se_patience,
        )

    # Plot avg per iteration AND running best (the ratchet)
    avg_df = pd.DataFrame([
        {"iter": h["iter"], "metric": "avg", "score": h["avg"]} for h in history
    ])
    running_best = []
    bs = -1
    for h in history:
        bs = max(bs, h["avg"])
        running_best.append({"iter": h["iter"], "metric": "best", "score": bs})

    line = alt.Chart(pd.concat([avg_df, pd.DataFrame(running_best)])).mark_line(
        point=True).encode(x="iter:O", y="score:Q", color="metric:N")
    st.altair_chart(line, use_container_width=True)

    # Show the BEST output (not the latest)
    st.subheader(f"Best — iter {best['iter']}, avg {best['avg']:.1f}")
    st.write(best["text"])
    st.json(best["scores"])
💡 Read the Chart
  • avg zigzags — some iterations regress, that's normal
  • best is monotone non-decreasing — the ratchet
  • When best plateaus for patience iters → the loop stops
  • Output shown is best, not latest — this is the whole point

Step 7 — Multi-Model Comparison

Same task, same judge, different generators

# Append below the Self-Evolving Loop block
st.divider()
st.subheader("🔬 Multi-Model Comparison")

candidate_models = st.multiselect(
    "Models to compare",
    ["gemini-2.5-flash", "gemini-2.5-pro", "gemini-2.0-flash", "gemini-3-flash-preview"],
    default=["gemini-2.5-flash", "gemini-2.0-flash"],
)

if st.button("⚖️ Compare Models", disabled=not candidate_models):
    rows = []
    for m in candidate_models:
        text = generate_aware(client, m, hometown)
        scores = evaluate(client, judge, text)
        for k, v in scores.items():
            rows.append({"model": m, "criterion": k, "score": v})

    df = pd.DataFrame(rows)
    chart = alt.Chart(df).mark_bar().encode(
        x="score:Q", y="criterion:N", color="model:N", row="model:N",
    )
    st.altair_chart(chart, use_container_width=True)
💡 Pick the Right Model for the Right Task
  • One model may be best at structure but weak at engagement
  • Now you have evidence, not vibes
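To read the result programmatically rather than eyeballing the chart, you can pull the winner per criterion from the same df (a sketch to append inside the Compare Models button block):

# highest-scoring model for each rubric dimension (first model wins ties)
winners = df.loc[df.groupby("criterion")["score"].idxmax(),
                 ["criterion", "model", "score"]]
st.dataframe(winners)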

Week 10 Practice Checklist

Complete these steps during the practice session:

1. - [ ] Define RUBRIC in evaluator.py with 5 criteria

2. - [ ] Implement generate_naive, generate_aware, generate_refine (using llm_call with google-genai SDK)

3. - [ ] Implement evaluate() with robust JSON parsing

4. - [ ] Set up app.py: load .env with GOOGLE_API_KEY, create the genai.Client, add imports

5. - [ ] Build the Compare 3 Strategies UI — does Aware score higher than Naive?

6. - [ ] Implement iterative_threshold() in evaluator.py

7. - [ ] Append the Threshold Loop UI section to app.py (independent button)

8. - [ ] Run with threshold=8 — does it stop early or hit max?

9. - [ ] Run with threshold=10 — does it now hit max iterations?

10. - [ ] Implement iterative_self_evolving() in evaluator.py

11. - [ ] Append the Self-Evolving Loop UI section to app.py

12. - [ ] Run with patience=3 — observe the avg vs best lines

13. - [ ] Compare the final avg score with the threshold loop's result — did self-evolving find a higher peak?

14. - [ ] Multi-model comparison — which model wins per criterion?

15. - [ ] Edit one criterion's definition to be vague — see scores become unstable

Part 3: Discussion

Ambiguity vs Clarity — Why Vague Instructions Break Agents

Week 9 — Why Do Agents Fail at Vague Tasks?

12 responses analyzed — a Hulk-Captain America coalition emerges

📊 The Vote Distribution
  • Cap + Hulk (2,3): Waad, Manuella, Margareth, Minh — 4 votes (largest)
  • Hulk only (3): DongYun, Yadanar, Han — 3 votes
  • Iron Man only (1): Tan — 1 vote
  • Iron Man + Hulk (1,3): Irfan — 1 vote
  • Iron Man + Cap (1,2): Jaewhoon — 1 vote
  • Captain America only (2): Ly — 1 vote
  • None: Huy — 1 vote (rejects all three frames)
🤝 The Coalition's Position
  • Humans must specify clearly AND verify outputs
  • Pure Iron Man (only engineering) has 1 lonely vote (Tan)
  • The class converges on: clarity is a human responsibility, but the system must support it
🔥 Three Solution Schools Within the Coalition
  • Design school: build better interfaces (Irfan, Tan)
  • Demand school: humans must be precise (Ly, Waad, Manuella)
  • Tradeoff school: explicit about flexibility ↔ uncertainty (Margareth, Minh)

Theme 1 — The Flexibility-Uncertainty Tradeoff

Margareth's framing — and why "design vs demand" is incomplete

⚖️ Margareth's Sharp Insight
  • "More ambiguity allows greater flexibility BUT increases uncertainty"
  • "More clarity requires more human effort and reduces the system to conventional programming"
  • There's no free lunch — every step toward clarity is a step away from flexibility
🚧 Adding the Translation Layer Doesn't Solve It
  • "Introducing a 'middle layer' to translate vague instructions does not remove ambiguity but shifts it into another potentially equally complex problem"
  • Iron Man's middleware fix relocates the problem; doesn't dissolve it
🎯 The Real Question
  • Not "should we design for ambiguity OR demand clarity?"
  • But: how much flexibility am I willing to trade for how much certainty?
  • This is a per-task design choice, not a universal answer
"More clarity reduces the system to something closer to conventional programming." — Margareth

Theme 2 — Two Deepest Critiques: Huy & Minh

Beyond all three personas

💎 Huy — "Metacognitive Clarity"
  • All three personas assume the human KNOWS what they want
  • At the research frontier: often you don't fully know yet
  • The real skill: "knowing how well you currently understand what you want to find out"
  • Be honest about that gap before deploying autonomy
⚠️ Minh — "Independence Threshold" + "Alignment Drift"
  • At what point do we transition AI from instruction-follower → autonomous decision-maker?
  • Granting AI independence to resolve vagueness → Alignment Drift
  • System optimizes for "likely path" that violates implicit human ethics or physical constraints
  • "Right goal through wrong and potentially catastrophic logic"
🔗 The Common Thread
  • Both Huy and Minh point to a problem the personas miss
  • Personas frame it as communication — they frame it as agency boundaries
  • The deeper question: how much should the AI fill in for us, ever?
"An agent might achieve the right goal through the wrong logic, simply because we failed to define the boundaries of its autonomy." — Minh

Theme 3 — What Vague Prompts Actually Look Like

Yadanar's concrete examples make the abstract real

📊 "Analyze this dataset" — Yadanar's Example
  • Doesn't say what kind of analysis you want
  • Model picks "wrong patterns or even makes stuff up"
  • Vivid case of guessing under ambiguity
💻 "Improve this code"
  • Could change speed, readability, structure
  • "But not necessarily what you intended"
  • Each interpretation is plausible — and most are wrong for YOU
🔬 Why These Examples Land
  • These are commands students actually type every day
  • Each one is one short sentence — none feel "vague" subjectively
  • But each contains 3-5 hidden parameters the model has to guess
  • Awareness of these hidden parameters = metacognitive clarity in practice
"AI doesn't truly understand intent — it just fills gaps based on probability." — Yadanar

How This Connects to Today's Practice

Jaewhoon's "feedback loop" hits the lecture's core idea

🔁 Jaewhoon — "The Feedback Loop is Key"
  • "When a user provides specific feedback on an unsatisfactory output, the AI incorporates that learning"
  • "Simple rejection doesn't help — clear, constructive criticism leads to better iterations"
  • This is exactly what today's iterative loop is
🎯 The Rubric IS the Constructive Feedback
  • "Score on completeness, specificity, structure..." — that's what Jaewhoon means by constructive criticism
  • The judge produces structured feedback the model can actually use
  • This is Irfan's "design for ambiguity" — operationalized in code
💎 Connecting Huy's Critique
  • Writing the rubric = forcing metacognitive clarity
  • If you can't write a rubric, you don't yet know what "good" means to you
  • The system doesn't replace this — it surfaces the gap
⚖️ Connecting Margareth's Tradeoff
  • More criteria in the rubric = more clarity, less flexibility
  • Self-evolving loop pushes for the score ceiling — but at the cost of generation freedom
  • Choose the rubric depth that matches the certainty you need

Activity — Write a Rubric for Your Research

10 minutes in pairs

📋 The Task
  • Pick one common output in your field (e.g., literature summary, code review, model evaluation)
  • Write a 4-5 criterion rubric, each with explicit scoring guidance
  • Trade rubrics with a partner — would you score the same way?
✏️ Required Elements
  • Each criterion needs a name, definition, and scoring guidance (what gets 0 vs 10)
  • Avoid criteria that overlap (e.g., "clarity" and "readability")
  • Make sure each criterion is independently scorable
The Test
  • Show your rubric to your partner WITHOUT showing the original task description
  • Can they guess what kind of output it's evaluating?
  • If yes → your rubric is specific enough
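If you want to test your rubric in code afterward, a skeleton in the same shape as Step 1's RUBRIC works as a starting point (names and definitions are placeholders for your own field):

# my_rubric.py, a hypothetical template mirroring Step 1's structure
MY_RUBRIC = {
    "criterion_1": "What it measures. Score 0 if ...; 10 if ...",
    "criterion_2": "What it measures. Score 0 if ...; 10 if ...",
    # 4-5 total, non-overlapping, each independently scorable
}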

🗣️ Week 10 Discussion Questions (UST LMS)

Visit: UST LMS → Class → Discussion

1. Today you saw three strategies (naive, refine, aware) plus two loop modes (threshold, self-evolving). Which combination gave the biggest improvement on your hometown task? Connect your finding to Jaewhoon's claim that "the feedback loop is key."

2. Margareth argued that more clarity reduces the system "to something closer to conventional programming." After writing a strict rubric today, did your final output feel more constrained — and was that good or bad? Where would you draw the line between flexibility and reliability for your own research?

Want to Learn More?

LLM Evaluation

📚 Anthropic: Evaluating LLM Outputs
📚 OpenAI Evals — Open-source Eval Framework
📚 LangSmith — Evaluation and Tracing


Research Papers

📄 [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (2023)](https://arxiv.org/abs/2303.16634)
📄 [Self-Refine: Iterative Refinement with Self-Feedback (2023)](https://arxiv.org/abs/2303.17651)
📄 [Constitutional AI — Anthropic](https://arxiv.org/abs/2212.08073)


Anthropic Free Online Courses

🎓 [Building with the Claude API](https://anthropic.skilljar.com/claude-with-the-anthropic-api)
🎓 [Introduction to Model Context Protocol](https://anthropic.skilljar.com/introduction-to-model-context-protocol)

Wrap-Up of Week 10

📖 Lecture
  • Rubrics turn "good" into measurable criteria; LLM-as-judge scores outputs cheaply; three strategies (naive / refine / aware) place the rubric at different points; iterative loop drives scores up
💻 Practice
  • Built a Hometown Introduction Evaluator: 5-criterion rubric, 3 strategies, LLM judge, multi-model comparison, and TWO iterative loops — threshold-based (good enough) and self-evolving (push to ceiling, refine from best, patience-based stopping)
🗣️ Discussion
  • Week 9 review (12 responses): Cap+Hulk coalition wins; Margareth's flexibility-uncertainty tradeoff; Huy's metacognitive clarity + Minh's Independence Threshold/Alignment Drift; Jaewhoon's "feedback loop is key" = exactly today's iterative practice

Next week: Multi-step planning — agents that decompose tasks themselves.
