Rubrics, LLM-as-Judge, and Iterative Improvement Loops
Lecture, Practice, and Discussion for Week 10
How to Measure Whether an LLM Output is Good
Track the best, refine from the best, stop on no improvement
patience = 3 — wait 3 iterations for a new best
Hometown Introduction — Generate, Score, Improve
evaluator.py — rubric, generation strategies, judge, loop
app.py — add Tab 8: Evaluation Lab
The Rubric (evaluator.py)
Five criteria, each with an explicit definition
# evaluator.py
RUBRIC = {
    "completeness": (
        "Covers multiple aspects: history, culture, food, geography, and people. "
        "Score 0 if only one aspect; 10 if all five are present."
    ),
    "specificity": (
        "Uses concrete details (place names, dishes, festivals) rather than generic claims. "
        "Score 0 for vague platitudes; 10 for vivid specifics."
    ),
    "structure": (
        "Has logical flow with a clear opening, body, and closing. "
        "Score 0 for disorganized text; 10 for well-paragraphed prose."
    ),
    "engagement": (
        "Reads as something someone would WANT to read, not a list of facts. "
        "Score 0 for dry encyclopedia tone; 10 for vivid storytelling."
    ),
    "accuracy_caution": (
        "Avoids confident claims that could be wrong (specific dates, statistics). "
        "Score 0 if it invents specific facts; 10 if it stays within safe knowledge."
    ),
}
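The same "\n".join(...) rendering of this rubric reappears in several functions below. If you want to avoid the repetition, a tiny helper works; a minimal sketch, where rubric_text is a hypothetical name not used by the code that follows:

def rubric_text(keys=None):
    """Render RUBRIC (or a subset of criteria) as a bulleted block for prompts."""
    keys = list(RUBRIC) if keys is None else keys
    return "\n".join(f"- {k}: {RUBRIC[k]}" for k in keys)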
One small llm_call helper, three different prompts
# evaluator.py — using google-genai SDK
def llm_call(client, model, prompt):
    """Single-turn call to a Gemini model."""
    resp = client.models.generate_content(
        model=model,
        contents=prompt,
    )
    return resp.text

def generate_naive(client, model, hometown):
    """Strategy 1: just ask, no rubric."""
    return llm_call(client, model, f"Write a 1-paragraph introduction to {hometown}.")

def generate_refine(client, model, hometown):
    """Strategy 2: generate first, then refine with the rubric."""
    draft = generate_naive(client, model, hometown)
    rubric_text = "\n".join(f"- {k}: {v}" for k, v in RUBRIC.items())
    prompt = f"""Original draft:
{draft}
Rewrite this to score better on these criteria:
{rubric_text}"""
    return llm_call(client, model, prompt)

def generate_aware(client, model, hometown):
    """Strategy 3: include the rubric in the prompt upfront."""
    rubric_text = "\n".join(f"- {k}: {v}" for k, v in RUBRIC.items())
    prompt = f"""Write a 1-paragraph introduction to {hometown}.
Your output will be scored on these criteria:
{rubric_text}
Address each criterion in your writing."""
    return llm_call(client, model, prompt)
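None of these calls handle transient API failures (rate limits, timeouts). A hedged sketch of a backoff wrapper you could drop in; llm_call_with_retry is a hypothetical helper, and the except clause is deliberately broad because exact SDK exception classes vary:

import time

def llm_call_with_retry(client, model, prompt, retries=3, base_delay=2.0):
    """Call llm_call, retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return llm_call(client, model, prompt)
        except Exception:  # narrow this to specific SDK errors if you know them
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))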
google-genai (native Gemini SDK), not OpenAI-compatible
client.models.generate_content(model=..., contents=...) — single string in, .text out
All three strategies share the llm_call helper
The Judge (evaluator.py)
Same rubric, structured JSON output
import json, re

def evaluate(client, judge_model, text):
    """Score text on each rubric criterion (0-10)."""
    rubric_text = "\n".join(f"- {k}: {v}" for k, v in RUBRIC.items())
    prompt = f"""Score the following text on each criterion (0-10 integer).
Text:
{text}
Criteria:
{rubric_text}
Return ONLY a JSON object like:
{{"completeness": 7, "specificity": 8, "structure": 6, "engagement": 7, "accuracy_caution": 9}}"""
    raw = llm_call(client, judge_model, prompt)
    # Extract JSON from the response (LLMs sometimes wrap it in code blocks)
    match = re.search(r"\{[^}]+\}", raw, re.DOTALL)
    if not match:
        return {k: 0 for k in RUBRIC}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {k: 0 for k in RUBRIC}
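The regex fallback keeps the judge from crashing, but it accepts whatever keys and values the model returns. Below is a sketch of a hardened variant that requests JSON output directly (via the SDK's response_mime_type config) and clamps scores into range; evaluate_strict is a hypothetical name, and the error handling is intentionally defensive:

from google.genai import types

def evaluate_strict(client, judge_model, text):
    """Judge with JSON output requested, then fill/clamp scores defensively."""
    rubric_text = "\n".join(f"- {k}: {v}" for k, v in RUBRIC.items())
    prompt = f"""Score the following text on each criterion (0-10 integer).
Text:
{text}
Criteria:
{rubric_text}
Return a JSON object with one integer score per criterion."""
    resp = client.models.generate_content(
        model=judge_model,
        contents=prompt,
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    try:
        raw = json.loads(resp.text)
    except (json.JSONDecodeError, TypeError):
        raw = {}
    scores = {}
    for k in RUBRIC:  # fill missing criteria with 0, clamp everything to [0, 10]
        try:
            v = int(raw.get(k, 0))
        except (TypeError, ValueError):
            v = 0
        scores[k] = max(0, min(10, v))
    return scores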
App Setup (app.py — top of file)
Standalone Streamlit app — client, model, imports
# app.py
import os

import altair as alt
import pandas as pd
import streamlit as st
from dotenv import load_dotenv
from google import genai

from evaluator import (
    RUBRIC, generate_naive, generate_aware, generate_refine, evaluate,
    iterative_threshold,  # iterative_self_evolving is added in a later step
)

load_dotenv()
client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))
model = os.environ.get("LLM_MODEL", "gemini-2.5-flash")
streamlit run app.py — fresh single-page UI
LLM_MODEL env var lets you swap the generator without editing code
Header, hometown input, judge selector, then run all three
st.header("📊 Evaluation Lab — Hometown Introductions")
hometown = st.text_input("Your hometown", "Daejeon, South Korea")
judge = st.selectbox(
    "Judge model",
    ["gemini-3-flash-preview", "gemini-2.0-flash"],
)

if st.button("🚀 Run All 3 Strategies", disabled=not hometown):
    rows = []
    for name, fn in [
        ("Naive", generate_naive),
        ("Refine", generate_refine),
        ("Aware", generate_aware),
    ]:
        with st.spinner(f"{name}..."):
            text = fn(client, model, hometown)
            scores = evaluate(client, judge, text)
        st.subheader(f"{name} Strategy")
        st.write(text)
        st.json(scores)
        for k, v in scores.items():
            rows.append({"strategy": name, "criterion": k, "score": v})
    df = pd.DataFrame(rows)
    chart = alt.Chart(df).mark_bar().encode(
        x="score:Q", y="criterion:N", color="strategy:N", row="strategy:N",
    )
    st.altair_chart(chart, use_container_width=True)
Separate models for generation (model) and judging (judge) — avoids self-bias
Threshold Loop (evaluator.py)
The simplest improvement loop — stop when good enough
def iterative_threshold(client, gen_model, judge_model, hometown,
                        max_iters=4, threshold=8):
    """Generate → score → refine weak criteria. Stop when all ≥ threshold."""
    text = generate_aware(client, gen_model, hometown)
    history = []
    for i in range(max_iters):
        scores = evaluate(client, judge_model, text)
        avg = sum(scores.values()) / len(scores)
        history.append({"iter": i, "text": text, "scores": scores, "avg": avg})
        weak = [k for k, v in scores.items() if v < threshold]
        if not weak:
            break  # all criteria at or above threshold → done
        weak_defs = "\n".join(f"- {k}: {RUBRIC[k]}" for k in weak)
        prompt = f"""Current text:
{text}
It scored low on these criteria:
{weak_defs}
Rewrite the text to specifically improve these aspects. Keep what works."""
        text = llm_call(client, gen_model, prompt)
    return history
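Before wiring the loop into Streamlit, it is worth a smoke test from a plain script. A minimal sketch, assuming GOOGLE_API_KEY is set and reusing the generator/judge model names from above (the file name quick_check.py is hypothetical):

# quick_check.py
import os
from google import genai
from evaluator import iterative_threshold

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
history = iterative_threshold(client, "gemini-2.5-flash", "gemini-2.0-flash",
                              "Daejeon, South Korea", max_iters=3, threshold=8)
for h in history:
    print(f"iter {h['iter']}: avg={h['avg']:.1f} scores={h['scores']}")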
Threshold Loop UI (app.py)
Run this section by itself — independent button
# Append below the Compare-Strategies block
st.divider()
st.subheader("🎯 Threshold Loop")
th_max_iters = st.slider("Max iterations", 1, 10, 4, key="th_iters")
th_threshold = st.slider("Threshold per criterion", 5, 10, 8, key="th_th")

if st.button("🔁 Run Threshold Loop"):
    with st.spinner("Iterating..."):
        history = iterative_threshold(
            client, model, judge, hometown,
            max_iters=th_max_iters, threshold=th_threshold,
        )
    # Score trajectory per criterion across iterations
    traj = pd.DataFrame([
        {"iter": h["iter"], "criterion": k, "score": v}
        for h in history for k, v in h["scores"].items()
    ])
    line = alt.Chart(traj).mark_line(point=True).encode(
        x="iter:O", y="score:Q", color="criterion:N",
    )
    st.altair_chart(line, use_container_width=True)
    final = history[-1]
    st.subheader(f"Final — iter {final['iter']}, avg {final['avg']:.1f}")
    st.write(final["text"])
    st.json(final["scores"])
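If you also want to inspect the intermediate drafts, not just the final one, an expander per iteration works. An optional addition, indented to sit at the end of the same if-block:

    # Optional: one expander per iteration so every draft stays inspectable
    for h in history:
        with st.expander(f"Iteration {h['iter']} — avg {h['avg']:.1f}"):
            st.write(h["text"])
            st.json(h["scores"])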
From "good enough" to "best possible"
Keep a running best record (best)
Refine from best, not from latest
Stop when there has been no new best for N rounds (patience)
Self-Evolving Loop (evaluator.py)
Track best, refine from best, patience-based stopping
def iterative_self_evolving(client, gen_model, judge_model, hometown,
                            max_iters=10, patience=3):
    """Track best, refine FROM best, stop when best hasn't improved in `patience` iters."""
    text = generate_aware(client, gen_model, hometown)
    history = []
    best = {"avg": -1, "text": "", "scores": {}, "iter": -1}
    no_improve = 0
    for i in range(max_iters):
        scores = evaluate(client, judge_model, text)
        avg = sum(scores.values()) / len(scores)
        history.append({"iter": i, "text": text, "scores": scores, "avg": avg})
        if avg > best["avg"]:
            best = {"avg": avg, "text": text, "scores": scores, "iter": i}
            no_improve = 0
        else:
            no_improve += 1
        if no_improve >= patience:
            break  # no new best for `patience` rounds → ceiling reached
        # KEY: refine FROM best["text"], not from the latest text
        min_score = min(best["scores"].values())
        weak = [k for k, v in best["scores"].items() if v <= min_score + 1]
        weak_defs = "\n".join(f"- {k}: {RUBRIC[k]}" for k in weak)
        prompt = f"""Current best text (avg {best['avg']:.1f}):
{best['text']}
These criteria are the weakest in the current best:
{weak_defs}
Rewrite to push these specific criteria higher. Preserve the strengths."""
        text = llm_call(client, gen_model, prompt)
    return history, best
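To see whether the ratchet actually buys anything over threshold stopping, you can run both loops head-to-head from a script. A sketch under the same assumptions as before (compare_loops.py is a hypothetical file):

# compare_loops.py
import os
from google import genai
from evaluator import iterative_threshold, iterative_self_evolving

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
town = "Daejeon, South Korea"

th_history = iterative_threshold(client, "gemini-2.5-flash", "gemini-2.0-flash", town)
se_history, se_best = iterative_self_evolving(client, "gemini-2.5-flash",
                                              "gemini-2.0-flash", town)
print("threshold loop, peak avg:    ", max(h["avg"] for h in th_history))
print("self-evolving loop, best avg:", se_best["avg"])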
Track the best dictionary separately from text
Refine from best['text'] — not from latest (which might have regressed)
Self-Evolving Loop UI (app.py)
Add iterative_self_evolving to the import line first
# At the top of app.py, update the import:
from evaluator import (
    RUBRIC, generate_naive, generate_aware, generate_refine, evaluate,
    iterative_threshold, iterative_self_evolving,
)

# Append below the Threshold Loop block
st.divider()
st.subheader("🚀 Self-Evolving Loop")
se_max_iters = st.slider("Max iterations", 3, 20, 10, key="se_iters")
se_patience = st.slider("Patience (no-improve count)", 1, 5, 3, key="se_pat")

if st.button("🔁 Run Self-Evolving Loop"):
    with st.spinner("Iterating..."):
        history, best = iterative_self_evolving(
            client, model, judge, hometown,
            max_iters=se_max_iters, patience=se_patience,
        )
    # Plot avg per iteration AND the running best (the ratchet)
    avg_df = pd.DataFrame([
        {"iter": h["iter"], "metric": "avg", "score": h["avg"]} for h in history
    ])
    running_best = []
    bs = -1
    for h in history:
        bs = max(bs, h["avg"])
        running_best.append({"iter": h["iter"], "metric": "best", "score": bs})
    line = alt.Chart(pd.concat([avg_df, pd.DataFrame(running_best)])).mark_line(
        point=True
    ).encode(x="iter:O", y="score:Q", color="metric:N")
    st.altair_chart(line, use_container_width=True)
    # Show the BEST output (not the latest)
    st.subheader(f"Best — iter {best['iter']}, avg {best['avg']:.1f}")
    st.write(best["text"])
    st.json(best["scores"])
When avg sets no new best for patience iters → the loop stops
Multi-Model Comparison
Same task, same judge, different generators
# Append below the Self-Evolving Loop block
st.divider()
st.subheader("🔬 Multi-Model Comparison")
candidate_models = st.multiselect(
    "Models to compare",
    ["gemini-2.5-flash", "gemini-2.5-pro", "gemini-2.0-flash", "gemini-3-flash-preview"],
    default=["gemini-2.5-flash", "gemini-2.0-flash"],
)

if st.button("⚖️ Compare Models", disabled=not candidate_models):
    rows = []
    for m in candidate_models:
        text = generate_aware(client, m, hometown)
        scores = evaluate(client, judge, text)
        for k, v in scores.items():
            rows.append({"model": m, "criterion": k, "score": v})
    df = pd.DataFrame(rows)
    chart = alt.Chart(df).mark_bar().encode(
        x="score:Q", y="criterion:N", color="model:N", row="model:N",
    )
    st.altair_chart(chart, use_container_width=True)
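The faceted chart shows per-criterion detail but no overall winner. If you want a single ranking, a mean over all criteria per model is the simplest aggregate; an optional snippet that could go at the end of the same if-block, reusing the df already built there:

    # Optional: overall ranking — mean score across all criteria per model
    summary = (
        df.groupby("model", as_index=False)["score"]
          .mean()
          .sort_values("score", ascending=False)
    )
    st.dataframe(summary)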
1. [ ] Define RUBRIC in evaluator.py with 5 criteria
2. [ ] Implement generate_naive, generate_refine, and generate_aware (all through the llm_call helper, google-genai SDK)
3. [ ] Implement evaluate() with robust JSON parsing
4. [ ] Set up app.py — .env with GOOGLE_API_KEY, genai.Client, imports
5. [ ] Build the Compare 3 Strategies UI — does Aware score higher than Naive?
6. [ ] Implement iterative_threshold() in evaluator.py
7. [ ] Append the Threshold Loop UI section to app.py (independent button)
8. [ ] Run with threshold=8 — does it stop early or hit max iterations?
9. [ ] Run with threshold=10 — does it now hit max iterations?
10. [ ] Implement iterative_self_evolving() in evaluator.py
11. [ ] Append the Self-Evolving Loop UI section to app.py
12. [ ] Run with patience=3 — observe the avg vs best lines
13. [ ] Compare the final avg with the threshold loop's — did self-evolving find a higher peak?
14. [ ] Run the multi-model comparison — which model wins per criterion?
15. [ ] Edit one criterion's definition to be vague — watch the scores become unstable
Ambiguity vs Clarity — Why Vague Instructions Break Agents
12 responses analyzed — a Hulk-Captain America coalition emerges
Margareth's framing — and why "design vs demand" is incomplete
Beyond all three personas
Yadanar's concrete examples make the abstract real
Jaewhoon's "feedback loop" hits the lecture's core idea
10 minutes in pairs
1. Today you saw three strategies (naive, refine, aware) plus two loop modes (threshold, self-evolving). Which combination gave the biggest improvement on your hometown task? Connect your finding to Jaewhoon's claim that "the feedback loop is key."
2. Margareth argued that more clarity reduces the system "to something closer to conventional programming." After writing a strict rubric today, did your final output feel more constrained — and was that good or bad? Where would you draw the line between flexibility and reliability for your own research?
LLM Evaluation
📚 Anthropic: Evaluating LLM Outputs
📚 OpenAI Evals — Open-source Eval Framework
📚 LangSmith — Evaluation and Tracing
Research Papers
Anthropic Free Online Courses
Next week: Multi-step planning — agents that decompose tasks themselves.