Function Calling

How AI Agents Take Action — Not Just Talk

Week 9 of Phase 3: Advanced Patterns (Weeks 9-12)

Contents

Lecture, Practice, and Discussion for Week 9

📖 1. Lecture
  • The limit of "text in, text out" LLMs
  • Function calling — letting AI invoke real code
  • The agentic loop: call → observe → decide → repeat
💻 2. Practice
  • Build a research agent with 3 tools: search, fetch, count
  • The agent decides which tool to call based on user question
🗣️ 3. Discussion
  • Week 7 review: who is accountable for a flawed AI hypothesis?
  • Activity: designing tool descriptions with built-in limits

Part 1: Lecture

Function Calling — Giving AI the Power to Act

The Story So Far — From Talking to Acting

📄 What We've Built So Far
  • Week 5: AI answers questions about papers
  • Week 6: AI extracts metadata into structured form
  • Week 7: AI debates topics with personas
  • All of these = AI produces text as output
⚠️ The Limitation
  • The LLM only outputs text — it can't actually DO anything
  • It can't search a database, call an API, read a file, run code
  • You had to write all the surrounding logic yourself
🚀 Today's Shift
  • Give the LLM access to real functions
  • Let it decide which function to call, with what arguments
  • Now the LLM can act, not just describe actions

The Core Problem — LLMs Can't Do Anything

What an LLM Can't Do
  • It can't check today's weather
  • It can't search your local files
  • It can't run a calculation reliably (23847 × 9281 = ?)
  • It can't query a database
  • It only generates text based on training data
💡 What If We Gave It Tools?
  • "Here's a get_weather(city) function"
  • "Here's a search_papers(query) function"
  • "Here's a calculate(expr) function"
  • The LLM still outputs text — but the text says "call this function with these arguments"
The Result
  • The LLM becomes an orchestrator — it picks the right tool
  • The actual work happens in your real code
  • Reliable, verifiable, and the LLM stays in its lane

What is Function Calling?

A simple but powerful API pattern

1. You define functions in JSON:
   { name: "search_papers", description: "...", parameters: {...} }

2. You send the user message + function definitions to the LLM

3. The LLM responds with EITHER:
   (a) A normal text answer, OR
   (b) "I want to call search_papers with query='deep learning'"

4. If (b), YOUR code runs the real function and gets the result

5. You send the result back to the LLM → it forms the final answer
"The LLM never actually runs code. It just tells YOU which function to run, and YOU run it."

The Flow — Visualized

sequenceDiagram
    participant U as 👤 User
    participant A as 💻 Your App
    participant L as 🤖 LLM
    participant F as 🔧 Real Function
    U->>A: "Find recent GNN papers"
    A->>L: message + tool definitions
    L-->>A: tool_call: search_papers(query="GNN")
    A->>F: search_papers("GNN")
    F-->>A: [paper1, paper2, paper3]
    A->>L: tool_result: [...]
    L-->>A: "I found 3 papers: ..."
    A-->>U: Final answer

A Tool Definition — JSON Schema

This is what you send to the LLM

tool = {
    "type": "function",
    "function": {
        "name": "search_papers",
        "description": "Search the local paper collection by keyword.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Keyword or topic to search for"
                },
                "max_results": {
                    "type": "integer",
                    "description": "How many papers to return (default 5)"
                }
            },
            "required": ["query"]
        }
    }
}
💡 Three Things the LLM Sees
  • name — identifier the LLM uses to "call" the function
  • description — when should the LLM use this? (CRITICAL)
  • parameters — what arguments are needed (JSON schema)
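
To use the definition, you pass it in the tools list of a chat call. A minimal sketch, assuming an OpenAI-style client (the model name is a placeholder):

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any tool-capable chat model
    messages=[{"role": "user", "content": "Find papers about GNNs"}],
    tools=[tool],  # the definition above
)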

The Description is Where Specifying Lives

This is the new prompt engineering

Vague Description
  • "description": "search papers"
  • LLM has no idea when to call this vs other tools
  • It might call it for "what is deep learning?" (wrong — that's general knowledge)
Clear Description
  • "description": "Search the user's local paper collection (extracted via Week 6 metadata). Use this when the user asks about specific papers or trends in THEIR collection. Do NOT use for general knowledge questions."
  • LLM now knows exactly when to call it
  • The description is your contract with the LLM
"Tool description = clearer specification. The skill from Week 8's midterm reflection — directly applied here."

The Agentic Loop — Multiple Tool Calls

One question may need several tools

🔄 The Pattern
  • User asks a question
  • LLM thinks → calls Tool A
  • Sees Tool A's result → decides to call Tool B
  • Sees Tool B's result → maybe calls Tool C
  • Eventually: enough info → produces final answer
📝 Concrete Example
  • User: "Summarize the top 3 GNN papers in my collection"
  • LLM call 1: search_papers(query="GNN", max_results=3) → gets 3 papers
  • LLM call 2: get_paper_details(id=1) → gets full text of paper 1
  • LLM call 3: get_paper_details(id=2) → gets full text of paper 2
  • LLM call 4: get_paper_details(id=3) → gets full text of paper 3
  • LLM final: synthesizes summary across all 3
⚠️ Important
  • Your code loops: as long as the LLM keeps requesting tools, run them
  • Set a max iteration count to prevent infinite loops
  • Each step is observable — you see exactly what was called

The Agentic Loop in Code

This is the heart of every agent system

import json

MAX_STEPS = 10  # safety cap against runaway loops

def answer_with_tools(client, model, user_question, tool_definitions):
    messages = [{"role": "user", "content": user_question}]

    for _ in range(MAX_STEPS):
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tool_definitions,
        )
        msg = resp.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:
            # LLM gave a final answer — done
            return msg.content

        # LLM wants to call tools — run them (run_tool: see the sketch below)
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)  # arguments arrive as a JSON string
            result = run_tool(tc.function.name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": str(result),
            })

    return "Stopped — too many steps."
💡 Read This Loop Carefully
  • It alternates: LLM call → run tools → LLM call → run tools → ...
  • Exits when LLM stops requesting tools (gives final answer)
  • MAX_STEPS prevents runaway loops
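
The loop calls a run_tool helper that the slide leaves undefined. A minimal sketch, assuming a TOOL_FUNCS dict that maps tool names to Python callables (the practice section builds exactly such a dict):

import json

def run_tool(name, args):
    """Dispatch the LLM's request to a real Python function."""
    func = TOOL_FUNCS.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return func(**args)
    except Exception as e:
        # hand errors back as readable text so the LLM can recover
        return json.dumps({"error": str(e)})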

When to Use Function Calling

Good Fit
  • The task needs real-world data (DB, API, files, calculations)
  • The task involves multiple specialized steps (search → fetch → summarize)
  • You need deterministic parts mixed with LLM reasoning
  • You want the LLM's actions to be inspectable
Not a Good Fit
  • Pure text transformation (translate, summarize, rewrite) → just prompt
  • Single-shot Q&A from training data → just prompt
  • Tasks where you don't trust the LLM to choose tools wisely → fixed pipeline
🤔 Function Calling vs Fixed Pipeline
  • Fixed pipeline (Week 7-style): you decide the order, the LLM fills in each step
  • Function calling: LLM decides the order itself
  • Function calling = more flexible, but harder to predict

Common Pitfalls — What Goes Wrong

1️⃣ Vague Tool Descriptions
  • LLM picks the wrong tool, or no tool when one was needed
  • Fix: be explicit about WHEN to use each tool, with examples
2️⃣ Too Many Tools
  • 20 tools → LLM gets confused, picks randomly
  • Fix: keep it under ~7 tools, group related ones
3️⃣ Infinite Loops
  • LLM keeps calling tools forever
  • Fix: set MAX_STEPS = 10 and bail out gracefully
4️⃣ No Validation
  • LLM passes garbage arguments — your function crashes
  • Fix: validate arguments inside each tool function, return error messages the LLM can read
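
A sketch of the fix for pitfall 4: check arguments at the top of the tool and return a readable error instead of raising (an illustrative variant; the practice section's get_paper_details applies the same range check):

import json

def get_paper_details(paper_id, papers):
    # validate before doing any work: garbage in -> readable error out
    if not isinstance(paper_id, int):
        return json.dumps({"error": f"paper_id must be an integer, got {type(paper_id).__name__}"})
    if not (0 <= paper_id < len(papers)):
        return json.dumps({"error": f"paper_id {paper_id} is out of range (0-{len(papers) - 1})"})
    return json.dumps(papers[paper_id])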

Lecture Summary — Function Calling

🎯 The Core Idea
  • LLM outputs structured JSON saying "call this function with these arguments"
  • YOUR code runs the function and returns the result
  • LLM sees the result and decides what to do next
🔄 The Agentic Loop
  • Loop until LLM stops requesting tools
  • Each tool call is observable — full transparency
  • Set MAX_STEPS to prevent runaway behavior
📝 The Skill That Matters
  • Tool descriptions = where you specify WHEN to use each tool
  • Vague descriptions → LLM picks the wrong tool
  • Clear descriptions → LLM behaves predictably
  • This is the specifying skill, applied to tool design

Part 2: Practice

Build a Research Agent with 3 Tools

Practice Overview — What We'll Build

🎯 The Goal
  • A chat interface where the user asks research questions
  • The agent has 3 tools for accessing the paper collection
  • The agent decides which tools to call, in what order
  • The chat shows every tool call so users can see what's happening
🔧 The Three Tools
  • search_papers(query, max_results) — keyword search in metadata
  • get_paper_details(paper_id) — full info for one paper
  • count_papers_by_year(year) — simple stats query
📁 Files
  • tools.py — the 3 tool functions + their JSON schemas
  • agent.py — the agentic loop
  • app.py — add Tab 7 (Agent Chat)

Step 1 — Tool Functions (tools.py)

Three real Python functions the agent can call

# tools.py
import json
from pdf_to_md import load_all_metadata
MD_DIR = "md_output"

def search_papers(query: str, max_results: int = 5) -> str:
    """Keyword search in titles and abstracts."""
    papers = load_all_metadata(MD_DIR)
    q = query.lower()
    hits = []
    for i, p in enumerate(papers):
        text = (p.get("title", "") + " " + p.get("abstract", "")).lower()
        if q in text:
            hits.append({"id": i, "title": p.get("title", "?")})
        if len(hits) >= max_results:
            break
    return json.dumps(hits)

def get_paper_details(paper_id: int) -> str:
    """Full metadata for a single paper by id."""
    papers = load_all_metadata(MD_DIR)
    if 0 <= paper_id < len(papers):
        return json.dumps(papers[paper_id])
    return json.dumps({"error": f"paper_id {paper_id} not found"})

def count_papers_by_year(year: int) -> str:
    """Count papers published in a given year."""
    papers = load_all_metadata(MD_DIR)
    count = sum(1 for p in papers if str(p.get("year", "")) == str(year))
    return json.dumps({"year": year, "count": count})

Step 2 — Tool Schemas (tools.py)

How the LLM sees each tool

TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "search_papers",
            "description": (
                "Search the user's local paper collection by keyword. "
                "Use when the user asks about a specific topic in THEIR papers."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search keyword"},
                    "max_results": {"type": "integer", "description": "Default 5"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_paper_details",
            "description": "Get full metadata for one paper by its id (from search).",
            "parameters": {
                "type": "object",
                "properties": {"paper_id": {"type": "integer"}},
                "required": ["paper_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "count_papers_by_year",
            "description": "Count how many papers in the collection were published in a given year.",
            "parameters": {
                "type": "object",
                "properties": {"year": {"type": "integer"}},
                "required": ["year"]
            }
        }
    }
]

TOOL_FUNCS = {
    "search_papers": search_papers,
    "get_paper_details": get_paper_details,
    "count_papers_by_year": count_papers_by_year,
}
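
Before wiring these into the agent loop, it helps to sanity-check each tool directly; they are plain Python functions (the output depends on your md_output collection):

# quick manual test: tools are just functions, so test them like functions
if __name__ == "__main__":
    print(search_papers("GNN", max_results=3))
    print(get_paper_details(0))
    print(count_papers_by_year(2024))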

Step 3 — Agent Loop (agent.py)

The heart of any function-calling agent

# agent.py
import json
from tools import TOOL_SCHEMAS, TOOL_FUNCS

MAX_STEPS = 8

def run_agent(client, model, user_question, on_step=None):
    """Run the agentic loop. on_step(event) is called for each tool call."""
    messages = [
        {"role": "system", "content":
         "You are a research assistant with access to the user's paper collection. "
         "Use tools to answer questions about their papers. "
         "When unsure which paper, use search_papers first."},
        {"role": "user", "content": user_question},
    ]

    for step in range(MAX_STEPS):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=TOOL_SCHEMAS,
        )
        msg = resp.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:
            return msg.content  # final answer

        for tc in msg.tool_calls:
            name = tc.function.name
            args = json.loads(tc.function.arguments)
            if on_step:
                on_step({"type": "call", "name": name, "args": args})
            try:
                result = TOOL_FUNCS[name](**args)
            except Exception as e:
                result = json.dumps({"error": str(e)})
            if on_step:
                on_step({"type": "result", "name": name, "result": result})
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result,
            })

    return "Stopped — too many steps."

Step 4 — Streamlit UI (app.py Tab 7)

Show every tool call in real time

# app.py — Tab 7 (Agent Chat)
from agent import run_agent

with tab7:
    st.header("🔧 Research Agent (Function Calling)")
    question = st.text_input(
        "Ask about your papers",
        placeholder="e.g., How many GNN papers do I have from 2024?"
    )

    if st.button("▶️ Run Agent", disabled=not question):
        events = []

        def on_step(event):
            events.append(event)

        with st.spinner("Agent thinking..."):
            answer = run_agent(client, model, question, on_step=on_step)

        # Show every tool call (transparency)
        st.subheader("🔍 Tool Calls")
        for e in events:
            if e["type"] == "call":
                st.markdown(f"📞 **{e['name']}**(`{e['args']}`)")
            else:
                st.markdown(f"📦 result: `{e['result'][:200]}...`")

        st.subheader("💬 Final Answer")
        st.markdown(answer)

Step 5 — Try These Questions

See the agent pick different tools

Question 1: "How many papers from 2024 do I have?"
   → Agent calls count_papers_by_year(2024) → answers

Question 2: "Find papers about graph neural networks"
   → Agent calls search_papers("graph neural networks") → lists them

Question 3: "Tell me about the first GNN paper in detail"
   → Agent calls search_papers("GNN") → then get_paper_details(id=X)
   → synthesizes a summary

Question 4: "Compare 2023 vs 2024 paper counts"
   → Agent calls count_papers_by_year(2023)
   → then count_papers_by_year(2024)
   → presents both numbers
💡 What to Watch For
  • Does the agent pick the right tool?
  • Does it call multiple tools when needed?
  • Does it sometimes fail to use a tool when it should?

Week 9 Practice Checklist

Complete these steps during the practice session:

- [ ] Create tools.py with the 3 functions and their schemas
- [ ] Create agent.py with the run_agent() loop
- [ ] Add Tab 7 to app.py showing tool calls in real time
- [ ] Test with the 4 example questions — does the agent pick the right tools?
- [ ] Try a question the agent should NOT use a tool for (e.g., "What is deep learning?") — does it answer directly?
- [ ] Bonus: edit one tool's description to be vague and see how behavior changes
- [ ] Bonus: add a 4th tool of your choice (e.g., list_authors())

Part 3: Discussion

Authorship & Accountability — Who Owns the AI's Output?

Week 7 — "Who is Responsible for a Flawed AI Hypothesis?"

14 responses analyzed — a near-universal consensus, with one twist

📊 The Vote Count
  • All three (1+2+3 combined): Huy, Waad, Nazhiefah, DongYun, Minh — 5 votes
  • Hulk only (3): Namcheol, Gyeongsu, Han, Hyunwoo — 4 votes
  • Captain America only (2): Yadanar, Ly, Seher — 3 votes
  • Captain + Hulk (2,3): Irfan — 1 vote
  • Iron Man only (1): Tan — 1 vote
🤝 The Universal Consensus
  • Almost everyone says: the human is fully accountable, period
  • "AI cannot bear moral or scientific accountability" appears in nearly every response
  • Even Iron Man supporters don't excuse the human — they just emphasize speed
🔥 The Real Disagreement
  • The split is NOT about who is responsible
  • The split is about how to manage AI to fulfill that responsibility
  • This is a much more productive disagreement

Theme 1 — The Real Disagreement is HOW, Not WHO

Three management philosophies emerge

🚀 The Velocity Camp (Iron Man side)
  • Tan: "Define problem, select data, choose model — the researcher is in control at every step"
  • Huy: "AI provides the velocity, but the human provides the vector"
  • Position: embrace speed, but never abdicate decisions
🛡️ The Integrity Camp (Captain America side)
  • Yadanar: "Relying on AI without careful judgment weakens scientific integrity"
  • Ly: "If results are wrong, we don't blame the software — we check our model"
  • Seher: "Use multiple AI systems to cross-check, but final judgment must be human"
  • Position: keep deep engagement to maintain accountability
🔬 The Verification Camp (Hulk side)
  • Hyunwoo: "A flawed hypothesis can result in severe hardware damage"
  • Gyeongsu: "AI hallucinated fake research papers in my coding work"
  • Han: "Researchers must never blindly trust AI; must personally verify every result"
  • Position: build active verification systems around the AI
"The real question isn't 'is the AI responsible?' (no), but 'what does taking responsibility actually look like in practice?'"

Theme 2 — The Tool Metaphors

How students framed the AI's role

🔨 "AI is a Tool" — Most Common Frame
  • DongYun: "AI is merely a tool, like a hammer"
  • Tan: "AI is fundamentally a tool without intent or accountability"
  • Minh: "We do not credit the power drill for the architecture"
  • The hammer/drill metaphor is intuitive but limited — tools don't suggest what to build
⚛️ "AI is a High-Power Instrument" — Namcheol's Reframe
  • "A nuclear engineer doesn't blame the reactor if a control rod is miscalibrated"
  • But the engineer DOES need active containment, monitoring, documentation
  • This metaphor better captures the operational risk dimension
🧭 "Velocity vs Vector" — Huy's New Frame
  • AI provides the velocity (speed of generating ideas)
  • Human provides the vector (direction and validation)
  • Captures both the productivity gain AND the irreplaceable human role

Theme 3 — Real Engineering Stakes

Why this isn't an abstract debate

⚠️ Concrete Failures Students Have Seen or Worry About
  • Gyeongsu: AI fabricated fake paper citations — would have been "deeply embarrassing" if used in a presentation
  • Hyunwoo: "A flawed hypothesis regarding system dynamics can result in severe hardware damage"
  • Minh: Motor winding design — "AI-generated hypothesis is merely a high-probability suggestion that must undergo empirical validation"
  • Waad: "Dangerous trend where the ease of a single button press erodes critical thinking in younger generations"
🔍 The Pattern
  • These aren't theoretical concerns — students are encountering them in their own work
  • The verification step isn't a luxury; it's how you avoid catastrophic failure
  • Engineering domains (robotics, motors, hardware) have physical consequences for AI errors
"AI changes how ideas are generated. It does not change who is responsible for them." — Huy

How This Connects to Today's Practice

Function calling makes accountability MECHANICAL, not just philosophical

🔧 Tool Calls = Accountability Made Visible
  • Today's agent shows EVERY tool call in the UI
  • You see what the agent searched for, what it retrieved, what it computed
  • This is operational accountability — not just claimed, but observable
🎯 Tool Descriptions = Where You Specify Limits
  • Want to constrain the AI? Write narrow tool descriptions
  • "Use ONLY when..." in the description prevents misuse
  • This is the verification camp's discipline, baked into the system design
🔁 The Velocity-Vector Pattern in Code
  • Tools provide the velocity (search 1000 papers in seconds)
  • Tool definitions and the loop's exit conditions provide the vector
  • You designed the rails the agent runs on — that's where your accountability lives

Activity — Design a Tool with Built-In Accountability

Apply the lessons in pairs (10 min)

📋 The Task
  • Pick any tool from your own research workflow you'd want to automate
  • Write its description field with explicit limits built in
  • Show your description to a partner — can they tell when the LLM should/shouldn't use it?
✏️ Required Elements in Your Description
  • What the tool does (1 sentence)
  • When to use it (specific scenarios)
  • When NOT to use it (boundaries)
  • What outputs to verify (the human-in-the-loop checkpoint)
Example
  • "Generate a hypothesis from input data. Use ONLY for preliminary brainstorming on small datasets (<1000 rows). Do NOT use for high-stakes claims. OUTPUT MUST BE VALIDATED against at least one independent baseline before being included in any report."

🗣️ Week 9 Discussion Questions (UST LMS)

Visit: UST LMS → Class → Discussion

1. Several classmates (Huy, Namcheol, Minh) reframed the AI accountability debate using metaphors — "velocity vs vector", "high-power instrument", "power drill". Which metaphor best captures how YOU work with AI in your research? Or propose a better one.

2. In today's practice, you saw every tool call the agent made (transparency). Does this kind of visibility change your sense of responsibility for the AI's output? Would you trust an AI more or less if you couldn't see its tool calls?

Want to Learn More?

Function Calling Documentation

📚 OpenAI: Function Calling Guide
📚 Anthropic: Tool Use Overview
📚 Anthropic: Building Effective Agents

Agent Frameworks (built on function calling)

📚 LangGraph — Stateful Agent Workflows
📚 OpenAI Agents SDK

Anthropic Free Online Courses

🎓 [Building with the Claude API](https://anthropic.skilljar.com/claude-with-the-anthropic-api)
🎓 [Introduction to Model Context Protocol](https://anthropic.skilljar.com/introduction-to-model-context-protocol)
🎓 [Introduction to Agent Skills](https://anthropic.skilljar.com/introduction-to-agent-skills)

Wrap-Up of Week 9

📖 Lecture
  • Function calling = LLM outputs JSON describing which function to call; YOUR code runs it; result fed back; loop until done
💻 Practice
  • Built a research agent with 3 tools (search, fetch, count); the agent picks tools based on the question; every call is visible to the user
🗣️ Discussion
  • Week 7 review: universal consensus that humans are accountable; the real disagreement is HOW (velocity / integrity / verification camps); Huy's "velocity vs vector" frame connects accountability to today's tool design

Next week: What happens when tool calls fail — error handling and recovery in agent loops.
