Episode 4 — Generative AI Engineering / 4.7 — Function Calling / Tool Calling
Interview Questions: Function Calling / Tool Calling
Model answers for tool calling fundamentals, when to use tool calling, deterministic invocation, hybrid logic, and building tool routers.
How to use this material (instructions)
- Read lessons in order --- README.md, then 4.7.a through 4.7.e.
- Practice out loud --- definition, example, pitfall.
- Pair with exercises --- 4.7-Exercise-Questions.md.
- Quick review --- 4.7-Quick-Revision.md.
Beginner (Q1--Q4)
Q1. What is tool calling (function calling) and how does it work?
Why interviewers ask: Tests foundational understanding of the mechanism --- many candidates confuse "the model calls a function" with what actually happens.
Model answer:
Tool calling (also called function calling) is a feature built into modern LLM APIs that lets the model decide which function to invoke and what arguments to pass, returning a structured instruction that your code executes. The model never runs the function itself --- it acts as a smart router that maps natural language to structured function calls.
The flow has six steps:
- Define tools as JSON schemas (name, description, parameters).
- Send the tool definitions alongside the user message to the LLM API.
- Model decides --- it either returns a text response (finish_reason: "stop") or a tool_calls object (finish_reason: "tool_calls") containing the function name and arguments.
- Execute the function in your code using the name and parsed arguments.
- Return the result to the model via a tool role message with the matching tool_call_id.
- Model generates a final natural-language response using the tool result.
The critical insight is the separation of concerns: the AI handles intent classification and argument extraction (probabilistic reasoning), while your code handles execution, validation, and business rules (deterministic logic).
// The model returns this --- it does NOT execute the function
// response.choices[0].message.tool_calls:
[
{
id: 'call_abc123',
type: 'function',
function: {
name: 'improveBio',
arguments: '{"currentBio":"I like hiking","tone":"witty"}'
}
}
]
// YOUR code executes the function:
const args = JSON.parse(toolCall.function.arguments);
const result = improveBio(args);
Q2. How does tool calling differ from prompting the model to return JSON with a function name?
Why interviewers ask: Distinguishes candidates who understand the API-level feature from those who think it is just a prompting trick.
Model answer:
Prompting for JSON (e.g., "Return {function, args} as JSON") and using the tools API parameter are fundamentally different in reliability, enforcement, and developer experience.
| Aspect | Prompt for JSON | Tool Calling (API feature) |
|---|---|---|
| Reliability | Model may wrap JSON in text, use wrong field names, or forget fields | API guarantees a structured tool_calls array with valid function names |
| Schema enforcement | You describe the schema in natural language and hope | JSON Schema is enforced; model is constrained to your parameter definitions |
| Training | Model not fine-tuned for your custom format | Model is specifically fine-tuned for the tool calling response format |
| Finish reason | Always "stop" --- you must parse text to detect a function call | finish_reason: "tool_calls" tells you explicitly |
| Multi-tool | Awkward to describe multiple functions in prose | Native support for arrays of tool definitions |
| Parallel calls | Very difficult to get right via prompting | Built-in support for multiple simultaneous tool calls |
| Conversation flow | You manually fake multi-turn with role messages | Native tool role messages for returning results |
// BAD: Prompting for JSON (fragile)
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: 'Return JSON: {"function":"...", "args":{...}}' },
{ role: 'user', content: 'Improve my bio: "I like dogs"' },
],
});
// Might return: Sure! {"function": "improveBio", ...} <-- wrapped in text
// Might return: {"func": "improveBio", ...} <-- wrong field name
// GOOD: Using the tools parameter (reliable)
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: 'You are a dating app assistant.' },
{ role: 'user', content: 'Improve my bio: "I like dogs"' },
],
tools, // JSON Schema definitions
tool_choice: 'auto',
});
// response.choices[0].message.tool_calls is ALWAYS structured
Q3. What is the tool_choice parameter and what are its possible values?
Why interviewers ask: Tests whether the candidate knows how to control routing behavior beyond the default.
Model answer:
The tool_choice parameter controls whether and how the model selects a tool. It has four possible values:
| Value | Behavior | When to use |
|---|---|---|
| 'auto' | Model decides whether to call a tool or respond with text | Default --- most common for general-purpose assistants |
| 'none' | Model will NOT call any tool, even if one matches | When you want a text-only response despite tools being present |
| 'required' | Model MUST call at least one tool | When you know a tool is needed (e.g., form submission) |
| { type: 'function', function: { name: 'improveBio' } } | Model MUST call this specific tool | When you know exactly which tool to use (e.g., user clicked "Improve Bio" button) |
Choosing the right tool_choice matters for cost and reliability. With 'auto', the model spends tokens deciding whether to call a tool. With 'required' or a specific function, you skip that decision and reduce latency. In production, I use 'auto' for open-ended chat and specific function forcing when the UI context makes the intent unambiguous.
// User clicked the "Improve My Bio" button --- force the tool
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
tools,
tool_choice: { type: 'function', function: { name: 'improveBio' } },
});
Q4. Why is the arguments field in a tool call a string and not a parsed object?
Why interviewers ask: A subtle but important detail --- candidates who have actually built with tool calling will know this immediately.
Model answer:
The arguments field in toolCall.function.arguments is always a JSON string, not a JavaScript object. This is because the LLM generates text token by token --- what you receive is the raw serialized output. The API does not parse it for you.
This means you must always call JSON.parse() on the arguments before using them. In rare cases (especially with smaller or less capable models), the JSON string can be malformed --- a missing closing brace, a trailing comma, or an unescaped quote. Production code must handle this gracefully:
function safeParseArguments(argumentsString) {
try {
return { success: true, data: JSON.parse(argumentsString) };
} catch (error) {
return {
success: false,
error: `Failed to parse arguments: ${error.message}`,
raw: argumentsString,
};
}
}
const toolCall = message.tool_calls[0];
const parsed = safeParseArguments(toolCall.function.arguments);
if (!parsed.success) {
// Return the error as a tool result so the model can explain gracefully
return {
role: 'tool',
tool_call_id: toolCall.id,
content: JSON.stringify({ error: parsed.error }),
};
}
// Safe to use parsed.data
const result = functionMap[toolCall.function.name](parsed.data);
The key principle: never call JSON.parse() without a try/catch in a tool calling pipeline. Return parse errors as tool results rather than crashing --- this lets the model explain the problem naturally to the user.
Intermediate (Q5--Q8)
Q5. Walk me through the complete message array for a tool calling round-trip. What roles are involved?
Why interviewers ask: Tests deep understanding of the conversation structure --- the message array is where most implementation bugs live.
Model answer:
A tool calling round-trip involves two API calls and four distinct message roles. Here is the exact sequence for a user asking to moderate a message:
First API call --- routing:
const messages = [
{ role: 'system', content: 'You are a dating app assistant.' },
{ role: 'user', content: 'Is "Hey call me at 555-1234" safe to send?' },
];
// + tools array + tool_choice: 'auto'
Model's response (you do not send this --- it IS the API's response to the first call):
// response.choices[0].message:
{
role: 'assistant',
content: null, // null when making tool calls
tool_calls: [{
id: 'call_xyz789',
type: 'function',
function: {
name: 'moderateText',
arguments: '{"text":"Hey call me at 555-1234"}'
}
}]
}
Second API call --- final response (includes the assistant message + tool result):
const followUpMessages = [
{ role: 'system', content: 'You are a dating app assistant.' },
{ role: 'user', content: 'Is "Hey call me at 555-1234" safe to send?' },
{
role: 'assistant',
content: null,
tool_calls: [{ id: 'call_xyz789', type: 'function',
function: { name: 'moderateText', arguments: '{"text":"Hey call me at 555-1234"}' }
}],
},
{
role: 'tool', // Special role for tool results
tool_call_id: 'call_xyz789', // MUST match the tool call's id
content: '{"safe":false,"issues":["Phone number detected"]}', // Always a string
},
];
Three rules that cause bugs when violated: (1) the tool_call_id in the tool message must match the id in the assistant's tool call, (2) the assistant message with tool_calls must appear before the tool result, and (3) the tool result content must be a string (use JSON.stringify() on objects).
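These three invariants can be checked mechanically before making the follow-up call. A minimal sketch --- the helper name validateToolMessages is my own, not part of any SDK:

```javascript
// Hypothetical helper: checks the three message-array invariants before
// the follow-up API call. Returns an array of error strings (empty = valid).
function validateToolMessages(messages) {
  const errors = [];
  messages.forEach((msg, i) => {
    if (msg.role !== 'tool') return;
    // Rule 3: tool result content must be a string
    if (typeof msg.content !== 'string') {
      errors.push(`messages[${i}]: tool content must be a string`);
    }
    // Rules 1 & 2: a PRECEDING assistant message must carry a matching tool_call id
    const hasMatch = messages
      .slice(0, i)
      .some((m) => m.role === 'assistant' &&
        (m.tool_calls ?? []).some((tc) => tc.id === msg.tool_call_id));
    if (!hasMatch) {
      errors.push(`messages[${i}]: no preceding assistant tool_call with id ${msg.tool_call_id}`);
    }
  });
  return errors;
}
```

Running this in development catches the mismatched-id and ordering bugs locally instead of as opaque 400 errors from the API.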
Q6. When should you use tool calling vs structured output (JSON mode) vs plain text generation?
Why interviewers ask: Tests decision-making ability --- knowing WHEN to use a feature is more valuable than knowing HOW.
Model answer:
The decision comes down to one question: does the task require executing something outside of generating text?
Plain text generation --- use when the LLM's natural language output IS the desired result. Writing a poem, explaining a concept, translating text. No tool needed.
Structured output (JSON mode) --- use when you need the LLM's judgment in a structured format, but no function needs to execute. Sentiment classification, entity extraction, content categorization. The LLM's reasoning IS the result; you just need it in a parseable shape.
Tool calling --- use when the user's request requires a deterministic action: querying a database, calling an external API, performing precise math, enforcing business rules, or producing side effects (logging, billing, sending notifications).
| Scenario | Approach | Why |
|---|---|---|
| "Write me a poem about autumn" | Plain text | Text IS the output |
| "Classify this email as spam or not" | Structured output | LLM judgment is the result |
| "What is my account balance?" | Tool calling | Requires database query |
| "Schedule a meeting for 2pm tomorrow" | Tool calling | Requires creating a calendar event |
| "Calculate 15% tip on $47.83 split 3 ways" | Tool calling | LLMs are unreliable at math |
| "Improve my dating bio" | Tool calling (hybrid zone) | Needs business rules, character limits, banned word filters |
The hybrid zone is where decisions get interesting. improveBio could be pure prompting for simple cases, but becomes a tool when you need deterministic character limits, banned word filtering, premium tier checks, analytics logging, or A/B testing --- things the AI cannot reliably enforce.
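For contrast with tool calling, a structured-output request puts the schema in response_format rather than in a tools array --- nothing executes, the model's judgment simply comes back in a parseable shape. A sketch of the request payload for the spam-classification row above (field names follow the OpenAI Chat Completions API; the schema itself is illustrative):

```javascript
// Sketch: structured output (JSON mode with a schema). No tools, no execution ---
// the model's classification IS the result. Schema contents are illustrative.
const classifyRequest = {
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'Classify the email as spam or not spam.' },
    { role: 'user', content: 'You have WON a FREE cruise! Click now!' },
  ],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'spam_classification',
      schema: {
        type: 'object',
        properties: {
          isSpam: { type: 'boolean' },
          confidence: { type: 'number' },
        },
        required: ['isSpam', 'confidence'],
        additionalProperties: false,
      },
    },
  },
};
```

Note what is absent: no tools array, no tool_choice, no second API call --- the single response carries the structured judgment.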
Q7. Explain the hybrid logic pattern. Why not let the AI handle both routing and execution?
Why interviewers ask: Tests architectural judgment --- this is the core design principle of production AI systems.
Model answer:
The hybrid pattern divides responsibility: the AI decides WHAT to do (intent classification, argument extraction) and deterministic code decides HOW to do it (validation, business rules, database queries, side effects). Neither side handles everything alone.
Letting the AI handle everything fails in production because LLM instructions are probabilistic. If you tell the model "keep the bio under 500 characters," it will overshoot sometimes --- it cannot count precisely. If you say "filter banned words," it will miss some. If you say "only premium users get 5 calls per day," the model has no way to check the database.
// The hybrid approach for improveBio:
async function improveBio({ currentBio, tone = 'witty' }) {
// DETERMINISTIC: Input validation
if (currentBio.length > 2000) {
return { error: 'Bio too long. Maximum 2000 characters.' };
}
// DETERMINISTIC: Banned word filter (regex, not AI guessing)
const bannedWords = ['explicit-word'];
if (bannedWords.some((w) => currentBio.toLowerCase().includes(w))) {
return { error: 'Please remove inappropriate language.' };
}
// DETERMINISTIC: Check premium status from database
const user = await db.getUser(userId);
if (!user.isPremium && user.dailyUsage >= 3) {
return { error: 'Free users: 3 improvements per day. Upgrade for unlimited!' };
}
// AI-ASSISTED: Creative content generation with a specialized prompt
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: `Rewrite this bio in a ${tone} tone. Max 500 chars.` },
{ role: 'user', content: currentBio },
],
});
let improved = response.choices[0].message.content;
// DETERMINISTIC: Enforce character limit (LLM might exceed it)
improved = improved.slice(0, 500);
// DETERMINISTIC: Log for analytics
analytics.track('bio_improved', { userId, tone });
return { improved, tone, characterCount: improved.length };
}
There are three hybrid sub-patterns: (1) AI routes, code executes entirely --- the function is pure business logic like moderateText() using regex; (2) AI routes, code orchestrates AI --- the function uses another LLM call with a specialized prompt, wrapped in deterministic guardrails; (3) AI routes, code chains steps --- the function runs a multi-step pipeline combining validation, AI generation, database writes, and scoring.
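Sub-pattern (1) can be as small as a few regexes. A sketch of the moderateText() handler the answer references --- the specific rules below are illustrative, not an exhaustive moderation policy:

```javascript
// Sub-pattern 1: the AI routes to this function, but the function itself
// is pure deterministic logic. Patterns are illustrative examples only.
function moderateText({ text }) {
  const issues = [];
  // Phone numbers: 3 digits, optional separator, 4 digits
  if (/\b\d{3}[-.\s]?\d{4}\b/.test(text)) issues.push('Phone number detected');
  // Email addresses
  if (/\b[\w.+-]+@[\w-]+\.[\w.]+\b/.test(text)) issues.push('Email address detected');
  // External links
  if (/https?:\/\/\S+/i.test(text)) issues.push('Link detected');
  return { safe: issues.length === 0, issues };
}
```

Because the checks are regexes, the same input always yields the same verdict --- exactly the guarantee the AI alone cannot make.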
Q8. How do you handle parallel tool calls? What if the model requests two tools at once?
Why interviewers ask: Tests practical implementation knowledge --- parallel calls are common and handling them incorrectly causes API errors.
Model answer:
The model can return multiple tool calls in a single response when the user's message requires more than one action. For example, "Improve my bio AND check if this message is safe" triggers both improveBio and moderateText.
You must return a result for every tool call, each referencing its specific tool_call_id. Missing or mismatched IDs cause API errors on the follow-up request.
// Model returns two tool calls:
// assistantMessage.tool_calls = [
// { id: 'call_aaa', function: { name: 'improveBio', arguments: '...' } },
// { id: 'call_bbb', function: { name: 'moderateText', arguments: '...' } },
// ]
// Execute all tool calls in parallel with Promise.all
const toolResults = await Promise.all(
assistantMessage.tool_calls.map(async (toolCall) => {
const fnName = toolCall.function.name;
const fnArgs = JSON.parse(toolCall.function.arguments);
if (!functionMap[fnName]) {
return {
role: 'tool',
tool_call_id: toolCall.id,
content: JSON.stringify({ error: `Unknown function: ${fnName}` }),
};
}
try {
const result = await functionMap[fnName](fnArgs);
return {
role: 'tool',
tool_call_id: toolCall.id,
content: JSON.stringify(result),
};
} catch (error) {
return {
role: 'tool',
tool_call_id: toolCall.id,
content: JSON.stringify({ error: error.message }),
};
}
})
);
// Send ALL results back in the follow-up call
const finalResponse = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [...originalMessages, assistantMessage, ...toolResults],
tools,
});
You can also control this behavior with parallel_tool_calls: false in the API request to force the model to call tools one at a time. This is useful when tool calls have dependencies (e.g., the second tool needs the first tool's result).
Advanced (Q9--Q11)
Q9. Design a production-grade AI tool router with error handling. What are the layers of defense?
Why interviewers ask: Tests system design ability --- a production router needs much more than just dispatching to a function.
Model answer:
A production tool router has seven layers of error handling, each preventing a different failure mode:
Layer 1: Input Validation
validateUserInput() --- reject empty, too-long, or malformed messages
Happens BEFORE any API call (saves money on obviously bad input)
Layer 2: API Errors
try/catch around openai.chat.completions.create()
Handles: rate limits, auth errors, network failures, timeouts
Strategy: retry with exponential backoff, then fallback message
Layer 3: Argument Parsing
safeParseArguments() --- catches malformed JSON from the model
Returns error as a tool result so the model can explain gracefully
Layer 4: Unknown Functions
handlerMap[fnName] check --- catches hallucinated function names
Returns error listing available functions
Layer 5: Function Execution
try/catch around handlerMap[fnName](args)
Catches: database errors, API failures, validation errors within the handler
Returns error as a tool result for graceful degradation
Layer 6: Result Validation
validateToolResult() --- truncates oversized results to prevent token budget overflow
Logs a warning when truncation occurs
Layer 7: Final Response Fallback
If the second LLM call (formatting) fails, return raw tool results
User still gets useful information even when formatting breaks
Beyond error handling, the router needs: logging (every tool call with name, args, result, duration, success/failure), rate limiting (per-user, per-tool), context-aware tool selection (premium users see more tools), and retry logic with exponential backoff.
async function executeToolCall(toolCall) {
const fnName = toolCall.function.name;
// Layer 4: Unknown function
if (!handlerMap[fnName]) {
return {
role: 'tool',
tool_call_id: toolCall.id,
content: JSON.stringify({
error: `Unknown function: ${fnName}. Available: ${Object.keys(handlerMap).join(', ')}`,
}),
};
}
// Layer 3: Argument parsing
const parsed = safeParseArguments(toolCall.function.arguments);
if (!parsed.success) {
return {
role: 'tool',
tool_call_id: toolCall.id,
content: JSON.stringify({ error: parsed.error }),
};
}
// Layer 5: Function execution
try {
const result = await handlerMap[fnName](parsed.data);
// Layer 6: Result validation
const validated = validateToolResult(result, 4000);
return {
role: 'tool',
tool_call_id: toolCall.id,
content: validated.valid ? validated.serialized : validated.truncated,
};
} catch (error) {
return {
role: 'tool',
tool_call_id: toolCall.id,
content: JSON.stringify({ error: `Execution failed: ${error.message}` }),
};
}
}
The key principle across all layers: return errors as tool results, not thrown exceptions. This lets the model explain the problem naturally to the user instead of the application crashing.
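Layer 2's retry strategy can be sketched as a generic wrapper; the helper name and backoff constants here are my own choices, not from any SDK:

```javascript
// Sketch of Layer 2: retry a flaky API call with exponential backoff.
// maxAttempts and baseMs are illustrative defaults.
async function withRetry(fn, { maxAttempts = 3, baseMs = 250 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // 250ms, 500ms, 1000ms... A real implementation would also add jitter
      // and skip retries for non-retryable errors (e.g., invalid auth).
      const delay = baseMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // Layer 7's fallback catches this
}
```

Usage would look like: `const response = await withRetry(() => openai.chat.completions.create(payload));`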
Q10. How do you optimize the cost of tool calling at scale? Walk through a cost analysis.
Why interviewers ask: Tests production awareness --- tool calling multiplies LLM costs, and managing this is a real engineering challenge.
Model answer:
Tool calling has three cost drivers that compound: (1) tool definitions consume tokens on every request, (2) each tool-call interaction requires at least two LLM calls (routing + final response), and (3) hybrid functions that use AI internally add a third call.
Cost breakdown for a single improveBio interaction:
Call 1: Routing decision
Input: system prompt + user message + 3 tool schemas = ~800 tokens
Output: tool_call JSON = ~50 tokens
Cost: 800 * $2.50/1M + 50 * $10/1M = ~$0.0025
Call 2: Bio generation (inside the improveBio function)
Input: specialized prompt + bio = ~200 tokens
Output: improved bio = ~100 tokens
Cost: 200 * $2.50/1M + 100 * $10/1M = ~$0.0015
Call 3: Final response formatting
Input: full conversation + tool result = ~1,000 tokens
Output: formatted message = ~150 tokens
Cost: 1,000 * $2.50/1M + 150 * $10/1M = ~$0.004
Total: ~$0.008 per interaction (3 LLM calls)
At 100,000 interactions/day: ~$800/day
Tool definition overhead alone:
3 tools * ~150 tokens each = 450 tokens per call
450 tokens * 100,000 calls/day = 45M tokens/day
45M * $2.50/1M = $112.50/day JUST for tool schemas
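The arithmetic above is mechanical enough to script. A sketch that reproduces the breakdown, using the same example GPT-4o rates ($2.50/1M input, $10/1M output) --- plug in your model's actual pricing:

```javascript
// Reproduces the cost breakdown above. Rates are the example GPT-4o prices
// in dollars per token; substitute your model's real pricing.
const INPUT_RATE = 2.5 / 1_000_000;
const OUTPUT_RATE = 10 / 1_000_000;

const cost = (inTok, outTok) => inTok * INPUT_RATE + outTok * OUTPUT_RATE;

const routing = cost(800, 50);        // ~$0.0025 (Call 1)
const generation = cost(200, 100);    // ~$0.0015 (Call 2)
const formatting = cost(1000, 150);   // ~$0.0040 (Call 3)
const perInteraction = routing + generation + formatting; // ~$0.008

const dailyCost = perInteraction * 100_000;               // ~$800/day
const schemaOverheadDaily = 450 * 100_000 * INPUT_RATE;   // $112.50/day for schemas alone
```

Making the model a parameter of a calculator like this is also how you evaluate the "cheaper routing model" strategy before committing to it.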
Optimization strategies:
- Use a cheaper model for routing. GPT-4o-mini is accurate enough for intent classification at a fraction of the cost. Use GPT-4o only for content generation inside functions.
- Only include relevant tools. If the UI context tells you the user is on the bio improvement page, send only the improveBio tool instead of all five.
- Use tool_choice: 'required' or a specific function when the intent is unambiguous. This skips the "should I call a tool?" decision, reducing output tokens.
- Cache tool results. If the same bio text with the same tone was improved recently, return the cached result.
- Combine routing and formatting. When the tool result is simple, you can sometimes skip the third LLM call and format the result in code.
- Keep descriptions concise. Every extra word in a tool description costs tokens on every call.
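The caching strategy can be as simple as a keyed map in front of the function. A minimal in-memory sketch --- the key scheme is my own, and a production system would typically use Redis or similar with a TTL:

```javascript
// Minimal in-memory cache for tool results, keyed by normalized arguments.
// Illustrative only --- production caches need eviction, TTLs, and shared storage.
const bioCache = new Map();

async function cachedImproveBio(args, improveBio) {
  // Normalize so trivially different inputs ("I like dogs" vs "i like dogs ")
  // with the same tone hit the same cache entry.
  const key = JSON.stringify([args.currentBio.trim().toLowerCase(), args.tone ?? 'witty']);
  if (bioCache.has(key)) {
    return { ...bioCache.get(key), cached: true }; // skip the LLM call entirely
  }
  const result = await improveBio(args);
  bioCache.set(key, result);
  return result;
}
```

Every cache hit removes an entire Call-2 generation from the cost breakdown above.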
Q11. The model sometimes routes to the wrong tool. How do you diagnose and fix routing accuracy without changing the model?
Why interviewers ask: Tests debugging methodology for a non-deterministic system --- a critical skill for AI engineering.
Model answer:
Routing accuracy problems come from three sources: ambiguous tool descriptions, overlapping tool scopes, and ambiguous user messages. The fix is always in your tool definitions and system prompt, not the model.
Step 1: Build a routing test suite. Define test cases with expected tool outcomes and run them at temperature: 0 for determinism:
const testCases = [
{ input: 'Make my bio better: "I like coffee"', expectedTool: 'improveBio' },
{ input: 'Help me message someone who likes yoga', expectedTool: 'generateOpeners' },
{ input: 'Is "call me at 555-1234" okay to send?', expectedTool: 'moderateText' },
{ input: 'Any tips for better photos?', expectedTool: 'getProfileTips' },
{ input: 'Thanks for the help!', expectedTool: null },
];
for (const tc of testCases) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: SYSTEM_PROMPT },
{ role: 'user', content: tc.input },
],
tools,
tool_choice: 'auto',
temperature: 0,
});
const actual = response.choices[0].finish_reason === 'tool_calls'
? response.choices[0].message.tool_calls[0].function.name
: null;
console.log(actual === tc.expectedTool ? 'PASS' : 'FAIL', tc.input);
}
Step 2: Fix tool descriptions. The model relies heavily on the description field. Make descriptions mutually exclusive with specific trigger phrases:
// BAD: Overlapping descriptions
'Improve a user profile' // improveBio
'Give advice about profiles' // getProfileTips
// Model cannot distinguish these
// GOOD: Specific, non-overlapping descriptions
'Rewrite a dating profile BIO text to be more engaging. Call when the user ' +
'says "improve my bio", "rewrite my profile text", "make my bio better". ' +
'NOT for general advice.'
'Get general tips about dating profiles, photos, or messaging strategy. ' +
'Call when the user asks "any tips?", "how do I improve?", "what makes a ' +
'good profile?". NOT for rewriting specific bio text.'
Step 3: Reduce tool count. More tools means more ambiguity. Consolidate overlapping tools. Five well-scoped tools outperform fifteen narrow ones.
Step 4: Use the system prompt to clarify boundaries. Add explicit routing guidance: "When the user provides specific bio text to rewrite, use improveBio. When they ask for general advice without providing text, use getProfileTips."
Step 5: Monitor in production. Log every routing decision with the user message. Review misrouted cases weekly. Add the failure cases to your test suite. Over time, your test suite becomes a regression safety net.
Quick-fire
| # | Question | One-line answer |
|---|---|---|
| 1 | Does the model execute the function? | No --- it returns the function name and arguments; your code executes |
| 2 | What is the finish_reason when a tool is called? | "tool_calls" (vs "stop" for text responses) |
| 3 | tool_calls[0].function.arguments is what type? | JSON string --- must JSON.parse() before use |
| 4 | tool_choice: 'required' means? | Model must call at least one tool (cannot respond with text only) |
| 5 | Why return errors as tool results instead of throwing? | Model can explain the problem naturally to the user; app does not crash |
| 6 | Old functions param vs current tools param? | functions is deprecated; tools is the current standard |
| 7 | What is the hybrid principle? | AI decides WHAT to do; code decides HOW to do it |
| 8 | How many LLM calls per tool-call interaction? | 2 minimum (routing + final response); 3 if the function uses AI internally |
| 9 | How to reduce tool definition token cost? | Include only relevant tools; keep descriptions concise |
| 10 | Tool result content must be? | A string --- JSON.stringify() objects before returning |
| 11 | parallel_tool_calls: false does what? | Forces the model to call tools one at a time instead of simultaneously |