Chain-of-Thought Reasoning: Making LLMs Show Their Work


You ask GPT-4: “A store has 23 apples. They sell 7 in the morning and receive a shipment of 12 in the afternoon. A customer returns 3 spoiled apples from yesterday. How many apples does the store have at closing?” Asked directly, without chain-of-thought, the model answers “25.” Wrong. It subtracted the returned apples instead of adding them. The expected answer is 31: 23 - 7 + 12 + 3 = 31, because the returned apples go back into the store’s inventory.

The difference? When you add “Let’s think through this step by step” to the prompt, the model explicitly writes out each operation: start with 23, subtract 7 gives 16, add 12 gives 28, add 3 returned gives 31. By externalizing its reasoning, the model catches the logical step that “returns” means apples coming back to the store, not leaving it.
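The same running total can be reproduced mechanically. A minimal sketch in Python; the `steps` list and the `running_totals` helper are illustrative names, not part of any library:

```python
# Each step is (description, signed change to inventory).
steps = [
    ("sell 7 in the morning", -7),
    ("receive a shipment of 12", +12),
    ("customer returns 3", +3),  # returns come back INTO inventory
]

def running_totals(start, steps):
    """Return the inventory after each step, in order."""
    totals = []
    current = start
    for _description, change in steps:
        current += change
        totals.append(current)
    return totals

print(running_totals(23, steps))  # [16, 28, 31]
```

Each entry in the result corresponds to one line of the model's written reasoning, which is what makes a sign error on the last step easy to spot.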

Chain-of-thought is not just a prompting trick. It is a fundamental shift in how you use LLMs for complex tasks - trading tokens for accuracy by forcing the model to compute intermediate steps explicitly rather than jumping to conclusions.

What chain-of-thought actually is

Chain-of-thought (CoT) prompting instructs the model to produce intermediate reasoning steps before the final answer. Instead of:

Q: If a train travels 60mph for 2.5 hours, how far does it go?
A: 150 miles

You get:

Q: If a train travels 60mph for 2.5 hours, how far does it go?
A: The train travels at 60 miles per hour.
   It travels for 2.5 hours.
   Distance = speed × time = 60 × 2.5 = 150 miles.
   The answer is 150 miles.

The key insight: the model is not “thinking” differently. It is generating tokens that represent intermediate computations. These tokens become part of the context for subsequent generation, effectively giving the model scratchpad space to work through problems.

graph TD
  subgraph direct["Direct Answering"]
      D1["Question"] --> D2["Answer<br/>(single jump)"]
  end
  subgraph cot["Chain-of-Thought"]
      C1["Question"] --> C2["Step 1: Identify given info"]
      C2 --> C3["Step 2: Determine approach"]
      C3 --> C4["Step 3: Execute calculation"]
      C4 --> C5["Step 4: Verify and answer"]
  end

  style D2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style C2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style C3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style C4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style C5 fill:#EEEDFE,stroke:#534AB7,color:#3C3489

Why it works

Token generation as computation

LLMs are autoregressive - each new token is conditioned on all previous tokens. When the model writes “60 × 2.5 = ” as an intermediate step, those tokens become context for generating “150.” Without the intermediate tokens, the model must perform the entire computation implicitly in a single forward pass through the network. Complex reasoning can require more “compute” than a single forward pass provides.

Think of it this way: the transformer has a fixed depth (number of layers). Each layer can do a limited amount of computation. For a simple lookup (“capital of France”), one forward pass is sufficient. For multi-step reasoning, one forward pass is not enough - the model needs to “unroll” the computation across multiple token-generation steps.
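The unrolling can be pictured with a toy decoding loop. This is a sketch, not a real model: `next_token` is a hypothetical stand-in for one forward pass, and the scripted tokens are made up for illustration. The only point is that every emitted token is appended to the context the next pass sees.

```python
from typing import Callable, List

def generate(prompt: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int = 32) -> List[str]:
    """Autoregressive loop: each call to next_token sees the prompt plus
    everything generated so far."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)   # stands in for one forward pass
        if token == "<eos>":
            break
        context.append(token)         # the output becomes input
    return context[len(prompt):]

# Hypothetical scripted "model": emits a fixed chain, using the context
# length to work out how far along it is.
SCRIPT = ["60", "×", "2.5", "=", "150", "<eos>"]

def scripted_next_token(context: List[str]) -> str:
    generated = len(context) - 1      # the prompt is a single token here
    return SCRIPT[generated]

print(generate(["Q:distance?"], scripted_next_token))
# ['60', '×', '2.5', '=', '150']
```

By the time the loop asks for the token after “=”, the operands are already sitting in the context. That is the scratchpad effect described above.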

Error correction opportunity

When reasoning is externalized, each step can be conditioned on the previous steps. If Step 2’s output does not logically follow from Step 1, the model has a chance to course-correct. With direct answering, there is no opportunity for the model to catch its own errors mid-stream.

Attention over intermediate results

Generated reasoning tokens are in the attention window for subsequent tokens. The model can “look back” at its earlier reasoning to maintain consistency. This is particularly important for problems requiring multiple facts to be held simultaneously.

Variants of chain-of-thought

Zero-shot CoT

Simply append “Let’s think step by step” (or similar) to your prompt. No examples needed:

Q: [complex question]
Let's think step by step.

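This works surprisingly well. The 2022 paper that introduced the technique (Kojima et al., from the University of Tokyo and Google) reported that this single phrase improves accuracy on math benchmarks by 10-40%, with the gains concentrated in larger models.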
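In code, zero-shot CoT is a one-line change to prompt construction. A minimal sketch; the function name and the `trigger` parameter are illustrative choices, not part of any SDK:

```python
def zero_shot_cot_prompt(question: str,
                         trigger: str = "Let's think step by step.") -> str:
    """Append the CoT trigger phrase on its own line after the question."""
    return f"Q: {question.strip()}\n{trigger}"

print(zero_shot_cot_prompt(
    "If a train travels 60mph for 2.5 hours, how far does it go?"
))
```

Keeping the trigger as a parameter makes it easy to A/B test alternative phrasings on your own task.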

Few-shot CoT

Provide examples that include the reasoning steps:

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many does he have?
A: Roger starts with 5 balls. He buys 2 cans of 3 balls each, so 2 × 3 = 6 new balls. Total: 5 + 6 = 11 balls.

Q: [your actual question]
A:

Few-shot CoT typically outperforms zero-shot CoT because the examples demonstrate the desired reasoning depth and format.
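Assembling such a prompt programmatically might look like the following sketch. The helper name and the `(question, reasoning)` tuple layout are assumptions made for illustration:

```python
from typing import List, Tuple

def few_shot_cot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Render worked (question, reasoning) pairs, then the real question
    with an open "A:" for the model to complete."""
    blocks = [f"Q: {q}\nA: {reasoning}" for q, reasoning in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

examples = [(
    "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many does he have?",
    "Roger starts with 5 balls. He buys 2 cans of 3 balls each, "
    "so 2 × 3 = 6 new balls. Total: 5 + 6 = 11 balls.",
)]
print(few_shot_cot_prompt(examples, "A store has 23 apples and sells 7. How many remain?"))
```

The trailing open `A:` matters: it cues the model to continue in the same reasoning style as the examples.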

Self-consistency (majority voting)

Generate multiple chain-of-thought reasoning paths (with temperature > 0) and take the majority answer:

Path 1: 23 - 7 = 16, 16 + 12 = 28, 28 + 3 = 31 → Answer: 31
Path 2: 23 - 7 = 16, 16 + 12 = 28, 28 - 3 = 25 → Answer: 25
Path 3: 23 - 7 = 16, 16 + 12 = 28, 28 + 3 = 31 → Answer: 31
Majority: 31 ✓

Self-consistency improves accuracy by 5-15% over single-path CoT but costs roughly N× as many tokens, where N is the number of sampled paths.
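The voting step is simple to sketch. Here `sample_answer` is a hypothetical callable that runs one CoT generation at temperature > 0 and returns the extracted final answer, or `None` if extraction failed. The canned answers below stand in for real model calls.

```python
from collections import Counter
from typing import Callable, Optional

def self_consistent_answer(sample_answer: Callable[[], Optional[str]],
                           n_paths: int = 5) -> Optional[str]:
    """Sample n_paths answers and return the most common one.
    Paths that produced no answer (None) are ignored."""
    answers = [a for a in (sample_answer() for _ in range(n_paths)) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Stand-in for three sampled reasoning paths (assumption: canned data).
canned = iter(["31", "25", "31"])
print(self_consistent_answer(lambda: next(canned), n_paths=3))  # 31
```

Voting is done on the extracted final answers, not on the reasoning text, so two paths that reason differently but agree on “31” still count as the same vote.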

Tree of thought

Explore multiple reasoning branches at each step, evaluate which branches are promising, and continue only the best ones. More powerful than linear CoT for planning and search problems, but significantly more expensive.
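One way to picture the control flow is a breadth-first search that keeps only the best few partial solutions at each level. This is a simplified sketch under stated assumptions: in a real system `expand` and `score` would both be LLM calls (propose next thoughts, rate a partial solution), whereas the toy functions below are placeholders that make the example runnable.

```python
from typing import Callable, List, Sequence

def tree_of_thought(root: str,
                    expand: Callable[[str], Sequence[str]],
                    score: Callable[[str], float],
                    beam_width: int = 2,
                    depth: int = 3) -> str:
    """Expand every state in the frontier, keep the beam_width
    highest-scoring children, repeat for `depth` levels."""
    frontier: List[str] = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]   # prune unpromising branches
    return max(frontier, key=score)

# Toy problem (assumption): build a 3-character string with as many "1"s
# as possible. Each "thought" appends one character.
def toy_expand(state: str) -> List[str]:
    return [state + "0", state + "1"]

def toy_score(state: str) -> float:
    return float(state.count("1"))

print(tree_of_thought("", toy_expand, toy_score))  # 111
```

The cost is visible in the loop: every level calls `expand` once per frontier state and `score` once per candidate, which is why this approach is significantly more expensive than a single linear chain.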

graph TD
  subgraph variants["CoT Variants"]
      V1["Zero-shot CoT<br/>'Think step by step'<br/>Cheapest, moderate gain"]
      V2["Few-shot CoT<br/>Examples with reasoning<br/>Best quality/cost ratio"]
      V3["Self-Consistency<br/>Multiple paths, vote<br/>Higher accuracy, N× cost"]
      V4["Tree of Thought<br/>Branching exploration<br/>Highest accuracy, expensive"]
  end

  V1 --> V2
  V2 --> V3
  V3 --> V4

  style V1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style V2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style V3 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style V4 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

Where chain-of-thought breaks

It does not help simple tasks

For tasks the model can already handle in one step (sentiment classification, simple extraction, translation), CoT adds tokens without improving accuracy. It can actually hurt performance by introducing unnecessary reasoning steps that lead to overthinking.

Faithful vs unfaithful reasoning

The model’s stated reasoning does not always reflect its actual computation. Research has shown cases where the model arrives at the correct answer through flawed reasoning, or states correct reasoning but produces an inconsistent final answer. The chain-of-thought is generated text, not a transparent view into the model’s internal process.

Compounding errors

If an early reasoning step is wrong, all subsequent steps build on that error. In a 5-step chain, an error in Step 2 corrupts Steps 3-5. Self-consistency helps (different paths make different errors), but single-path CoT is vulnerable to early mistakes cascading.
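A back-of-the-envelope model shows how quickly this adds up. The sketch below assumes each step is correct independently with the same probability, which is a simplification, and the 95% figure is an illustrative number rather than a measured one.

```python
def chain_success_probability(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that every step in a chain is correct, assuming
    independent steps with equal accuracy (a simplifying assumption)."""
    return per_step_accuracy ** n_steps

for n in (1, 5, 10):
    print(n, round(chain_success_probability(0.95, n), 3))
# 1 0.95
# 5 0.774
# 10 0.599
```

Under these assumptions, a step that is right 95% of the time yields a fully correct 5-step chain only about 77% of the time.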

Token cost

CoT outputs are typically 3-10× longer than direct answers, and the gap is wider when the direct answer is very short. For a classification task, direct answering costs 1-5 output tokens, while CoT might cost 50-200. At scale, this difference matters: if you process 100K requests/day and CoT adds 150 tokens each at $15/M output tokens, that is 100,000 × 150 = 15M extra output tokens, or $225/day.
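The arithmetic is worth wrapping in a small helper so you can plug in your own traffic and pricing. The function and parameter names are illustrative, and the price is the example figure used above, not a quote for any specific model:

```python
def extra_cot_cost_per_day(requests_per_day: int,
                           extra_tokens_per_request: int,
                           usd_per_million_output_tokens: float) -> float:
    """Daily cost of the additional output tokens that CoT generates."""
    extra_tokens = requests_per_day * extra_tokens_per_request
    return extra_tokens / 1_000_000 * usd_per_million_output_tokens

print(extra_cot_cost_per_day(100_000, 150, 15.0))  # 225.0
```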

Model size dependency

CoT primarily benefits larger models (70B+ parameters). Smaller models often produce reasoning steps that are incoherent or lead to worse answers than direct prompting. The model needs sufficient capacity to generate useful intermediate reasoning.

Real-world applications

  • OpenAI o1/o3 - trained specifically to use chain-of-thought internally, spending “thinking tokens” before responding. The reasoning is hidden but the model uses dramatically more compute per answer
  • Wolfram Alpha integration - uses CoT to decompose complex questions into sub-calculations, then calls Wolfram for exact computation
  • LangChain agents - the ReAct pattern is essentially CoT with tool calls interspersed between reasoning steps
  • Code Interpreter - breaks complex data analysis into steps: load data → inspect structure → plan analysis → write code → execute → interpret results
  • Legal AI - reasoning through case law requires explicit step-by-step analysis to maintain logical rigor across multiple precedents

How to apply CoT in practice

For math and logic: Use CoT by default on standard (non-reasoning) models. The accuracy improvement is large (10-40%) and consistent. Use few-shot CoT with examples showing your expected reasoning format.

For classification: Skip CoT unless you need explainability. If you need to justify why something was classified a certain way, CoT provides the justification. Otherwise, direct classification is cheaper and equally accurate.

For planning and decomposition: Use CoT to break complex tasks into subtasks. “First, I need to… Then I can… Finally…” This is where CoT transitions into agent-like behavior.

For debugging model outputs: When the model gives wrong answers, add CoT to see where the reasoning fails. This is a diagnostic tool even if you do not use CoT in production.

Extracting the answer from CoT output: Always add explicit formatting instructions for the final answer. “After your reasoning, provide the final answer on a new line starting with ‘ANSWER:’” This makes programmatic extraction reliable.

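response = model.generate(cot_prompt)  # `model` is a placeholder for your LLM client
# Keep only the text after the last "ANSWER:" marker; None if the marker is missing
_, marker, tail = response.rpartition("ANSWER:")
answer = tail.strip() if marker else None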

FAQ

Q: If OpenAI o1 already uses internal chain-of-thought, do I still need to prompt for it?

With reasoning models (o1, o3), explicit CoT prompting is less necessary - the model already “thinks” before responding. In fact, instructing o1 to “think step by step” can be redundant or even counterproductive (double-reasoning). However, for standard models (GPT-4, Claude), explicit CoT prompting remains essential for complex reasoning tasks. Know which model type you are using and adjust accordingly.

Q: How many reasoning steps should the chain-of-thought have?

Match the steps to the problem complexity. For a 2-step math problem, 2-3 reasoning steps are appropriate. For a multi-variable optimization, 5-8 steps might be needed. Forcing too many steps on simple problems wastes tokens and can introduce errors. Forcing too few on complex problems loses accuracy. Let the examples demonstrate the appropriate depth, and the model will calibrate.

Q: Can I hide the chain-of-thought from users while still getting the accuracy benefits?

Yes. Generate the full CoT response, extract only the final answer, and display that to users. The reasoning tokens still condition the final answer, providing accuracy benefits. You pay for the extra tokens but users see a clean response. This is essentially what OpenAI o1 does - the reasoning is computed but hidden from the user.

Interview questions

Q: Your LLM-powered financial calculator gives incorrect results 15% of the time. Users input multi-step word problems about investments. How would you use chain-of-thought to improve accuracy?

Structure the CoT to mirror financial calculation steps: (1) identify all given values (principal, rate, time, compounding frequency), (2) determine which formula applies, (3) substitute values, (4) compute step by step, (5) verify the answer makes sense (sanity check). Use few-shot examples covering common financial scenarios (compound interest, loan amortization, ROI calculations). Add self-consistency with 3 paths for high-stakes calculations. For the final system: generate CoT, extract the numerical answer, and optionally verify with a deterministic calculator for pure arithmetic steps.
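The deterministic verification step could be sketched as follows for the compound-interest case. The function names and the tolerance are assumptions for illustration. A production system would need one reference implementation per supported formula.

```python
import math

def compound_interest_value(principal: float, annual_rate: float,
                            years: float, periods_per_year: int = 1) -> float:
    """Future value under periodic compounding: P * (1 + r/n) ** (n * t)."""
    return principal * (1 + annual_rate / periods_per_year) ** (periods_per_year * years)

def verify_model_answer(model_value: float, expected_value: float,
                        rel_tol: float = 1e-4) -> bool:
    """True if the model's extracted number matches the deterministic
    result within a relative tolerance (allows for rounding to cents)."""
    return math.isclose(model_value, expected_value, rel_tol=rel_tol)

expected = compound_interest_value(1000, 0.05, 2)   # about 1102.50
print(verify_model_answer(1102.50, expected))       # True
print(verify_model_answer(1100.00, expected))       # False
```

The model still does the hard part: reading the word problem and picking the values and formula. The calculator only checks the arithmetic, so a mismatch can trigger a retry or a fallback instead of reaching the user.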

Q: When would chain-of-thought actually hurt performance? Give specific examples.

CoT hurts when: (1) the task is simple enough that one-step reasoning suffices - adding steps introduces noise (e.g., binary classification of obvious spam); (2) the model is too small to generate coherent reasoning - small models produce gibberish steps that lead to worse answers; (3) time/cost constraints are tight - a real-time autocomplete feature cannot wait for 200 reasoning tokens; (4) the task requires pattern matching, not reasoning - sentiment analysis on clear text, language identification, simple entity extraction. Always benchmark with and without CoT on your specific task.

Q: Design a system that uses chain-of-thought for complex customer support queries but direct answering for simple ones. How do you route between them?

Two-stage approach: First, classify the query complexity (simple lookup vs multi-step reasoning) using a fast, cheap model or a rule-based classifier (keyword matching, question length, presence of “and”/“but”/“however” suggesting multi-part questions). Simple queries get direct answers with temperature 0. Complex queries get CoT with extracted final answer. Measure: if CoT queries have higher user satisfaction and simple queries have equivalent satisfaction without CoT, the routing is working. Monitor the boundary - queries misclassified as simple but receiving bad answers should trigger routing threshold adjustment.
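A rule-based first pass at the router might look like this sketch. The word-count threshold and the marker list are assumptions taken from the heuristics named above. They would need tuning against real traffic.

```python
import re

# Conjunctions that often signal a multi-part question (heuristic).
MULTI_PART_MARKERS = re.compile(r"\b(and|but|however)\b", re.IGNORECASE)

def is_complex(query: str, max_simple_words: int = 15) -> bool:
    """Flag a query as complex if it is long or contains a multi-part marker."""
    word_count = len(query.split())
    return word_count > max_simple_words or bool(MULTI_PART_MARKERS.search(query))

def route(query: str) -> str:
    """Return "cot" for complex queries, "direct" for simple ones."""
    return "cot" if is_complex(query) else "direct"

print(route("What are your opening hours?"))  # direct
print(route("I was charged twice and my refund has not arrived"))  # cot
```

Logging the route taken alongside the eventual satisfaction signal gives you the data needed to adjust the threshold when simple-routed queries start receiving bad answers.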