INFERENCE INC.

 ╔════════════════════════════════════════════════════════════════════════════╗
 ║                                                                            ║
 ║   ██╗███╗   ██╗███████╗███████╗██████╗ ███████╗███╗   ██╗ ██████╗███████╗  ║
 ║   ██║████╗  ██║██╔════╝██╔════╝██╔══██╗██╔════╝████╗  ██║██╔════╝██╔════╝  ║
 ║   ██║██╔██╗ ██║█████╗  █████╗  ██████╔╝█████╗  ██╔██╗ ██║██║     █████╗    ║
 ║   ██║██║╚██╗██║██╔══╝  ██╔══╝  ██╔══██╗██╔══╝  ██║╚██╗██║██║     ██╔══╝    ║
 ║   ██║██║ ╚████║██║     ███████╗██║  ██║███████╗██║ ╚████║╚██████╗███████╗  ║
 ║   ╚═╝╚═╝  ╚═══╝╚═╝     ╚══════╝╚═╝  ╚═╝╚══════╝╚═╝  ╚═══╝ ╚═════╝╚══════╝  ║
 ║                                                                            ║
 ║                   ·  I N C O R P O R A T E D  ·  EST 2026                  ║
 ║                                                                            ║
 ╚════════════════════════════════════════════════════════════════════════════╝

NVIDIA A100

HBM 80GB · $8k · workhorse

NVIDIA H100

HBM 80GB · $32k · premium

30 DAY RUN

1 datacenter · don't go bankrupt · don't poison the town

a Bamboo Security studio joint

You are running a small AI datacenter.

Every time someone asks an AI a question - a coding question, a search, an image generation - a computer somewhere actually does the work. That computer is a GPU (graphics processing unit), and one full "ask" run through the model is called an inference.

Your job: own enough GPUs to handle the day's inference demand, price them right, and keep the power cheap without poisoning the town.

   ┌───────────────────┐
   │ ████ NVIDIA  A100 │   workhorse · $8k · 280 inf/day
   │ ████ ░░░░░░  ▓▓▓▓ │   the steady volume tier  -  indie devs,
   │ ████ HBM:80G ▓▓▓▓ │   cheap inference, RAG pipelines.
   └───────────────────┘

   ┌───────────────────┐
   │ ▓▓▓▓ NVIDIA  H100 │   premium · $32k · 1100 inf/day
   │ ▓▓▓▓ ░░░░░░  ████ │   the standard for production LLM APIs.
   │ ▓▓▓▓ HBM:80G ████ │   pay-for-quality customers ride these.
   └───────────────────┘

   ┌─────────────────────┐
   │ ██▓██ NVIDIA  B200  │   frontier · $48k · 1700 inf/day
   │ ██▓██ ░░░░░░  ▓███▓ │   the new generation. frontier labs and
   │ ██▓██ HBM:192G ████ │   long-context customers fight for these.
   └─────────────────────┘

How an inference happens:

A user types a prompt: "write me a python script"
The prompt is broken into tokens (~750 words per 1000 tokens).
The GPU loads the model (tens to hundreds of GB) into its HBM memory.
It runs a forward pass - billions of matrix multiplications.
It streams tokens back to the user, one at a time.
That single response = roughly 1 "inference" in this game.
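
The words-to-tokens rule of thumb above can be turned into a quick estimate. This is a rough sketch only: real tokenizers split text by sub-word pieces, not by whitespace, and the helper name estimate_tokens is made up for illustration. The only number taken from the text is ~750 words per 1000 tokens.

```python
import math

# Rule of thumb from the text: ~750 words per 1000 tokens.
WORDS_PER_1000_TOKENS = 750

def estimate_tokens(prompt: str) -> int:
    """Rough token count from a whitespace word count (hypothetical helper)."""
    words = len(prompt.split())
    # Round up: a partial token still counts as a token.
    return math.ceil(words * 1000 / WORDS_PER_1000_TOKENS)

print(estimate_tokens("write me a python script"))  # 5 words -> 7 tokens
```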

Why the tiers matter:

A100 - older but cheap and plentiful. Can't fit the biggest models. Good for fine-tuned 7B-class workloads.
H100 - current production standard. Most paying customers will only sign for H100-or-better.
B200 - newest, fastest. Frontier-research workloads need these or they go elsewhere.

The catch: GPUs burn a lot of electricity. A rack of B200s pulls more power than a small neighborhood. Where you get that power is the real game - and the regulators are watching.

Hit BACK and pick a difficulty. Easy = short campaign with more starting cash; Hard = 60 days, you're under-capitalized.

POWER ($/kWh): diesel $0.100 · grid $0.160 · solar $0.240

$/inf (market): $5.00

You run an AI compute datacenter for 30 days. Make money. Don't go bankrupt. Don't poison the town.

Each day:

Read the news ticker. Model releases drive demand spikes.
Buy GPUs (click +A100 / +H100 / +B200). Each one fills a slot on the floor.
Pick power: diesel (cheap, smoke fines), grid (medium, blackouts), solar (clean, cheapest per kWh but must be signed by day 5).
Set your $/inference. Charge too much, customers leave. Too little, you bleed.
Hit RUN DAY. Watch your floor run. Adjust tomorrow.

Score = ending cash + (reputation × $1k) − fines paid. Top of the leaderboard wins bragging rights.
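
The score formula is simple enough to write down directly. A minimal sketch, assuming the three inputs are plain numbers; the function name and the example values are invented for illustration and are not taken from the game's code.

```python
def final_score(ending_cash: float, reputation: float, fines_paid: float) -> float:
    """Score = ending cash + (reputation x $1k) - fines paid."""
    return ending_cash + reputation * 1_000 - fines_paid

# Hypothetical end of run: $120k in the bank, reputation 45, $8k in fines.
print(final_score(120_000, 45, 8_000))  # 120000 + 45000 - 8000 = 157000
```

One point of reputation is worth the same as $1,000 of cash at the end, so a fine that also costs reputation hurts twice.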

Tip: premium customers (H100 / B200 buyers) won't sign if reputation drops below 30. Stay clean enough for them to take your calls.

Keyboard:

Space / Enter run the day
A H B buy A100 / H100 / B200 (hold Shift to sell)
1 2 3 switch fuel: diesel / grid / solar
↑ ↓ nudge price $/inf by ±$0.50
X clear cart · M marketing · S sound
L leaderboard · ? this screen · Esc close overlays

Six lessons. Each one teaches a piece of the real AI-compute economy you're playing. You don't have to read these to win - but if you want to know why an H100 costs $32,000 and what an "NVL72 rack" actually is, start at the top.

01 Why GPU memory is the expensive part (HBM)

Imagine a checkout clerk who can scan 100 items per minute. The customer hands items over one at a time, slowly. The clerk's speed doesn't matter - the line moves at the customer's pace.

A GPU's compute units are the clerk. They can do trillions of math operations per second. But the math operates on data - model weights, in-flight conversation state - and that data lives in memory. Normal computer memory (DDR) delivers ~100 GB/s. GPU compute can eat ~10 TB/s when running flat out. So the bottleneck is memory bandwidth, not raw math.

HBM (High Bandwidth Memory) is the fix: stacks of DRAM dies sitting right next to the GPU chip on the same interposer, wired with a wide parallel bus. A B200 has ~8 TB/s of HBM bandwidth, about 80× normal RAM. That's why HBM costs 5× normal DRAM and why a 192 GB B200 is so expensive: the memory is the expensive part.

In this game: each tier's capacity (A100 280 / H100 1100 / B200 1700 inf/day) is set mostly by HBM size and bandwidth, not raw FLOPS. When you see "KV cache full" in a news event, the HBM ran out of room for in-flight conversations.
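
To see what a floor of mixed cards can serve, add up the per-card numbers. A small sketch using only the capacities and prices quoted in the cards above; the function names and the example fleet are hypothetical.

```python
# Per-card numbers quoted in the text: daily capacity and purchase price.
INF_PER_DAY = {"A100": 280, "H100": 1100, "B200": 1700}
PRICE_USD = {"A100": 8_000, "H100": 32_000, "B200": 48_000}

def fleet_capacity(fleet: dict[str, int]) -> int:
    """Total inferences per day for a fleet given as {tier: card count}."""
    return sum(INF_PER_DAY[tier] * count for tier, count in fleet.items())

def dollars_per_daily_inference(tier: str) -> float:
    """Purchase price divided by daily capacity: a crude value-for-money figure."""
    return PRICE_USD[tier] / INF_PER_DAY[tier]

fleet = {"A100": 4, "H100": 2}  # hypothetical starting floor
print(fleet_capacity(fleet))    # 4*280 + 2*1100 = 3320
for tier in INF_PER_DAY:
    print(tier, f"${dollars_per_daily_inference(tier):.2f} per inf/day of capacity")
```

On sticker price alone the three tiers land within about a dollar of each other per unit of daily capacity, so the choice comes down to which customers each tier can serve.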

02 PUE: why the AC eats half your power bill

Run a 700 W GPU. It dumps 700 W of heat into the room. In a server closet, that heat piles up fast. You need cooling - fans, chillers, sometimes water loops - and the cooling itself burns electricity.

PUE (Power Usage Effectiveness) is the ratio: (total facility electricity) / (electricity actually doing compute). PUE 1.0 would be magic: cooling, lights, networking all cost zero. In reality:

Iceland pulling outside air straight in: PUE 1.05-1.12 - basically free cooling.
Mild climate, decent build: PUE 1.20.
Texas in August, air-cooled: PUE 1.40-1.50.

A PUE of 1.45 means for every dollar of GPU electricity, you spend another 45 cents on cooling and other facility overhead. That's why Iceland, Quebec, Wyoming, and northern Sweden win on operating margin - and Phoenix loses.

In this game: each region carries a basePUE (TX 1.45, CA 1.20, IS 1.05). Your daily power bill is fleet kWh × PUE × $/kWh. When a heatwave hits the news ticker, PUE jumps another 0.35 and your cooling kWh starts bleeding you out.
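
The power-bill formula above can be sketched in a few lines. The basePUE values and the 0.35 heatwave bump come from the text; the function name, the 1,000 kWh load, and the assumption that the bump is simply added to the base PUE are illustrative guesses, not the game's actual code.

```python
# Base PUE per region, as listed in the text.
BASE_PUE = {"TX": 1.45, "CA": 1.20, "IS": 1.05}
HEATWAVE_PUE_BUMP = 0.35  # assumed to be added on top of base PUE

def daily_power_bill(fleet_kwh: float, region: str, usd_per_kwh: float,
                     heatwave: bool = False) -> float:
    """fleet kWh x PUE x $/kWh, the formula quoted above."""
    pue = BASE_PUE[region] + (HEATWAVE_PUE_BUMP if heatwave else 0.0)
    return fleet_kwh * pue * usd_per_kwh

# Hypothetical day: 1,000 kWh of GPU load in Texas on grid power at $0.16/kWh.
normal = daily_power_bill(1_000, "TX", 0.16)
hot = daily_power_bill(1_000, "TX", 0.16, heatwave=True)
print(f"normal ${normal:.2f}, heatwave ${hot:.2f}")  # normal $232.00, heatwave $288.00
```

The same load in Iceland (PUE 1.05) would cost $168.00, which is the whole argument for cold-climate sites.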

03 Inference is two different jobs: prefill + decode

When a user asks an LLM "explain quantum entanglement," two phases run back to back.

Prefill: the model reads the prompt (say 200 tokens) all at once, in parallel. Every prompt token is already known up front, so none of them has to wait for another to be generated. This phase is compute-bound - lots of math to do, GPU horsepower available, and it batches well across users.

Decode: the model writes the answer one token at a time. Token N+1 needs tokens 1...N to already exist, so there is no parallelism within a single answer. Each new token requires re-reading all the model's weights out of HBM. This phase is memory-bandwidth-bound - the GPU's math units sit idle waiting on memory.

So a GPU's "tokens per second" is much lower than its theoretical FLOPS would suggest. Long answers spend most of their time in decode, capped by HBM bandwidth, not compute.
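
A back-of-envelope ceiling follows from the decode description: if every new token re-reads all the weights, one answer cannot decode faster than HBM bandwidth divided by the size of the weights. The sketch below uses the ~8 TB/s B200 figure from lesson 01; the 70 GB weight size (a 70 B model at one byte per parameter) is an assumption, and the estimate ignores batching and KV cache reads.

```python
def decode_tokens_per_sec_ceiling(hbm_bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s for one answer: one full weight read per token."""
    return hbm_bandwidth_gb_s / weights_gb

# ~8 TB/s of HBM bandwidth (from the text), 70 GB of weights (assumed).
ceiling = decode_tokens_per_sec_ceiling(8_000, 70)
print(f"{ceiling:.0f} tokens/s at most")  # 114 tokens/s at most
```

Batching many users together lets one weight read serve several answers at once, which is how real servers get past this per-answer ceiling.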

In this game: when a news card says "long-context decode is memory-bound" (day 24), it's telling you the H100 hits its HBM ceiling before its math ceiling. The B200 (bigger HBM, faster bus) handles it; the H100 stalls.

04 KV cache: why long conversations get expensive

To write token N+1, the model has to "remember" what tokens 1...N said. Re-deriving that from scratch every step would be 1000× too slow. So during decode the model keeps a running scratch-pad called the KV cache (keys and values from the attention mechanism). It lives in HBM, next to the model weights.

Rough size: ~0.5 MB of KV cache per token per active conversation for a 70 B model. A single 1 M-token conversation eats ~500 GB of HBM by itself. You cannot fit that on one H100 (80 GB HBM). Your options:

Spill the cache to slower memory (slow).
Evict another user's session (drops their chat).
Shard the cache across more GPUs (only works with NVLink).

That's why 1 M-context windows are economically painful: each long-context user hogs the HBM that would otherwise serve 30 short users.
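
The sizing above is plain multiplication. A minimal sketch using the ~0.5 MB per token figure and the 80 GB H100 from the text; the function names are hypothetical, decimal units are assumed (1 GB = 1000 MB), and the room taken by the model weights themselves is ignored.

```python
import math

KV_MB_PER_TOKEN = 0.5  # rough figure from the text, for a 70 B model
H100_HBM_GB = 80

def kv_cache_gb(tokens: int) -> float:
    """KV cache size for one conversation, in GB (1 GB = 1000 MB)."""
    return tokens * KV_MB_PER_TOKEN / 1000

def gpus_needed_for_cache(tokens: int, hbm_gb: float = H100_HBM_GB) -> int:
    """GPUs needed to hold the cache alone, ignoring the model weights."""
    return math.ceil(kv_cache_gb(tokens) / hbm_gb)

print(kv_cache_gb(1_000_000))            # 500.0
print(gpus_needed_for_cache(1_000_000))  # 7
```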

In this game: when "Opus 5 ships 1 M-context" hits the news, your H100 capacity is halved that day. The fix in real life is more/bigger HBM (move to B200) or a connected cluster (NVL72).

05 Cluster not card: NVLink and NVL72

Frontier models (GPT-6, Opus 5-class, ~1 trillion parameters) don't fit on one GPU. A B200 has 192 GB HBM; trillion-param weights are 400-800 GB depending on how aggressively you quantize. So you shard: split the weights across 8 GPUs, each holding 1/8 of the model.
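
The sharding arithmetic can be checked quickly. This sketch only divides the weight sizes quoted above by the GPU count; the function names are made up, and it leaves no headroom for KV cache or activations, which a real deployment would need.

```python
B200_HBM_GB = 192  # from the text

def shard_size_gb(weights_gb: float, n_gpus: int) -> float:
    """Weights held by each GPU when the model is split evenly."""
    return weights_gb / n_gpus

def fits(weights_gb: float, n_gpus: int, hbm_gb: float = B200_HBM_GB) -> bool:
    """True if each GPU's shard fits in its HBM (weights only)."""
    return shard_size_gb(weights_gb, n_gpus) <= hbm_gb

for weights_gb in (400, 800):  # the quantization range quoted above
    print(weights_gb, fits(weights_gb, 1), fits(weights_gb, 8), shard_size_gb(weights_gb, 8))
# 400 False True 50.0
# 800 False True 100.0
```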

The catch: at every forward pass, those 8 GPUs have to swap partial answers with each other. Over normal PCIe (~64 GB/s) they spend more time waiting on the bus than computing. The cluster is then slower than a single big GPU would be, if a single big GPU existed.

NVIDIA's answer is NVLink: a direct GPU-to-GPU bus at ~900 GB/s, about 14× PCIe. NVSwitch ties many NVLink lanes into a mesh. Bundle 72 B200s in one rack with NVSwitch fabric and you get an NVL72: 72 GPUs behaving like one logical GPU with 13.8 TB of HBM and ~130 PFLOPS. Costs ~$3M per rack and pulls ~120 kW - more than a small neighborhood draws.
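
The ratios in this lesson fall out of the numbers already quoted. A small sketch: the link speeds, HBM size, and GPU count come from the text, while the 1 GB payload is a made-up example to show how transfer time scales with link speed.

```python
PCIE_GB_S = 64      # from the text
NVLINK_GB_S = 900   # from the text
B200_HBM_GB = 192
NVL72_GPUS = 72

def transfer_ms(payload_gb: float, link_gb_s: float) -> float:
    """Time to move a payload over a link, ignoring latency and protocol overhead."""
    return payload_gb / link_gb_s * 1000

print(f"NVLink vs PCIe: {NVLINK_GB_S / PCIE_GB_S:.1f}x")      # 14.1x
print(f"rack HBM: {NVL72_GPUS * B200_HBM_GB / 1000:.1f} TB")  # 13.8 TB

payload_gb = 1.0  # hypothetical amount of data swapped between GPUs
print(f"PCIe:   {transfer_ms(payload_gb, PCIE_GB_S):.3f} ms")
print(f"NVLink: {transfer_ms(payload_gb, NVLINK_GB_S):.3f} ms")
```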

In this game: pre-day 15, your B200s are isolated cards (1700 inf/day each). Day 15 the OEM ships NVL72 and your B200s become 2210 inf/day each. Same chips, new topology, frontier workloads now run coherently.
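
The day-15 switch works out to a 1.3× uplift per card (2210 / 1700). A minimal sketch, assuming the new rate applies from day 15 itself; the function name is hypothetical and this is not the game's actual code.

```python
B200_ISOLATED_INF_PER_DAY = 1700  # before NVL72 ships
B200_NVL72_INF_PER_DAY = 2210     # after NVL72 ships
NVL72_SHIP_DAY = 15

def b200_capacity(day: int, cards: int) -> int:
    """Daily inference capacity of a B200 fleet on a given day."""
    # Assumption: the uplift takes effect on day 15, not the day after.
    per_card = (B200_NVL72_INF_PER_DAY if day >= NVL72_SHIP_DAY
                else B200_ISOLATED_INF_PER_DAY)
    return per_card * cards

print(b200_capacity(14, 4))  # 6800
print(b200_capacity(15, 4))  # 8840
print(B200_NVL72_INF_PER_DAY / B200_ISOLATED_INF_PER_DAY)  # 1.3
```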

06 Are we in a bubble?

In 2024-2026, the hyperscalers (Microsoft, Meta, Google, Amazon, Oracle, OpenAI's Stargate consortium) committed roughly $500 B of new datacenter buildout. Annual AI capex is now 2-3× the entire 1999 telecom buildout, in real dollars. Every quarter brings a bigger number.

The bull case: inference demand is growing exponentially. Every white-collar workflow eventually has an agent. Every search query becomes an LLM query. Capacity built today is full by the time it's online.

The bear case: model improvement is plateauing (the "frontier wall"). DeepSeek-style efficiency shocks make models 10× cheaper to run, blunting demand growth. Stargate alone could glut the market by 2027.

Truth: nobody knows. The capex is real, the demand is real, the question is timing. If you want to track it, SemiAnalysis runs the spreadsheet.

In this game: the day-28 finale is a coin flip. Heads = demand 3× (bull peak). Tails = compute glut, resale -40% (bear correction). Same fundamentals, different macro outcome - which is exactly the actual debate.
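
The finale can be sketched as a single coin flip. Assumptions, loudly: the coin is treated as fair, the bull outcome is assumed to leave resale values alone, and the bear outcome is assumed to leave demand alone. None of that is confirmed by the text beyond "demand 3×" and "resale -40%", and the function and field names are invented.

```python
import random

def day28_finale(rng: random.Random) -> dict:
    """Flip a (presumed fair) coin and return the macro outcome as multipliers."""
    if rng.random() < 0.5:
        # Heads: bull peak, demand triples.
        return {"outcome": "bull", "demand_mult": 3.0, "resale_mult": 1.0}
    # Tails: compute glut, resale value drops 40%.
    return {"outcome": "bear", "demand_mult": 1.0, "resale_mult": 0.6}

rng = random.Random(28)  # seeded only so the sketch is repeatable
flips = [day28_finale(rng)["outcome"] for _ in range(10_000)]
print(flips.count("bull") / len(flips))  # close to 0.5
```

Since either outcome is equally likely under these assumptions, a floor that survives the bear case and still has spare capacity for the bull case is the hedge.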
