Thermal-Continuity Workload Routing

Abstract

AI data centers are increasingly discussed as energy problems. They are also continuity problems. Workloads arrive over time, heat accumulates over time, cooling systems respond over time, and hardware degradation compounds over time. A scheduler that ignores thermal history may make locally rational decisions that create global inefficiency — routing a heavy workload to a node that appears available but is already near its thermal limit, triggering throttling, cooling spikes, and cascading effects that reduce overall throughput.

Thermal-Continuity Workload Routing proposes a simple open-source simulator in which jobs are routed to GPU nodes using a combined score built from four inputs: available compute, thermal headroom, recent heat trajectory, and cooling recovery time. The simulator allows contributors to compare random scheduling, least-loaded scheduling, and thermal-continuity scheduling across five measurable outcomes: peak heat, average heat, cooling proxy cost, throttling events, and job delay. The hypothesis is that thermal-continuity scheduling reduces peak heat and thermal oscillation without major throughput loss. That result, demonstrated in a reproducible simulator, would justify deeper research without requiring access to real data-center infrastructure.

II. The Core Formula

Thermal-Continuity Node Score: Conceptual Definition

TCS = AvailableCompute
      - HeatPenalty
      - ThermalTrendPenalty
      - CoolingRecoveryPenalty

Where:

AvailableCompute     = fraction of node capacity not currently committed
                       Range: 0–1

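HeatPenalty          = max(0, (current_temp - safe_threshold) / temp_range)
                       = 0 below safe threshold; increases linearly above it
                       temp_range is not fixed by this proposal; a natural
                       choice is max_operating_temp - safe_threshold

ThermalTrendPenalty  = rate_of_temperature_increase over last N minutes
                       Raw rate is positive if heating, negative if cooling
                       Normalized to 0–1 range before use (for example,
                       cooling rates map to 0 and the fastest expected
                       heating rate maps to 1)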

CoolingRecoveryPenalty = estimated_time_to_baseline / max_recovery_time
                         = 0 if at baseline; 1 if at maximum recovery time

Scheduling rule: assign job to node with highest TCS above minimum threshold
Fallback: queue job if no node exceeds minimum TCS

Note: Formula is conceptual. Real thermal constants should be
sourced from hardware specifications and data-center literature.
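
The conceptual definition above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names tcs_score and clamp01, the parameter names, the choice of temp_range = max_temp - safe_threshold, and the clamping used to normalize the trend and recovery penalties are all hypothetical choices made for this example.

```python
def clamp01(x: float) -> float:
    """Clamp a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def tcs_score(
    available_compute: float,  # fraction of capacity not committed, 0-1
    current_temp: float,       # degrees C
    safe_threshold: float,     # degrees C; HeatPenalty is 0 below this
    max_temp: float,           # degrees C; assumed upper end of temp_range
    temp_rate: float,          # degrees C per minute over the last N minutes
    max_temp_rate: float,      # assumed normalizer for the trend penalty
    time_to_baseline: float,   # estimated minutes to return to baseline
    max_recovery_time: float,  # minutes
) -> float:
    """Conceptual TCS: available compute minus three thermal penalties."""
    temp_range = max_temp - safe_threshold  # assumption, not from the paper
    heat_penalty = max(0.0, (current_temp - safe_threshold) / temp_range)
    # Assumption: cooling (negative rate) maps to 0, fastest heating maps to 1.
    trend_penalty = clamp01(temp_rate / max_temp_rate)
    recovery_penalty = clamp01(time_to_baseline / max_recovery_time)
    return available_compute - heat_penalty - trend_penalty - recovery_penalty


# Two nodes with identical free capacity but different thermal histories.
cool_node = tcs_score(0.8, 60.0, 70.0, 90.0, 0.0, 2.0, 0.0, 20.0)
hot_node = tcs_score(0.8, 80.0, 70.0, 90.0, 1.0, 2.0, 10.0, 20.0)
```

With these illustrative inputs the two nodes look identical to a least-loaded scheduler, yet the hot node scores well below the cool one because each of the three penalties contributes.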

III. Experiment Design

The experiment is reproducible by any contributor with Python and standard scientific computing libraries. No access to real data-center infrastructure is required.

Simulation Parameters

The simulation models 20 to 100 GPU nodes arranged across 4 thermal zones. Each node has a base temperature, maximum operating temperature, heat generation rate proportional to workload intensity, and cooling recovery rate. Thermal zones share ambient temperature and cooling infrastructure, creating realistic dependencies between nodes in the same zone.
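
One possible shape for the node model and a single simulation step is sketched below. Everything specific here is an assumption made for illustration: the Node fields' default values, the linear heating term, the proportional cooling term, and the coupling factor that lets a zone's mean excess heat raise the temperature each node cools toward.

```python
from dataclasses import dataclass


@dataclass
class Node:
    zone: int
    base_temp: float = 40.0    # degrees C at idle (assumed value)
    max_temp: float = 90.0     # degrees C (assumed value)
    heat_rate: float = 3.0     # degrees C per minute at full load (assumed)
    cooling_rate: float = 0.1  # share of excess heat shed per minute (assumed)
    load: float = 0.0          # current workload intensity, 0-1
    temp: float = 40.0         # current temperature, degrees C


def step_zone(nodes: list[Node], dt: float = 1.0, coupling: float = 0.2) -> None:
    """Advance all nodes of ONE thermal zone by dt minutes, in place."""
    # Shared-zone effect: the zone's mean excess heat raises the temperature
    # that every node in the zone cools toward.
    mean_excess = sum(n.temp - n.base_temp for n in nodes) / len(nodes)
    for n in nodes:
        floor = n.base_temp + coupling * mean_excess
        heating = n.heat_rate * n.load * dt
        cooling = n.cooling_rate * (n.temp - floor) * dt
        n.temp = min(n.max_temp, n.temp + heating - cooling)


# One fully loaded node and one idle node sharing a zone.
zone0 = [Node(zone=0, load=1.0), Node(zone=0)]
step_zone(zone0)
step_zone(zone0)
```

A caller would group nodes by their zone field and call step_zone once per zone per tick. After two steps the idle node has warmed slightly, which is the within-zone dependency the paragraph above describes.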

Job Generation

AI jobs are generated with random duration (1–60 minutes), intensity (0.1–1.0 compute fraction), and latency tolerance (immediate, flexible within 10 minutes, or batch). The mix reflects a realistic data center load: approximately 20% real-time inference, 40% flexible batch inference, and 40% training or evaluation workloads.
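
A seeded job generator matching these ranges might look like the sketch below. The Job fields, the function name generate_jobs, and in particular the mapping from workload class to latency tolerance are assumptions; the text gives the mix and the tolerance categories but does not tie them together.

```python
import random
from dataclasses import dataclass

# Workload mix from the text: (class name, share, assumed latency tolerance)
JOB_MIX = [
    ("realtime_inference", 0.20, "immediate"),
    ("batch_inference", 0.40, "flexible"),
    ("training_or_eval", 0.40, "batch"),
]


@dataclass
class Job:
    kind: str
    latency: str
    duration_min: int  # 1-60 minutes
    intensity: float   # 0.1-1.0 compute fraction


def generate_jobs(n: int, seed: int = 0) -> list[Job]:
    """Generate n jobs reproducibly from a fixed seed."""
    rng = random.Random(seed)
    kinds = [kind for kind, _, _ in JOB_MIX]
    weights = [share for _, share, _ in JOB_MIX]
    latency = {kind: tol for kind, _, tol in JOB_MIX}
    jobs = []
    for _ in range(n):
        kind = rng.choices(kinds, weights=weights, k=1)[0]
        duration = rng.randint(1, 60)  # inclusive on both ends
        intensity = round(rng.uniform(0.1, 1.0), 3)
        jobs.append(Job(kind, latency[kind], duration, intensity))
    return jobs


jobs = generate_jobs(1000, seed=42)
```

Using a private random.Random instance rather than the module-level generator keeps runs reproducible even when other parts of the simulator also draw random numbers.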

Three Scheduling Policies

Random: Assign each job to a randomly selected available node. Baseline for comparison.

Least-loaded: Assign each job to the node with the most available compute. Standard industry practice.

Thermal-continuity: Assign each job using TCS as defined above. Preferred node is highest TCS above minimum threshold.
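
The three policies can be expressed as small selection functions, sketched here under simplifying assumptions: nodes are identified by hypothetical string names, "available" means any free capacity above zero, TCS values are precomputed, and a return value of None stands for queueing the job.

```python
import random
from typing import Optional


def pick_random(available: dict[str, float], rng: random.Random) -> Optional[str]:
    """Random baseline: any node with spare capacity."""
    candidates = [name for name, free in available.items() if free > 0]
    return rng.choice(candidates) if candidates else None


def pick_least_loaded(available: dict[str, float]) -> Optional[str]:
    """Least-loaded: the node with the most available compute."""
    candidates = {name: free for name, free in available.items() if free > 0}
    return max(candidates, key=candidates.get) if candidates else None


def pick_thermal_continuity(tcs: dict[str, float], min_tcs: float) -> Optional[str]:
    """Thermal-continuity: highest TCS above the minimum, else queue (None)."""
    eligible = {name: score for name, score in tcs.items() if score > min_tcs}
    return max(eligible, key=eligible.get) if eligible else None


# Node "a" has more free capacity, but node "b" has the better thermal score.
available = {"a": 0.9, "b": 0.5}
tcs = {"a": 0.1, "b": 0.4}
least_loaded_choice = pick_least_loaded(available)
thermal_choice = pick_thermal_continuity(tcs, min_tcs=0.2)
queued = pick_thermal_continuity(tcs, min_tcs=0.5)
```

In this example the least-loaded policy and the thermal-continuity policy disagree, and raising the minimum threshold causes the job to be queued rather than placed.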

Measurement Outcomes

Peak heat: Maximum temperature per zone across the simulation period.

Average heat: Mean temperature per zone across the simulation period.

Throttling proxy events: Episodes in which a node's temperature exceeds 95% of its maximum for more than 5 minutes.

Job delay: Time from submission to start, for flexible and batch jobs.

Cooling proxy cost: Integral of cooling demand above baseline across the simulation period.
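
Two of these outcomes need a little care to compute, so a sketch may help. It assumes temperature and cooling demand are sampled at a fixed interval, counts each contiguous over-limit run at most once, and approximates the integral with a rectangle rule; the function names and signatures are hypothetical.

```python
def count_throttle_events(temps, max_temp, dt_min=1.0, frac=0.95, min_minutes=5.0):
    """Count contiguous runs above frac * max_temp lasting longer than min_minutes."""
    limit = frac * max_temp
    events = 0
    run = 0.0        # minutes spent above the limit in the current run
    counted = False  # whether the current run has already been counted
    for t in temps:
        if t > limit:
            run += dt_min
            if run > min_minutes and not counted:
                events += 1
                counted = True
        else:
            run, counted = 0.0, False
    return events


def cooling_proxy_cost(demand, baseline, dt_min=1.0):
    """Rectangle-rule integral of cooling demand above baseline."""
    return sum(max(0.0, d - baseline) for d in demand) * dt_min


# Per-minute samples: a 6-minute excursion (counts) and a 5-minute one (does not).
temps = [80.0] * 3 + [90.0] * 6 + [80.0] * 2 + [90.0] * 5
events = count_throttle_events(temps, max_temp=90.0)
cost = cooling_proxy_cost([1.0, 1.5, 2.0], baseline=1.0)
```

The second excursion lasts exactly 5 minutes and is not counted, because the definition requires more than 5 minutes above the limit.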

IV. Related Work — Kubernetes and Existing Thermal-Aware Schedulers

Kubernetes scheduling supports node affinity, resource requests, and custom scheduling plugins. The kube-scheduler does not natively consider thermal state — it treats CPU and memory availability as the primary resource dimensions. Thermal-continuity routing extends this by adding thermal headroom and trend as first-class scheduling inputs. A Kubernetes scheduler plugin implementing the TCS formula is a natural implementation target for the open-source contribution.

YARN ResourceManager (Apache Hadoop) supports node label expressions and resource profiles but similarly lacks thermal continuity awareness. The thermal-continuity routing approach is implementable as a YARN node label strategy.

Published thermal-aware scheduling research (HotCarbon 2025, TAWS, and earlier data-center literature) has demonstrated that thermal-aware workload placement reduces cooling energy by 8-15% in simulation. These studies typically use instantaneous thermal state without the trend and recovery components the TCS formula introduces. Thermal-continuity routing adds two dimensions those studies lack: the thermal trend penalty (recent trajectory, not just current state) and the cooling recovery penalty (estimated time to return to baseline). The benchmark should explicitly compare against the instantaneous-state baseline used in prior work.

Rack thermal inertia. Physical GPU racks have significant thermal inertia — a rack that has been running at 85% capacity for 30 minutes will continue to heat for several minutes after workload reduction. The TCS formula's thermal trend penalty partially captures this but does not model inertia explicitly. Contributors should use published thermal time-constant values (typically 5-15 minutes for GPU rack cooling) when calibrating the simulator's thermal constants.
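
One simple way to see what a thermal time constant implies is a first-order lag, in which temperature approaches its steady-state value exponentially. This is an illustrative model chosen for this example, not the simulator's specified thermal model; the function name and the temperatures used are assumptions.

```python
import math


def temp_after(t_min: float, start_temp: float, steady_temp: float, tau_min: float) -> float:
    """First-order lag: exponential approach from start_temp to steady_temp."""
    return steady_temp + (start_temp - steady_temp) * math.exp(-t_min / tau_min)


# A rack at 60 C whose load has just been reduced, but whose new steady
# state (70 C) is still above its current temperature, keeps heating.
tau = 10.0  # minutes, within the 5-15 minute range cited above
after_one_tau = temp_after(10.0, 60.0, 70.0, tau)
after_three_tau = temp_after(30.0, 60.0, 70.0, tau)
```

After one time constant the rack has closed about 63% of the gap to its new steady state, which is why a scheduler looking only at the instantaneous reading can underestimate where a recently loaded rack is heading.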

IV.5 Hardware Degradation Modeling

The Thermal-Continuity Routing proposal addresses immediate scheduling efficiency, but thermal management has a compounding long-term consequence that strengthens the economic case: hardware degradation. This is a significant omission from most data center scheduling discussions.

GPU throttling. Modern GPUs reduce clock speed automatically when temperature thresholds are crossed. A GPU operating at 85°C may throttle to 70% of rated performance. Workloads scheduled to nodes that are near thermal limits therefore receive reduced computational throughput even if the node "appears available" in capacity terms. The TCS formula's HeatPenalty term partially addresses this — but an explicit throttling proxy should be included in the benchmark: estimated compute-hours delivered versus nominal compute-hours scheduled.
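
The delivered-versus-nominal metric could be computed along the following lines. The step-function throttling model, with a single 85°C trigger and a 70% performance factor, simply reuses the illustrative figures from the paragraph above; real GPUs throttle gradually and at vendor-specific temperatures, so both defaults should be treated as placeholders.

```python
def delivered_compute_hours(samples, throttle_temp=85.0, throttled_factor=0.7):
    """Return (delivered, nominal) compute-hours.

    samples: iterable of (hours, scheduled_fraction, temp_c) tuples, one per
    node per interval. Throttling is modeled as a step at throttle_temp.
    """
    nominal = 0.0
    delivered = 0.0
    for hours, scheduled_fraction, temp_c in samples:
        work = hours * scheduled_fraction
        nominal += work
        factor = throttled_factor if temp_c >= throttle_temp else 1.0
        delivered += work * factor
    return delivered, nominal


# One cool hour and one hot hour at full scheduled load.
delivered, nominal = delivered_compute_hours([(1.0, 1.0, 70.0), (1.0, 1.0, 90.0)])
efficiency = delivered / nominal
```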

Memory degradation. DRAM operating temperatures above 85°C accelerate soft error rates and long-term reliability degradation. GDDR6X and HBM2e memory in high-end GPUs show measurable reliability impacts from sustained high-temperature operation. Thermal-continuity routing that reduces average operating temperature extends effective hardware lifetime — a capital cost benefit that should be included in the economic modeling alongside energy savings.

SSD wear acceleration. NVMe SSDs in inference servers are sensitive to sustained high temperatures. Operating above rated temperature accelerates NAND cell wear and reduces drive lifetime. Storage nodes adjacent to high-heat GPU clusters may experience accelerated SSD degradation through thermal coupling even if not directly loaded.

Cooling system wear. Rapid thermal cycling — nodes heating quickly and cooling rapidly — imposes mechanical stress on cooling hardware including fans, heat pipes, and liquid cooling components. Systems that maintain more stable thermal profiles through continuity-aware scheduling reduce the frequency of large thermal transitions and thereby reduce cooling system wear.

Aggregate hardware lifetime impact. A rough model: a 10°C reduction in average GPU operating temperature corresponds to approximately a 2× reduction in electromigration failure rate (Arrhenius relationship, commonly cited in semiconductor reliability literature). Under the same rule of thumb, a 5°C reduction corresponds to roughly a 1.4× reduction. Over a 3-5 year hardware lifecycle, thermal-continuity routing that achieves even a 5°C reduction in average temperature could therefore extend effective hardware lifetimes, a capital cost benefit that may rival or exceed the energy savings in economic terms, although this paper does not quantify it. The open-source simulator should include a hardware lifetime proxy metric alongside the primary thermal metrics.
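
A hardware lifetime proxy based on the rule of thumb above could be as small as the sketch below. It encodes only the "2× per 10°C" approximation stated in the text, not the full Arrhenius equation with an activation energy, and the function name and doubling_c parameter are hypothetical.

```python
def failure_rate_ratio(delta_t_c: float, doubling_c: float = 10.0) -> float:
    """Relative failure rate after lowering average temperature by delta_t_c.

    Uses the rule of thumb that the rate halves for every doubling_c degrees
    of cooling. A return value of 0.5 means half the original failure rate.
    """
    return 0.5 ** (delta_t_c / doubling_c)


ratio_10c = failure_rate_ratio(10.0)  # the 10 C case from the text
ratio_5c = failure_rate_ratio(5.0)    # the 5 C case from the text
```

Under this approximation a 5°C reduction leaves about 71% of the original failure rate, which is the basis for treating even modest average-temperature reductions as economically relevant.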

Known Limitations

This section follows the Foundation's institutional practice of explicitly stating known weaknesses, failure modes, and scope boundaries for every proposal. Its presence indicates analytical maturity, not weakness in the underlying proposal.

Simulator fidelity gap. The proposed simulator uses simplified thermal proxy models. Real data center thermal behavior involves complex three-dimensional heat transfer and rack-to-rack thermal coupling that proxy models approximate at best. Results should be treated as directional indicators, not quantitative predictions for real deployments.

Scheduler instability under rapid load change. The TCS formula responds to thermal trends and may oscillate under rapidly changing load conditions. Anti-oscillation mechanisms (hysteresis, minimum dwell time) are not specified in the current proposal.
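
For readers unfamiliar with the two mechanisms named above, the sketch below shows one generic way they are often combined. It is not part of the proposal: the class name, the two thresholds, and the tick-based dwell counter are all assumptions made purely to illustrate hysteresis and minimum dwell time.

```python
class HysteresisGate:
    """Per-node eligibility flag with hysteresis and a minimum dwell time.

    A node becomes ineligible when its TCS drops below `low` and eligible
    again only when it rises above `high`; no change is allowed until the
    node has spent at least `min_dwell` ticks in its current state.
    """

    def __init__(self, low: float, high: float, min_dwell: int):
        self.low = low
        self.high = high
        self.min_dwell = min_dwell
        self.eligible = True
        self.ticks_in_state = 0

    def update(self, tcs: float) -> bool:
        """Feed one TCS sample per tick; returns current eligibility."""
        self.ticks_in_state += 1
        if self.ticks_in_state >= self.min_dwell:
            if self.eligible and tcs < self.low:
                self.eligible, self.ticks_in_state = False, 0
            elif not self.eligible and tcs > self.high:
                self.eligible, self.ticks_in_state = True, 0
        return self.eligible


gate = HysteresisGate(low=0.2, high=0.4, min_dwell=2)
history = [gate.update(score) for score in [0.1, 0.1, 0.3, 0.5, 0.5]]
```

The gap between the two thresholds prevents a node whose score hovers near a single cutoff from flipping in and out of the candidate set on every tick.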

Geographic and regulatory constraints. Thermal routing across geographically distributed data centers introduces network latency penalties. Regulatory constraints on data residency may prevent cross-jurisdiction migration even when thermally optimal.

Adversarial load shaping. A workload that misreports its thermal characteristics could receive preferential routing at the expense of honest workloads. The proposal does not include mechanisms for validating workload thermal declarations.

What This Paper Does Not Claim

That the TCS formula produces optimal scheduling in all conditions — it is a heuristic that performs well in simulation scenarios with realistic thermal parameters
That specific cooling energy savings percentages will be achieved — projected savings depend heavily on data center architecture and workload mix
That thermal-continuity routing eliminates hardware degradation — it reduces thermal stress, which is one factor in degradation among many
That the approach is production-ready — it is a near-term experimental proposal requiring simulation validation before operational deployment

Non-Adoption Scenario

Without thermal continuity awareness in workload scheduling, data centers accumulate preventable thermal stress on high-utilization nodes. Consequences include accelerated hardware replacement cycles increasing capital expenditure; increased cooling costs from reactive thermal management; more frequent throttling events reducing effective compute throughput; and higher rates of thermal-induced failures in dense GPU clusters where thermal coupling creates cascade effects invisible to node-local schedulers.

Open Questions

What thermal time constants most accurately model GPU rack cooling behavior across different cooling architectures? How should the TCS formula be calibrated for different hardware generations with different thermal envelopes? What is the minimum sensor density required for reliable TCS scoring, and what is the scheduling degradation when sensor data is delayed or missing?

Governance Implications

Thermal-continuity routing systems making autonomous workload migration decisions require governance frameworks specifying: who can override scheduling decisions; what SLA guarantees are maintained during thermal events; how scheduling logs are retained for audit; and what constitutes a thermal incident requiring human review. Automated schedulers without audit logs and override mechanisms create accountability gaps when scheduling decisions contribute to hardware failures or SLA violations.

References and Related Work

Patterson, M.K. (2008). The Effect of Data Center Temperature on Energy Efficiency. Intel Technology Journal.

Choi, J. et al. (2008). Thermal-Aware Task Scheduling at the System Software Level. ISLPED.

ASHRAE TC9.9 (2021). Thermal Guidelines for Data Processing Environments.

Lawrence Berkeley National Laboratory (2024). United States Data Center Energy Usage Report.

V. Falsifiability

The hypothesis should be considered falsified if the simulator shows any of the following:

✗ Peak heat reduction below 5% versus least-loaded baseline: the improvement is within simulation noise and does not justify the scheduling overhead.

✗ Job delay increase exceeding 15% for flexible workloads versus least-loaded baseline: thermal-continuity scheduling imposes unacceptable throughput cost.

✗ TCS formula sensitivity analysis showing results dominated by the AvailableCompute term: the thermal components add noise rather than signal, and least-loaded scheduling is equivalent in practice.
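
The first two criteria are simple enough to check automatically at the end of a benchmark run, as sketched below. The function name and arguments are hypothetical, and expressing peak heat reduction as a percentage of the least-loaded baseline's peak value is an assumption, since the text does not say what the percentage is relative to. The third criterion needs a separate sensitivity analysis and is not covered here.

```python
def falsification_reasons(peak_tcs, peak_ll, delay_tcs, delay_ll):
    """Return the falsification criteria met by a pair of benchmark runs.

    peak_*:  peak heat under thermal-continuity (tcs) and least-loaded (ll)
    delay_*: mean job delay for flexible workloads under each policy
    An empty list means neither of the two checked criteria is met.
    """
    reasons = []
    peak_reduction_pct = (peak_ll - peak_tcs) / peak_ll * 100.0
    delay_increase_pct = (delay_tcs - delay_ll) / delay_ll * 100.0
    if peak_reduction_pct < 5.0:
        reasons.append("peak heat reduction below 5%")
    if delay_increase_pct > 15.0:
        reasons.append("flexible job delay increase above 15%")
    return reasons


passing = falsification_reasons(peak_tcs=80.0, peak_ll=85.0, delay_tcs=11.0, delay_ll=10.0)
failing = falsification_reasons(peak_tcs=84.0, peak_ll=85.0, delay_tcs=12.0, delay_ll=10.0)
```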

Open Source Contribution Invitation

Build the Python simulator with configurable node count, thermal zone count, job mix parameters, and scheduling policy selection. Produce matplotlib or plotly charts for all five outcome metrics. Include an animation showing heat distribution across nodes over simulation time. Add a README with reproducible parameters and comparison against published thermal-aware scheduling baselines from the literature. Package as github.com/emfoundation/thermal-continuity-routing.

Contact: research@emfoundation.net