Twenty years ago, operating system kernels could go idle easily: in the absence of tasks, “the idle loop” would be scheduled. Early idle loops were essentially infinite loops doing nothing while waiting for the next interrupt. This saved power by leaving power-hungry components such as the cache or FPU unused.
Over time, new hardware mechanisms to reduce power were introduced. Today, the idle loop must select and deploy the best way to go idle. Entering and exiting an idle state has costs in both time and energy. Shallow idle states are nearly cost-free to enter/exit, while deeper idle states have higher costs. If a system enters a deep idle state and wakes soon after, more energy is wasted entering that state than saved while in it.
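The break-even argument can be made concrete with a small calculation. The following is a minimal sketch under a simplified model that is an assumption, not something taken from a real platform: each state draws a constant power while idle, and the deep state pays a fixed energy cost for entering and exiting. The function names and the numbers are hypothetical.

```python
def idle_energy_j(state_power_w, transition_energy_j, idle_s):
    """Energy spent over one idle period: fixed enter/exit cost plus residency cost."""
    return transition_energy_j + state_power_w * idle_s


def break_even_s(shallow_power_w, deep_power_w, deep_transition_energy_j):
    """Idle duration at which the deep state starts to save energy.

    Assumes the shallow state has no transition cost, so the deep state wins
    once its lower power draw has paid back its enter/exit energy.
    """
    saving_w = shallow_power_w - deep_power_w
    if saving_w <= 0:
        raise ValueError("the deep state must draw less power than the shallow one")
    return deep_transition_energy_j / saving_w


# Hypothetical figures: 0.5 W shallow, 0.1 W deep, 400 microjoules to enter and exit.
threshold = break_even_s(0.5, 0.1, 0.0004)
for idle_s in (0.0005, 0.01):
    shallow = idle_energy_j(0.5, 0.0, idle_s)
    deep = idle_energy_j(0.1, 0.0004, idle_s)
    better = "deep" if deep < shallow else "shallow"
    print(f"idle {idle_s * 1e3:.1f} ms: {better} state uses less energy "
          f"(break-even {threshold * 1e3:.1f} ms)")
```

In this model, an idle period shorter than the break-even point costs more energy in the deep state than in the shallow one, which is the waste described above.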
Cpuidle is the kernel subsystem managing idle state transitions. It uses drivers and governors to separate mechanism from policy. Cpuidle drivers provide the mechanism for entering and exiting idle states and describe the energy-saving properties of each state. They can be customized for each System-on-Chip (SoC) but often rely on standardized interfaces like PSCI on Arm and SBI on RISC-V. Cpuidle governors provide the policy: they select the best idle state using the driver's information together with data about historical and expected events, such as timer wakeups.
Cpuidle governors get system data from cpuidle drivers in the form of a list of idle states, each with a target residency time. This time is the minimum duration a CPU must remain in an idle state to save energy compared to shallower states. Governors track historical idle timings, visible under /sys/devices/system/cpu/cpu/cpuidle, to predict future idle periods. Governors like ladder and haltpoll rely on historical data exclusively.
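To inspect what the driver reports, the per-state files in a CPU's cpuidle directory can be read from userspace. The sketch below assumes the usual sysfs layout of one stateN subdirectory per idle state, each containing the attribute files name, residency, latency, usage and time (with times in microseconds). The function name and the returned dictionary keys are made up for this example, and the directory to read is left as a parameter.

```python
from pathlib import Path


def read_idle_states(cpuidle_dir):
    """Return one dict per stateN directory under cpuidle_dir, shallowest first."""
    state_dirs = sorted(
        Path(cpuidle_dir).glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),  # numeric sort: state2 before state10
    )
    states = []
    for d in state_dirs:
        states.append({
            "name": (d / "name").read_text().strip(),
            # Minimum stay for the state to pay off, as described above.
            "target_residency_us": int((d / "residency").read_text()),
            # Time needed to leave the state again.
            "exit_latency_us": int((d / "latency").read_text()),
            # How often the state was entered, and for how long in total.
            "usage": int((d / "usage").read_text()),
            "time_us": int((d / "time").read_text()),
        })
    return states
```

Passing the cpuidle directory of one CPU returns the states in order of increasing depth, which is also the order a governor walks them in.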
Governors like menu and teo also use a limited view of the future: the time of the next timer interrupt is known in advance, so that wakeup can often be predicted perfectly. Timer interrupts are frequent, making them reliable predictors. Governors must also consider the energy cost of decision-making itself, so for short idle periods a fast decision to enter a shallow state is desirable.
Menu and teo governors use the timer to make decisions, but with distinct strategies. The menu governor predicts sleep time using the next wake-up time with history-derived correction factors, then selects the best idle state. The teo governor bins historical information based on target residency for each idle state to predict the optimal state without predicting idle time, using the next wake-up for adjustments.
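The difference between the two strategies can be illustrated with a toy model. The sketch below is not the kernel's implementation of either governor: the single correction factor, the "most populated bin" rule, and all names and numbers are simplifying assumptions chosen only to contrast predicting a duration (menu-style) with predicting a state directly (teo-style).

```python
def deepest_state_for(states, idle_us):
    """Index of the deepest state whose target residency fits into idle_us.

    states is a list of (name, target_residency_us) ordered shallow to deep.
    """
    choice = 0
    for index, (_name, residency_us) in enumerate(states):
        if residency_us > idle_us:
            break
        choice = index
    return choice


def menu_style_select(states, next_timer_us, correction_factor):
    """Predict an idle duration first, then pick the state that fits it."""
    predicted_idle_us = next_timer_us * correction_factor
    return deepest_state_for(states, predicted_idle_us)


def teo_style_select(states, bin_hits, next_timer_us):
    """Pick a state from per-state history bins, without predicting a duration.

    bin_hits[i] counts past idle periods whose length fell between the target
    residency of state i and that of the next deeper state.
    """
    candidate = max(range(len(states)), key=lambda i: bin_hits[i])
    # The known next timer wakeup caps how deep it makes sense to go.
    return min(candidate, deepest_state_for(states, next_timer_us))


states = [("shallow", 1), ("medium", 100), ("deep", 2000)]
print(menu_style_select(states, 5000, 0.2))       # predicted 1000 us -> index 1
print(teo_style_select(states, [1, 8, 3], 5000))  # fullest bin is "medium" -> index 1
print(teo_style_select(states, [1, 2, 9], 150))   # history says deep, timer caps it -> index 1
```

Both functions return an index into the state list, so the same driver-provided table feeds either policy.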
One way to tune idle behavior for lower power is to go tickless: the scheduler tick is disabled either when the CPU goes idle (CONFIG_NO_HZ_IDLE=y) or also when there is only one task on the CPU (CONFIG_NO_HZ_FULL=y). Without a periodic tick, CPUs can stay in idle states longer. See the NO_HZ documentation for the detailed differences. When going tickless, drivers should avoid periodic polling, because each poll wakes the CPU and wastes energy.
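Which tick mode a kernel was built with can be read from its build configuration. The sketch below parses configuration text in the usual CONFIG_NAME=y format. Where that text comes from (for example, a config file shipped alongside the kernel image) is left to the caller, and the function name and return values are made up for this example. It reports only the build-time option, not what any particular CPU is doing at run time.

```python
def tick_mode(config_text):
    """Classify a kernel build configuration as 'full', 'idle' or 'periodic'."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if value == "y":
            enabled.add(key)
    if "CONFIG_NO_HZ_FULL" in enabled:
        return "full"      # tick can also stop with a single task on the CPU
    if "CONFIG_NO_HZ_IDLE" in enabled:
        return "idle"      # tick stops only when the CPU is idle
    return "periodic"      # neither tickless option was built in


sample = "CONFIG_NO_HZ_IDLE=y\n# CONFIG_NO_HZ_FULL is not set\n"
print(tick_mode(sample))  # prints "idle"
```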
Choosing a different cpuidle governor also affects power management. Menu and teo governors are available: menu predicts the idle duration from historical wake-up intervals and the next timer interrupt time, while teo uses statistical data directly to predict the best state. The active governor can be read and changed via /sys/devices/system/cpu/cpuidle/current_governor.
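Reading and switching the governor from a script might look like the following sketch. It assumes that an available_governors file sits next to current_governor in the same directory (that file is not mentioned above) and that the caller has permission to write to sysfs. The function names are made up for this example.

```python
from pathlib import Path

# Directory holding the governor control files named in the text.
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpuidle")


def current_governor(base=CPUIDLE_DIR):
    """Name of the governor currently in use."""
    return (Path(base) / "current_governor").read_text().strip()


def available_governors(base=CPUIDLE_DIR):
    """Governors built into the running kernel (assumed sibling file)."""
    return (Path(base) / "available_governors").read_text().split()


def set_governor(name, base=CPUIDLE_DIR):
    """Switch governors, refusing names the kernel does not offer."""
    offered = available_governors(base)
    if name not in offered:
        raise ValueError(f"governor {name!r} not available; choose from {offered}")
    (Path(base) / "current_governor").write_text(name)
```

Checking the requested name against the offered list first turns a typo into a clear error instead of a failed write.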
Consider Power Management Quality of Service (PM QoS) requests for scenarios where the time cost of entering and exiting an idle state matters. By registering a latency tolerance, PM QoS prevents the use of deep idle states whose exit latency exceeds that tolerance, which helps avoid missing real-time deadlines.
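From userspace, one way to register such a tolerance is the /dev/cpu_dma_latency device node: a process writes a signed 32-bit latency value in microseconds and the request stays active for as long as the file is held open. The wrapper class below is a minimal sketch. Its name and the device parameter are made up for this example, and opening the real device normally requires elevated privileges.

```python
import os
import struct


class CpuLatencyRequest:
    """Hold a CPU latency QoS request for as long as the object stays open."""

    def __init__(self, max_latency_us, device="/dev/cpu_dma_latency"):
        self.max_latency_us = max_latency_us
        self._fd = os.open(device, os.O_RDWR)
        # The request is a native-endian signed 32-bit integer, in microseconds.
        os.write(self._fd, struct.pack("=i", max_latency_us))

    def close(self):
        """Drop the request by closing the device; safe to call twice."""
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False
```

A latency-sensitive section would then be wrapped as `with CpuLatencyRequest(50): ...`, asking for wakeup latencies of at most 50 microseconds while the block runs. Tying the request to a context manager ensures it is dropped when the work finishes, so deep idle states become available again.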
By understanding cpuidle and its tuning options, such as tickless operation, the choice between the menu and teo governors, and PM QoS requests, developers can optimize systems to balance power savings against performance, taking workload and hardware limitations into account.
