Kubernetes CPU requests explained

July 25, 2020

Requesting CPU resources in a pod spec does not guarantee that a container actually gets the requested CPU time. The only way to guarantee this is to use the "static" CPU manager policy and exclusively allocate CPUs to the container.

1. How CPU requests work

In Kubernetes, a pod container specifies CPU resources as follows:

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
  - name: foo
    image: mechpen/toolbox
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "1"

The CPU unit 1 means 100% of the time of one CPU, as reported by the top command.

The CPU limit is enforced by the CFS scheduler's bandwidth quota: the processes of a container are throttled for the rest of a period once the container's CPU time reaches the limit.
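With the kubelet's default 100 ms CFS period, a limit of 1 CPU translates into a quota of 100 ms of CPU time per period. On a cgroup-v1 node this can be checked as follows (the pod and container directory names are placeholders, and the values assume the default period):

# cat /sys/fs/cgroup/cpu/kubepods/<pod_foo>/<container_foo>/cpu.cfs_period_us
100000
# cat /sys/fs/cgroup/cpu/kubepods/<pod_foo>/<container_foo>/cpu.cfs_quota_us
100000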

The CPU request is implemented using the cpu control group's cpu.shares. This post has more details on CFS and cpu.shares.

In Kubernetes, one CPU provides 1024 shares. For example, if a node has 8 allocatable CPUs, then the total number of CPU shares is 1024*8 = 8192, as shown below:

# cat /sys/fs/cgroup/cpu/kubepods/cpu.shares
8192

The container's cpu.shares is allocated from the total shares according to the CPU request in the pod manifest. In the above example, container "foo" requests 1 CPU and has a shares value of 1024:

# cat /sys/fs/cgroup/cpu/kubepods/<pod_foo>/<container_foo>/cpu.shares
1024
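The mapping is proportional for fractional requests as well. For example, a hypothetical container "bar" requesting cpu: 500m would get 500/1000 * 1024 = 512 shares (same path convention as above):

# cat /sys/fs/cgroup/cpu/kubepods/<pod_bar>/<container_bar>/cpu.shares
512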

2. Why CPU requests do not work

The above implementation may seem good enough to ensure CPU time for containers: by the CFS proportional-share rule, each container's CPU time is proportional to its scheduling weight. Container "foo" holds 1024 of the 8192 total shares, i.e. 1/8 of the weight, so it should get the equivalent of 1 CPU out of the total 8 CPUs.

However, on SMP systems, cpu.shares is not the same as the CFS weight, as explained in this post: a cgroup's shares are divided across per-CPU run queues, and nothing forces different containers onto different CPUs. For example, two containers, both requesting 1 CPU, could be scheduled on the same CPU, in which case each container gets at most 50% of one CPU.
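This effect can be reproduced outside Kubernetes with two plain cpu cgroups. A minimal sketch for a cgroup-v1 host (the demo-a/demo-b names are made up, and taskset is used to force the co-location that the load balancer might otherwise produce on its own):

# two cgroups, each holding 1024 shares, i.e. "1 CPU" of request
mkdir /sys/fs/cgroup/cpu/demo-a /sys/fs/cgroup/cpu/demo-b
echo 1024 > /sys/fs/cgroup/cpu/demo-a/cpu.shares
echo 1024 > /sys/fs/cgroup/cpu/demo-b/cpu.shares

# run one busy loop in each cgroup, both pinned to CPU 1
taskset -c 1 sh -c 'while :; do :; done' &
echo $! > /sys/fs/cgroup/cpu/demo-a/cgroup.procs
taskset -c 1 sh -c 'while :; do :; done' &
echo $! > /sys/fs/cgroup/cpu/demo-b/cgroup.procs

# top now shows each loop at about 50% of one CPU, although each
# cgroup holds a full "1 CPU" worth of shares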

3. How the static CPU manager policy works

Kubernetes provides a static CPU manager policy that can "exclusively" allocate CPUs to a container by using the cpuset cgroup. Exclusive CPUs are only granted to containers that request an integer number of CPUs in pods of the Guaranteed QoS class.
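For reference, the policy is enabled on the kubelet, not in the pod spec. A sketch of the relevant flags (the same settings can also go into the kubelet config file, and the reservation value is only an example):

# enable the static CPU manager policy; the kubelet also requires a
# CPU reservation so that some CPUs always stay in the shared pool
kubelet --cpu-manager-policy=static --kube-reserved=cpu=1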

For example, if CPU 1 is assigned to container "foo", then we have:

# cat /sys/fs/cgroup/cpuset/kubepods/<pod_foo>/<container_foo>/cpuset.cpus
1

For any other pod container, we have:

# cat /sys/fs/cgroup/cpuset/kubepods/<pod_bar>/<container_bar>/cpuset.cpus
0,2-7

Thus processes in the other containers are not scheduled on CPU 1.

4. Customizing the kubepods cgroup

The above discussions apply only within the kubepods cgroup. System processes are not under the kubepods cgroup and are not controlled by the above rules. For example, a user could ssh to the node and run a process on CPU 1, even though that CPU is "exclusively" allocated to container "foo" in Kubernetes.
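For example, a busy loop started from an ssh session is created under the system's cgroups rather than under kubepods, so nothing prevents it from being placed on the "exclusive" CPU 1 (pinning it there makes the conflict explicit):

# taskset -c 1 sh -c 'while :; do :; done'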

The problem can be solved by pre-defining the kubepods cgroup to allocate dedicated CPUs for Kubernetes pods, then passing the customized kubepods cgroup to the kubelet via the --cgroup-root option. (I didn't try this out.)
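A rough, untested sketch of this idea on a cgroup-v1 node (the /k8s name and the CPU split are made up, and system processes would additionally have to be confined away from CPUs 1-7 for the dedication to be complete):

# create a custom cgroup root holding CPUs 1-7 for Kubernetes pods;
# cpuset requires cpuset.mems to be set before tasks can join
mkdir -p /sys/fs/cgroup/cpuset/k8s
echo 1-7 > /sys/fs/cgroup/cpuset/k8s/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/k8s/cpuset.mems

# then start the kubelet with the customized cgroup root
kubelet --cgroup-root=/k8s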

5. Extra

I wrote a tool, tgtop, to help observe the CPU usage behaviors described above.