Scheduling

Course Reading

Learning objectives

  • Learn how the kube-scheduler schedules pods
  • Use labels to manage the scheduling of pods
  • Configure taints and tolerations
  • Use podAffinity and podAntiAffinity
  • Understand how to run multiple schedulers

kube-scheduler

Scheduling becomes more important as a deployment of Kubernetes grows and becomes more diverse. It's the kube-scheduler that handles this task.

Users are able to set pod priority, which allows less important pods to be evicted if need be so that a higher-priority pod can be scheduled. The scheduler keeps track of the nodes in your cluster, filters them, and then uses priority functions to determine the best node on which to schedule a pod. Once determined, the pod specification is sent to that node's kubelet to be created.
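
To illustrate pod priority, here is a minimal sketch of a PriorityClass and a pod that references it (the class name high-priority is a hypothetical example):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority              # hypothetical class name
value: 1000000                     # higher values are scheduled ahead of, and can preempt, lower-priority pods
globalDefault: false
description: "For pods that may preempt lower-priority workloads."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx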

Scheduling decisions can be influenced by using labels on either the node or pod. Labels for podAffinity, taints, and bindings allow more configuration of how a pod is assigned to a node. These labels do not have to be strict either; they can, for example, state a preference for where a Pod should be scheduled while still allowing it to run somewhere else if that preference cannot be met. We will oftentimes see required used in the name of a parameter when, in reality, it acts more like a preference. Some settings will also evict pods if their conditions are no longer met (e.g. requiredDuringExecution).

Custom schedulers can also be configured but need to be added to the cluster manually.

Filtering (predicates)

The scheduler runs through a list of filters, or predicates, to find which nodes are available, and then ranks the nodes using priority functions. The node that ranks the highest is chosen by the scheduler.

The following are the predicates used:

  • PodFitsHostPorts
  • PodFitsHost
  • PodFitsResources
  • MatchNodeSelector
  • NoVolumeZoneConflict
  • NoDiskConflict
  • MaxCSIVolumeCount
  • PodToleratesNodeTaints
  • CheckVolumeBinding

The predicates are evaluated in an order, which can be configured. This allows the scheduler to perform the fewest checks possible.

These predicates work to filter out the nodes for certain conditions. For example, the PodFitsResources would filter out any nodes that do not have the required resources for the pod pending creation.
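
For example, PodFitsResources compares a pod's resource requests against each node's remaining allocatable capacity. A pod like the following (a minimal sketch) would only pass that filter on nodes with at least 500m of CPU and 1Gi of memory unallocated:

spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"      # the scheduler filters on requests, not limits
        memory: "1Gi"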

Prior to v1.23 you could use a kind: Policy to order and give weight to certain predicates. Policies are now deprecated.

Scoring (Priorities)

Priorities are used to weight resources. Unless the SelectorSpreadPriority setting (which ranks nodes based on the number of existing running pods) has been configured with pod and node affinity, the node with the fewest pods will be chosen, which is a basic way of spreading pods across the cluster.

Different priorities are used depending on cluster needs. For example, you could use an ImageLocalityPriority to favor nodes that have a container's image already downloaded.

The list of priorities that can be used are:

  • SelectorSpreadPriority
  • InterPodAffinityPriority
  • LeastRequestedPriority
  • MostRequestedPriority
  • RequestedToCapacityRatioPriority
  • BalancedResourceAllocation
  • NodePreferAvoidPodsPriority
  • NodeAffinityPriority
  • TaintTolerationPriority
  • ImageLocalityPriority
  • ServiceSpreadingPriority
  • EqualPriority
  • EvenPodsSpreadPriority

More detail on what each of these priorities does can be found here.

Prior to v1.23 you could use a kind: Policy to specify the priorities to use in scoring. As noted before, this is deprecated.

Scheduling Policies

important

Scheduling policies were deprecated in favor of scheduler configuration as of v1.23, so newer versions will not have this capability. See more on scheduler configuration here.

By default, the scheduler has predicates and priorities it uses; these can be changed with scheduler policies however.

{
    "kind" : "Policy",
    "apiVersion" : "v1",
    "predicates" : [
        {"name" : "MatchNodeSelector", "order": 6},
        {"name" : "PodFitsHostPorts", "order": 2},
        {"name" : "PodFitsResources", "order": 3},
        {"name" : "NoDiskConflict", "order": 4},
        {"name" : "PodToleratesNodeTaints", "order": 5},
        {"name" : "PodFitsHost", "order": 1}
    ],
    "priorities" : [
        {"name" : "LeastRequestedPriority", "weight" : 1},
        {"name" : "BalancedResourceAllocation", "weight" : 1},       
        {"name" : "ServiceSpreadingPriority", "weight" : 2},
        {"name" : "EqualPriority", "weight" : 1}   
    ],
    "hardPodAffinitySymmetricWeight" : 10
}

Typically, you would use the --policy-config-file flag to pass the policy file and the --scheduler-name flag to give the scheduler a name. You'd then have two schedulers running and could specify which one to use in the pod specification.

Having multiple schedulers opens up the potential for conflicts in the allocation of pods. Each pod should declare which scheduler it should use; if it does not, and separate schedulers both determine that the same node is eligible and deploy to it, the node's resources may be exhausted and a conflict occurs. The local kubelet would then return these pods to the schedulers to be reassigned until one succeeds and another node is chosen for the other pod.

Pod specification

Most scheduling decisions are made based on what's defined in the podSpec. The fields that carry much of the weight for this are:

nodeName and nodeSelector

These options allow a Pod to be assigned to a single node or a node with a certain label.
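
As a minimal sketch (worker-1 is a hypothetical node name), nodeName pins a pod directly to a node and bypasses the scheduler entirely:

spec:
  nodeName: worker-1     # hypothetical node name; the scheduler is bypassed entirely
  containers:
  - name: nginx
    image: nginx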

affinity

Affinity and anti-affinity can be used to set a requirement or preference for which nodes to schedule to. If using a preference, the matching node is chosen first, but other nodes are used if no match is found.

tolerations

Taints are placed on nodes so pods are not scheduled there for whatever reason. Tolerations allow pods to ignore the taint and be scheduled on that node as long as other requirements for scheduling are met.

schedulerName

If you want to define the use of a custom scheduler, you could use the schedulerName field to do so.
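
A minimal sketch, assuming a second scheduler has been registered under the name my-custom-scheduler:

spec:
  schedulerName: my-custom-scheduler   # hypothetical name of a second scheduler
  containers:
  - name: nginx
    image: nginx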

Specifying the node label

The nodeSelector field is a straightforward way to target a node or set of nodes for scheduling. It uses one or more key-value labels.

For example:

spec:
  containers:
  - name: redis
    image: redis
  nodeSelector:
    net: fast

In the example we are using the nodeSelector to target any nodes with a net label and a value of fast. As a reminder, these labels and values are administrator-created and are not necessarily tied to any actual hardware properties of the node.

A pod remains in a pending state until a node is found with matching labels.

You should be able to use node affinity to do the same thing as a nodeSelector.
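
For example, a rough node affinity equivalent of the net: fast selector above (a sketch using the hard requiredDuringSchedulingIgnoredDuringExecution form) would be:

spec:
  containers:
  - name: redis
    image: redis
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: net
            operator: In
            values:
            - fast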

Scheduler profiles

The scheduler can also be configured via scheduling profiles. Profiles allow for configuration of extension points where plugins can be used. An extension point is one of the 12 stages of scheduling where a plugin can be used to modify the behavior of the scheduler. The 12 are:

  • queueSort
  • preFilter
  • filter
  • postFilter
  • preScore
  • score
  • reserve
  • permit
  • preBind
  • bind
  • postBind
  • multiPoint

Many plugins are enabled by default, and others can be enabled, to change how the scheduler chooses a node for a given podSpec. The current plugin list can be found here.

A scheduler can have multiple profiles enabled, which potentially removes the need for multiple schedulers. Each podSpec would need to declare the profile to use; otherwise the default-scheduler profile is used.

Profiles are what have replaced policies.
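
A minimal sketch of a scheduler configuration with two profiles (the profile name low-spread and the choice of plugin to disable are hypothetical examples):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: low-spread          # hypothetical second profile
  plugins:
    score:
      disabled:
      - name: PodTopologySpread      # disable one of the default scoring plugins

A podSpec would opt into the second profile by setting schedulerName: low-spread; pods that specify nothing continue to use default-scheduler.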

Pod affinity rules

An example of affinity may be desiring two pods to be co-located since they often communicate and share data.

An example of anti-affinity would be wanting two pods running on separate nodes to increase fault tolerance.

These settings are used by the scheduler to look at the pods currently running and their labels. The scheduler must interrogate each node and keep track of the labels of the pods running on it, and because of this, clusters larger than several hundred nodes can see significant performance degradation. Pod affinity rules use the In, NotIn, Exists, and DoesNotExist operators.

requiredDuringSchedulingIgnoredDuringExecution

The pod will not be scheduled to a node unless the rule is met. If the rule's conditions later become false, the pod continues to run. This can be seen as a hard setting.

preferredDuringSchedulingIgnoredDuringExecution

Similar to requiredDuringSchedulingIgnoredDuringExecution, but if no node with the desired setting is present, the pod will be scheduled anyway. This declares a preference rather than a requirement.

podAffinity

This is used to schedule pods together on the same node.

podAntiAffinity

And this is used to keep pods from being scheduled on the same node.

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: kubernetes.io/hostname   # required for each affinity term; here, co-locate by node
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname # required for each affinity term; here, spread by node

Here is an example of pod affinity where a specific label needs to be matched for the Pod to be scheduled and started. However, this label could be removed in the future and the Pod would still run.

If no node is found with another pod running with a security=S1 label, then the pod remains unscheduled and pending until the condition is met.

In the pod anti-affinity example, it is preferred that a node with no pods matching a security=S2 label be found. Because this is a preference, the pod will still run even if no such node exists, but we are able to provide a weight to influence the scheduler. Weights can be from 1 to 100.

Node affinity rules

Aside from pod to pod affinity, we can also have pod to node affinity, which is managed by nodeAffinity. It is similar to nodeSelector and should one day replace it in later versions of the API. Instead of labels on other pods, nodeAffinity uses the labels on the nodes, so there is much less performance impact than with pod affinity.

Node affinity also uses the operators In, NotIn, Exists, and DoesNotExist. requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution are also applicable for node affinity.

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: discspeed
            operator: In
            values:
            - quick
            - fast

Here in our example, our pod to be scheduled will prefer nodes that have a label with key discspeed and a value of fast or quick.

Taints

Taints are the opposite of node affinity.

To repel a pod from being deployed on a node you can set a taint. Pods will not be scheduled to that node unless they have the proper toleration, nor will the node even be considered. Taints are configured as key=value:effect. Any valid key-value label can be used for the taint.
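
For example (worker-1 is a hypothetical node name and net=slow an arbitrary key-value pair), a taint can be added and later removed with kubectl:

kubectl taint nodes worker-1 net=slow:NoSchedule

kubectl taint nodes worker-1 net=slow:NoSchedule-

The trailing dash on the second command removes the taint, a pattern used again in the lab exercises below.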

There are three kinds of effects that can be applied for a taint. For any node that has multiple taints, the scheduler ignores the ones with matching tolerations and allows the others to still take effect.

NoSchedule

A NoSchedule effect will not schedule the pod on a matching node, unless the Pod has a toleration of this type. Existing pods will continue to run regardless of toleration.

PreferNoSchedule

A "soft" version of NoSchedule. The node with this taint will be avoided if possible. It is a preference, rather than requirement.

NoExecute

This taint will evict all existing pods and not allow future pods to be scheduled, except those with the correct toleration. Pods with a tolerationSeconds set will wait the configured number of seconds before eviction. Certain issues on a node will cause Kubernetes to automatically add a 300-second toleration to prevent unnecessary evictions.

Tolerations

Tolerations can be set on a pod to negate the effect of a taint on the node.

Tolerations can take two different operators, Equal and Exists. Equal is the default if an operator is not declared.

Here are two examples of tolerations.

tolerations:
- key: "region"
  operator: "Equal"
  value: "us-east-1"
  effect: "NoExecute"
  tolerationSeconds: 3600
tolerations:
- key: "example-key"
  operator: "Exists"
  effect: "NoSchedule"

In the Equal example, the pod will tolerate a node tainted region=us-east-1 with the NoExecute effect, but only for 3600 seconds after the taint is applied, after which it is evicted.

In the Exists example, no value is given, as the Exists operator just checks for the existence of the taint key. If no key were specified, the pod would tolerate all taints with the configured effect.
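
As a minimal sketch of that last case, a toleration with the Exists operator and no key tolerates every taint with the given effect (or every taint of any kind if the effect is omitted as well):

tolerations:
- operator: "Exists"       # no key: matches every taint with this effect
  effect: "NoSchedule"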

Custom scheduler

If the affinity and taints are not flexible enough you can write a custom scheduler. This is outside the scope of the fundamentals of Kubernetes, but a good starting point is the existing scheduling code.

If a podSpec does not declare the scheduler to use, then the default is chosen. If the pod declares a scheduler that is not running, it would sit as Pending forever.

At the end of scheduling, the pod gets a binding specifying which node it should run on. Bindings are Kubernetes primitives in the api/v1 group. You could technically still schedule a pod without a scheduler running by defining a binding yourself.
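
A rough sketch of such a binding (vip and worker-1 being hypothetical pod and node names), which would be POSTed to the pending pod's binding subresource:

apiVersion: v1
kind: Binding
metadata:
  name: vip                # must match the name of the pending pod
target:
  apiVersion: v1
  kind: Node
  name: worker-1           # hypothetical node to bind the pod to

In practice this is what the scheduler itself does once it has picked a node.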

Multiple schedulers can be running at the same time. The scheduler used, along with other information, can be viewed by running kubectl get events.

Lab Exercises

Lab 12.1 - Assign pods using labels

This lab will walk through scheduling pods on specific nodes using labels. While it is normally best to let the system distribute pods for you, sometimes there may be exceptions to this. For example, you may want a pod to run on a certain node due to hardware considerations.

First, let's get our list of nodes and view their current labels and taints.

ubuntu@cp:~ $ kubectl get nodes

ubuntu@cp:~ $ kubectl describe nodes | grep -A5 -i labels

ubuntu@cp:~ $ kubectl describe nodes | grep -i taint

Find the count of how many pods are running on each node. Take note of the difference.

ubuntu@cp:~ $ kubectl describe node <cp node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl describe node <worker node name> | grep "Non-terminated Pods"

For this exercise, we will make the cp node exclusive for VIP pods, and the worker for everything else. Then verify your new labels.

ubuntu@cp:~ $ kubectl label nodes <cp node name> status=vip

ubuntu@cp:~ $ kubectl label nodes <worker node name> status=other

ubuntu@cp:~ $ kubectl get nodes --show-labels

Now, let's create a new Pod running four busybox containers. We'll use nodeSelector to choose the node we want to deploy on.

ubuntu@cp:~ $ vim vip.yaml

apiVersion: v1
kind: Pod
metadata:
  name: vip
spec:
  containers:
  - name: vip1
    image: busybox
    args:
    - sleep
    - "1000000"
  - name: vip2
    image: busybox
    args:
    - sleep
    - "1000000"
  - name: vip3
    image: busybox
    args:
    - sleep
    - "1000000"
  - name: vip4
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    status: vip

Then deploy the Pod and verify it was deployed on the cp node. You should see an additional pod running on the cp and no change on the worker.

ubuntu@cp:~ $ kubectl create -f vip.yaml

ubuntu@cp:~ $ kubectl describe node <cp node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl describe node <worker node name> | grep "Non-terminated Pods"

Delete the pod, comment out the nodeSelector lines, then recreate the pod. It should be able to schedule on either node now.

ubuntu@cp:~ $ kubectl delete pod vip

ubuntu@cp:~ $ vim vip.yaml

ubuntu@cp:~ $ kubectl create -f vip.yaml

ubuntu@cp:~ $ kubectl describe node <cp node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl describe node <worker node name> | grep "Non-terminated Pods"

Copy the vip.yaml file, replace references to vip with other and add back the nodeSelector.

ubuntu@cp:~ $ cp vip.yaml other.yaml

ubuntu@cp:~ $ sed -i s/vip/other/g other.yaml

ubuntu@cp:~ $ vim other.yaml

Then create the new pod and verify where it gets deployed.

ubuntu@cp:~ $ kubectl create -f other.yaml

ubuntu@cp:~ $ kubectl describe node <cp node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl describe node <worker node name> | grep "Non-terminated Pods"

Then delete the pods.

ubuntu@cp:~ $ kubectl delete pod vip other

Lab 12.2 - Using taints to control pod deployment

First, create a new deployment and deploy it.

ubuntu@cp:~ $ vim taint.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: taint-deployment
spec:
  replicas: 8
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.20.1
        ports:
        - containerPort: 80

ubuntu@cp:~ $ kubectl apply -f taint.yaml

Then determine where the pods for the deployment were scheduled.

ubuntu@cp:~ $ kubectl describe node <cp node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl describe node <worker node name> | grep "Non-terminated Pods"

You'll likely see that some of the pods were sent to one or the other node. Now let's delete the deployment, verify the pods are gone, and apply a taint to force where the pods will be deployed. Then we will redeploy the taint.yaml deployment.

Remember that the taints related to scheduling will have no effect on already running pods, while a NoExecute taint will evict currently running pods unless they have a matching toleration.

ubuntu@cp:~ $ kubectl delete deployment taint-deployment

ubuntu@cp:~ $ kubectl taint nodes <worker node name> bubba=value:PreferNoSchedule

ubuntu@cp:~ $ kubectl describe node | grep Taint

ubuntu@cp:~ $ kubectl apply -f taint.yaml

Verify the pods are up and running, then check where they were deployed. Once you have verified where the pods were deployed, delete the deployment again.

ubuntu@cp:~ $ kubectl get pods

ubuntu@cp:~ $ kubectl describe node <cp node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl describe node <worker node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl delete deployment taint-deployment

Now let's untaint the node and apply a new taint, this time using the NoSchedule effect. Then we will redeploy and check where the pods were scheduled.

ubuntu@cp:~ $ kubectl taint nodes <worker node name> bubba-

ubuntu@cp:~ $ kubectl taint nodes <worker node name> bubba=value:NoSchedule

ubuntu@cp:~ $ kubectl apply -f taint.yaml

ubuntu@cp:~ $ kubectl describe node <cp node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl describe node <worker node name> | grep "Non-terminated Pods"

Finally, we will use the NoExecute effect. Do the same process of untainting, then retainting, and then verify the number of pods running per node.

ubuntu@cp:~ $ kubectl taint nodes <worker node name> bubba-

ubuntu@cp:~ $ kubectl taint nodes <worker node name> bubba=value:NoExecute

ubuntu@cp:~ $ kubectl describe node <cp node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl describe node <worker node name> | grep "Non-terminated Pods"

Lastly, remove that taint and see how the pods are reshuffled. They will likely not be distributed the same as they were originally. Once done, delete the deployment for good.

ubuntu@cp:~ $ kubectl taint nodes <worker node name> bubba-

ubuntu@cp:~ $ kubectl describe node <cp node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl describe node <worker node name> | grep "Non-terminated Pods"

ubuntu@cp:~ $ kubectl delete deployment taint-deployment

Knowledge check

  • Multiple schedulers can be run at the same time
  • Labels and annotations are not used for the same purpose
  • When a node is tainted, a pod requires a toleration to be deployed to that node
  • Not all taints cause pods to not run on a node