Kubernetes Scheduling Explained: Why It’s More Than Just ‘Assigning’

Jaime Nagase
8 min read · Aug 5, 2024


Have you ever wondered why Kubernetes uses the term “scheduling” rather than simpler words like “assigning” or “allocating” when placing Pods on nodes? The choice reflects the system’s sophisticated, dynamic nature. During my own Kubernetes journey I kept asking why the Kube-scheduler is named the way it is, and why the verb is always scheduling rather than placing or assigning.

First of all, we need to understand that the Kube-scheduler is responsible for finding the best node for the next Pod in the scheduling queue. New Pods enter the cluster through the Kube-apiserver, and the scheduler learns about them through the watch API, which streams changes and triggers the scheduler into action. Let’s look at this in detail.

Understanding the Kube-Scheduler

The Kube-scheduler is the brain behind Pod placement in a cluster. It evaluates a number of factors to determine the best node for each Pod, ensuring efficient use of resources while filtering out nodes that cannot satisfy a Pod’s requirements, such as:

  • Node Affinity/Anti-Affinity: These rules dictate which nodes a pod can or cannot run on based on labels. For example, a pod might be scheduled only on nodes with SSDs or excluded from nodes used for specific purposes.
  • Taints and Tolerations: Nodes can be “tainted” to repel certain pods unless those pods have matching tolerations. This ensures precise control over pod placement.
  • Pod Affinity/Anti-Affinity: This feature allows pods to be scheduled together or kept apart, enhancing performance and resilience by co-locating related services or avoiding single points of failure (see the snippet after this list).
  • Custom Schedulers: Kubernetes supports custom schedulers that use different algorithms and policies tailored to specific needs.
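
To make the Pod Anti-Affinity idea concrete, here is a minimal sketch (the app: web label and the hostname topology key are illustrative assumptions, not taken from the example used later in this article) that keeps replicas of the same app on different nodes. It goes under a Pod’s spec:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web                          # hypothetical label identifying the replicas
        topologyKey: kubernetes.io/hostname   # at most one matching Pod per node

With this rule in place, the scheduler filters out any node that already runs a Pod labeled app: web.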

The Role of the Watch API

The Watch API in Kubernetes helps the Kube-scheduler monitor real-time changes in the cluster. It notifies the Kube-scheduler immediately when new pods are added to the scheduling queue, allowing for quick and efficient pod placement.

The Kube-scheduler starts a scheduling cycle when it learns about a pending Pod, and the Watch API is crucial for its responsiveness. By keeping the system updated with real-time changes, the Watch API ensures that pods are placed on nodes efficiently and promptly.
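
Conceptually, each notification the scheduler receives is a watch event that wraps the changed object. The API server streams these as JSON; a simplified event for a newly created Pod, rendered here in YAML for readability (field values are illustrative), looks like this:

type: ADDED                 # other event types are MODIFIED and DELETED
object:
  apiVersion: v1
  kind: Pod
  metadata:
    name: example-pod
    namespace: default
  spec:
    # nodeName is empty until the scheduler binds the Pod;
    # this is how the scheduler recognizes Pods that still need a node
    containers:
      - name: nginx
        image: nginx:1.21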

How It Works:

  1. Event Detection: The Watch API constantly monitors the cluster for changes, such as the creation of new pods or updates to node status.
  2. Event Stream: When a change is detected, the Watch API generates an event stream that continuously provides updates about the cluster’s state.
  3. Notification: The Kube-scheduler subscribes to this event stream and receives notifications about these changes.
  4. Queue Update: New pods are placed in the scheduling queue by the Kube-apiserver, which the Kube-scheduler monitors.
  5. Scheduling Decision:
  • Filtering: The Kube-scheduler first filters nodes to find those that meet the requirements of the pod, such as available resources and specific labels.
  • Scoring: The Kube-scheduler then scores the filtered nodes based on various factors like resource utilization, affinity/anti-affinity rules, and custom policies. Each node receives a score reflecting its suitability for the pod.
  6. Node Selection: The node with the highest score is selected for the pod.
  7. Binding: The Kube-scheduler then binds the pod to the selected node, updating the cluster state.
  8. Continuous Monitoring: The Watch API continues to monitor the cluster, providing real-time updates to ensure the Kube-scheduler can respond promptly to any new changes or events.

Example of the Scheduling Process in Action

To build a comprehensive picture of the scheduling process, let’s walk a YAML-defined Pod through it, looking at queueing, filtering, scoring, binding, and the role of the watch API.

Pod Scheduling Workflow Overview

The Kubernetes scheduler follows a multi-step process to determine the best node for a Pod. Here’s an expanded view of this process:

1. Queueing: When a Pod is created, it is placed into a scheduling queue.

  • Queue: The scheduling queue holds Pods that need to be scheduled. It is managed in a way that prioritizes certain Pods based on policies such as priority classes (see the sketch after this list).
  • Watch API: The Kubernetes Watch API allows the scheduler to watch for changes in the cluster, such as node availability, resource usage, and taint changes. This enables the scheduler to react to changes in real-time and reschedule Pods if necessary.
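
For example, a PriorityClass (the name and value below are arbitrary) lets certain Pods jump ahead of others in the queue:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority        # hypothetical name
value: 1000000               # higher values are scheduled first
globalDefault: false
description: "For latency-critical workloads."

A Pod opts in by setting spec.priorityClassName: high-priority.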

2. Filtering: The scheduler filters nodes that meet the Pod’s requirements (e.g., node affinity). During the filtering phase, the scheduler checks each node to see if it meets the following criteria:

  • Node Affinity: Nodes must be in the specified zones (“us-west-2a” or “us-west-2b”).
  • Taints and Tolerations: Nodes must not have taints that the Pod cannot tolerate.

Only nodes that pass all these checks move on to the scoring phase.
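
If no node passes all the checks, the Pod stays Pending and its status records why. An illustrative status (the exact message text varies by Kubernetes version and cluster) looks like:

status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable
      message: "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector."

The scheduler keeps the Pod in its queue and retries when the cluster changes.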

3. Scoring: In the scoring phase, each node that passed filtering is evaluated and assigned a score reflecting its suitability for the Pod; the scheduler uses these scores to select the most suitable node. Criteria can include:

  • Resource Availability: Nodes with more available CPU and memory might score higher.
  • Pod Affinity/Anti-affinity: Nodes might score higher if they align with Pod affinity or anti-affinity rules.
  • Custom Scoring Plugins: Kubernetes allows for custom plugins that can influence scoring based on specific requirements (see the configuration sketch after this list).
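
Scoring is configurable. As a sketch (the plugin names below are real default plugins, but the weights are arbitrary, and the API version may be v1beta3 on older clusters), a KubeSchedulerConfiguration can re-weight score plugins:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFit    # favors nodes that fit the Pod's resource requests
            weight: 2
          - name: InterPodAffinity    # honors preferred pod affinity/anti-affinity
            weight: 1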

4. Binding: The scheduler binds the Pod to the selected node. Once the best node is selected, the scheduler creates a binding object to bind the Pod to the node. This involves updating the Pod’s `spec.nodeName` field to indicate the chosen node.
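
Under the hood, the bind is expressed as a Binding object posted to the API server; the node name below is hypothetical:

apiVersion: v1
kind: Binding
metadata:
  name: example-pod          # the Pod being bound
target:
  apiVersion: v1
  kind: Node
  name: node-us-west-2a-1    # the node chosen by scoring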

5. Watching: The scheduler watches for changes that might affect scheduling decisions. The Watch API allows the scheduler to continuously monitor the state of the cluster. Key changes that the scheduler watches for include:

  • Node Changes: Addition or removal of nodes, changes in node conditions, or resource availability.
  • Pod Changes: Changes in Pod status, creation of new Pods, or deletion of existing Pods.
  • Taints and Labels: Updates to taints and labels on nodes that might affect scheduling decisions.

The scheduler reacts to these changes by re-evaluating pending Pods in the queue and making new scheduling decisions as needed.
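
For example, a node becoming unhealthy would surface as a MODIFIED watch event (simplified and hypothetical), prompting the scheduler to stop considering that node and to re-evaluate pending Pods:

type: MODIFIED
object:
  apiVersion: v1
  kind: Node
  metadata:
    name: node-us-west-2a-1   # hypothetical node name
  status:
    conditions:
      - type: Ready
        status: "False"       # the node is no longer ready to accept Pods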

Let’s go through the scheduling steps for the provided Pod:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "topology.kubernetes.io/zone"
                operator: In
                values:
                  - us-west-2a
                  - us-west-2b
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "experimental"
      effect: "NoSchedule"
    - key: "critical"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 3600
  containers:
    - name: nginx
      image: nginx:1.21

Step-by-Step Scheduling:

1. Queueing:

  • Action: The example-pod is added to the scheduling queue.
  • Details: When a pod is created, it is initially placed in a scheduling queue where it waits for the scheduler to process it. Using the Watch API, the scheduler promptly learns about new pods and adds them to this queue for processing.

2. Filtering:

a) Node Affinity:

  • Action: The scheduler filters nodes to ensure they are in us-west-2a or us-west-2b.
  • Details: The nodeAffinity rule specifies that the pod can only be scheduled on nodes within the specified zones (us-west-2a or us-west-2b). Nodes outside these zones are filtered out.

b) Taints and Tolerations:

  • Action: The scheduler checks for taints that the pod cannot tolerate.
  • Details: The tolerations section allows the pod to tolerate specific taints on nodes (see the node sketch below):
  • The first toleration allows the pod to be scheduled on nodes carrying the taint dedicated=experimental with the effect NoSchedule.
  • The second toleration allows the pod to run on nodes with a critical taint whose effect is NoExecute; if such a taint is present, the pod may remain on the node for up to 3600 seconds (1 hour) before being evicted.
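
For reference, node taints matching these tolerations would look like this on a Node object (node name hypothetical):

apiVersion: v1
kind: Node
metadata:
  name: node-us-west-2b-2     # hypothetical node name
spec:
  taints:
    - key: dedicated
      value: experimental
      effect: NoSchedule      # matched by the Pod's first toleration
    - key: critical
      effect: NoExecute       # matched by the second toleration (for up to 3600s)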

3. Scoring:

  • Action: Nodes passing the filter are scored.
  • Details: Nodes that meet the affinity and toleration requirements are evaluated based on various criteria such as resource availability (CPU, memory), affinity rules, and other policies configured in the cluster. The scoring process helps determine the most suitable node for the pod.

4. Binding:

  • Action: The scheduler binds the example-pod to the highest-scoring node.
  • Details: Once the best node is selected, the scheduler updates the pod’s spec.nodeName to the name of the chosen node, effectively binding the pod to that node.

5. Watching:

  • Action: The scheduler uses the Watch API to monitor changes.
  • Details: The scheduler continuously monitors the state of the cluster, including any changes that might affect the scheduling of this pod or others. This includes changes in node conditions, resource availability, and taints.

By following this process, Kubernetes ensures that the example-pod is placed on a suitable node that meets both the affinity and toleration requirements, while also optimizing for cluster resource usage and efficiency.

Conclusion: Why “Scheduling” and Not “Assigning”?

Now, let’s imagine a cluster running hundreds of microservices that serve internal and external users as well as other applications. In this scenario, the Kube-scheduler plays a crucial role: it monitors the queue and runs the entire scheduling process over and over, an enormous number of times each day, for every application in the cluster.

All of these tasks must be well organized and prioritized within the queue. By default the Kube-scheduler works through the queue in FIFO (First-In-First-Out) order, but a higher-priority Pod goes first. Kubernetes, with its event-driven architecture, benefits greatly from using a queue to determine the best node for the next Pod placement. This event-driven approach provides several advantages, such as improved efficiency and optimal resource allocation.

Many of Kubernetes’ strengths flow from this event-driven, queue-based design. Here are some ways it manifests:

  1. Decoupling and Modularity: Kubernetes is highly modular, with components like the API server, scheduler, and controller manager operating decoupled from one another. The event-driven architecture allows these components to communicate efficiently through events, improving system modularity and flexibility.
  2. Scalability: Events allow Kubernetes to scale efficiently. When new pods are created or nodes are added/removed, events are generated and processed by relevant components, enabling the cluster to adjust its capacity dynamically and automatically.
  3. Resilience and Fault Recovery: The event-driven architecture enables Kubernetes to respond quickly to failures. For example, if a node fails, events are generated and processed to reschedule the affected pods on other nodes, ensuring high availability and automatic recovery.
  4. Asynchronous Processing: Kubernetes uses events for asynchronous task processing. This means that instead of waiting for each operation to complete, components can continue functioning and respond to new events, improving efficiency and responsiveness.
  5. Observability and Monitoring: Kubernetes generates a vast number of events that can be used for monitoring and observability. Monitoring tools (for example, Prometheus paired with suitable exporters) can collect and analyze this information to provide detailed insights into the state and performance of the cluster.
  6. State Management: Controllers in Kubernetes are responsible for ensuring that the desired state of the system (as specified in resource definitions) matches the current state. They do this by reacting to events, such as the creation or modification of resources, and taking necessary actions to align the states.
  7. Flexibility and Extensibility: The event-driven architecture facilitates extensibility. New components or controllers can be added to Kubernetes to react to specific events and implement new functionalities without modifying existing components.

In summary, utilizing an event-driven architecture is fundamental to many of Kubernetes’ advanced capabilities, including scalability, resilience, modularity, and operational efficiency. This enables Kubernetes to manage complexity and dynamism effectively in modern infrastructure environments.

References

Kubernetes Scheduler

Scheduling Framework

Using Watch with the Kubernetes API

Kubernetes API Concepts (general concepts of the Kubernetes API, including resource-based interfaces, watches, and efficient change notifications)
