= Kubernetes Nodepool Scheduling =
**Summary**: This wiki page shows how I configure my AKS nodepools and migrate pods between nodepools if needed. \\
**Date**: 2 January 2026 \\
{{tag>kubernetes azure}}
I would like to start by explaining what nodepools are, especially in Azure Kubernetes Service (AKS). Sometimes, however, the [[https://learn.microsoft.com/en-us/azure/aks/use-system-pools |documentation]] already says it best:
> In Azure Kubernetes Service (AKS), nodes of the same configuration are grouped together into node pools. Node pools contain the underlying VMs that run your applications. System node pools and user node pools are two different node pool modes for your AKS clusters. System node pools serve the primary purpose of hosting critical system pods such as CoreDNS and metrics-server. User node pools serve the primary purpose of hosting your application pods.
== Nodepool Pod Scheduling Management ==
Pod scheduling in Kubernetes is managed (among other mechanisms) using taints and tolerations. Taints are applied to nodes and allow a node to repel a set of pods unless those pods have a matching toleration. Tolerations are applied to pods and allow (but do not require) the pods to be scheduled onto nodes with matching taints. On AKS, labels are also an important part of the scheduling process.
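As a minimal sketch of the mechanism (the {{{workload=batch}}} taint, the node name and the pod below are purely hypothetical), a taint on a node and the matching toleration on a pod look roughly like this:
<code yaml>
# Hypothetical node taint, normally applied by the platform or with
# 'kubectl taint nodes example-node workload=batch:NoSchedule'.
apiVersion: v1
kind: Node
metadata:
  name: example-node
spec:
  taints:
  - key: "workload"
    value: "batch"
    effect: "NoSchedule"
---
# A pod with a matching toleration is allowed (but not forced) onto that node.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
</code>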
\\
I usually try to keep it simple, by using these directives for nodepools:
* Each nodepool has a label that indicates its mode: system or user.
* If there is just one user nodepool, it has no taints.
* If there are multiple user nodepools, each //additional// nodepool has a taint that indicates the purpose of that nodepool.
Additionally, I use the following application (pod) scheduling directives:
* Each application has an affinity for the mode of nodepool it should be scheduled on (system or user).
* If there are multiple user nodepools, each application that needs to get scheduled on one of the additional nodepools also gets a toleration that matches the taint of the nodepool it should be scheduled on.
> Note: The number of nodepools should be kept low, because each nodepool will have a node that is not used to its maximum capacity, adding costs (and complexity).
== Nodepool Setup ==
The nodepools below show an example of this approach:
* System
  * Label: kubernetes.azure.com/mode:system
  * Taint: CriticalAddonsOnly=true:NoSchedule
* npusrdefault
  * Label: kubernetes.azure.com/mode:user
  * Taints: none
* npmobileapp
  * Label: kubernetes.azure.com/mode:user
  * Taint: pool=mobile:NoSchedule
* nprisk
  * Label: kubernetes.azure.com/mode:user
  * Taint: pool=risk:NoSchedule
> Note that the name of a node pool can only contain lowercase alphanumeric characters and must begin with a lowercase letter. For Linux node pools, the length must be between 1-12 characters. For Windows node pools, the length must be between 1-6 characters.
With the nodepools above, the setup is that system pods (like CoreDNS) get scheduled on the system nodepool, and all other pods get scheduled on the npusrdefault nodepool unless they have a toleration for either the npmobileapp or nprisk nodepool.
=== System ===
So, to make sure a pod gets scheduled on the system pool, we set the following nodeAffinity rule:
<code yaml>
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/mode
          operator: In
          values:
          - system
</code>
This rule makes sure the pod only gets scheduled on nodes that have the label {{{kubernetes.azure.com/mode=system}}}, which is only true for the system nodepool. But we also need to set a toleration, because the system nodepool has a taint:
<code yaml>
tolerations:
- key: "CriticalAddonsOnly"
  operator: Exists
</code>
Combined, these settings make sure the pod gets scheduled on the system nodepool.
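To put it together, a complete pod spec that combines both settings could look like the sketch below (the pod name and image are hypothetical):
<code yaml>
# Hypothetical pod that targets the system nodepool: the nodeAffinity
# restricts it to mode=system nodes and the toleration lets it past
# the CriticalAddonsOnly taint.
apiVersion: v1
kind: Pod
metadata:
  name: system-component
spec:
  containers:
  - name: app
    image: myregistry.example.com/system-component:1.0
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.azure.com/mode
            operator: In
            values:
            - system
  tolerations:
  - key: "CriticalAddonsOnly"
    operator: Exists
</code>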
=== User Nodepools ===
The user nodepools require the same setup, but with different values. First we need a nodeAffinity rule that makes sure the pod only gets scheduled on user nodepools. Depending on your preference, you can use a 'NotIn' or an 'In' operator:
<code yaml>
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/mode
          operator: NotIn
          values:
          - system
</code>
<code yaml>
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/mode
          operator: In
          values:
          - user
</code>
Either of these rules will make sure the pods are only scheduled on user nodepools. However, depending on the nodepool you want the pod to be scheduled on, you also need to set a toleration:
<code yaml>
tolerations:
- key: "pool"
  operator: "Equal"
  value: "mobile"
  effect: "NoSchedule"
</code>
> Note: Change the value to {{{risk}}} for the nprisk nodepool.
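Put together, a pod spec for the npmobileapp nodepool combines the user-mode affinity with the matching toleration, roughly like the sketch below (the pod name and image are hypothetical):
<code yaml>
# Hypothetical pod intended for the npmobileapp nodepool: the affinity keeps
# it off the system nodepool and the toleration allows it onto nodes
# tainted with pool=mobile:NoSchedule.
apiVersion: v1
kind: Pod
metadata:
  name: mobile-backend
spec:
  containers:
  - name: app
    image: myregistry.example.com/mobile-backend:1.0
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.azure.com/mode
            operator: In
            values:
            - user
  tolerations:
  - key: "pool"
    operator: "Equal"
    value: "mobile"
    effect: "NoSchedule"
</code>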
=== Default User Nodepool ===
The default user nodepool does not have a taint, so any pod can always be scheduled on it. I prefer this because I favor uptime over control. This is however a personal preference, and different use cases might require different setups. Note that this means that even when a pod has a toleration for one of the //additional// user nodepools, it can still be scheduled on the default user nodepool, for example when the additional user nodepool is full or not available.
=== Additional Affinity ===
If you need to prevent pods from getting scheduled on the default user nodepool, additional affinity rules are required. In AKS, the {{{kubernetes.azure.com/agentpool}}} label can be used for this purpose:
<code yaml>
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/agentpool
          operator: In
          values:
          - npmobileapp
</code>
This, however, reduces flexibility in case of migrations, upgrades, or new naming conventions. This can be dealt with by adding more values, like this:
<code yaml>
# Migrating from the npmobileapp1 nodepool to the npmobileapp2 nodepool
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/agentpool
          operator: In
          values:
          - npmobileapp1
          - npmobileapp2
</code>