= Upgrading AKS using Terraform =
**Summary**: In this post I'll show you how to upgrade your AKS cluster using Terraform. \\
**Date**: 23 February 2025 \\
{{tag>kubernetes terraform azure}}
Before starting an upgrade, it's always good to gather some information about the current state of the cluster, as well as about the new version and possible problems we might run into. Let's start with some links to the release notes and then continue with some commands to gather information.
\\
* [[https://kubernetes.io/releases/ |Kubernetes releases]]
* [[https://learn.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar |AKS Kubernetes release calendar]]
* [[https://github.com/Azure/AKS/releases |AKS Release Notes]]
It is also important to know that, when following this post, we will upgrade the following components in this order:
# Control plane
# System nodepool
# User nodepool
== Getting Info ==
Once you've checked the versions and release notes and you're sure you want to continue, you can start with the following checks: \\
\\
First, we check which versions are available in our region. The output of the following command can be shown as a nice table, so we can easily see which version we want to upgrade to.
az aks get-versions --location westeurope --output table
KubernetesVersion Upgrades SupportPlan
------------------- --------------------------------------------------------------------------------- --------------------------------------
1.31.3 None available KubernetesOfficial
1.31.2 1.31.3 KubernetesOfficial
1.31.1 1.31.2, 1.31.3 KubernetesOfficial
1.30.7 1.31.1, 1.31.2, 1.31.3 KubernetesOfficial, AKSLongTermSupport
1.30.6 1.30.7, 1.31.1, 1.31.2, 1.31.3 KubernetesOfficial, AKSLongTermSupport
1.30.5 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3 KubernetesOfficial, AKSLongTermSupport
1.30.4 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3 KubernetesOfficial, AKSLongTermSupport
1.30.3 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3 KubernetesOfficial, AKSLongTermSupport
1.30.2 1.30.3, 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3 KubernetesOfficial, AKSLongTermSupport
1.30.1 1.30.2, 1.30.3, 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3 KubernetesOfficial, AKSLongTermSupport
1.30.0 1.30.1, 1.30.2, 1.30.3, 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3 KubernetesOfficial, AKSLongTermSupport
> Note that the output is shortened for readability.
Now that we know which versions are available, we can check the possible upgrades for our specific cluster:
az aks get-upgrades --resource-group rg-privatecluster --name aks-privatecluster --output table
Name ResourceGroup MasterVersion Upgrades
------- ----------------- --------------- ------------------------------------------------------------------------------------
default rg-privatecluster 1.27.7 1.28.0, 1.28.3, 1.28.5, 1.28.9, 1.28.10, 1.28.11, 1.28.12, 1.28.13, 1.28.14, 1.28.15
Check the available upgrades. Our current version is 1.27, so we can only go to 1.28; it's not possible to skip minor versions.
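The table above only shows the control plane version (MasterVersion). The node pools track their own orchestrator version, which you can list separately. A small sketch, using the same resource group and cluster name:
# List the current orchestrator version of each node pool
az aks nodepool list --resource-group rg-privatecluster --cluster-name aks-privatecluster \
  --query "[].{name:name, version:orchestratorVersion}" --output table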
Now we need one more check: pod disruption budgets (PDBs). These can block the upgrade process, so it's good to review them before starting the upgrade:
azadmin@vm-jumpbox:~$ kubectl get pdb -A
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
postgresql 1 N/A 1 199d
postgresql-primary 1 N/A 0 199d
The second PDB will block the upgrade, as it allows no disruptions (ALLOWED DISRUPTIONS is 0).
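If you would rather fix the blocking PDB than force the upgrade later on (both options are covered in this post), you can temporarily relax it. A minimal sketch, assuming the dev namespace that shows up in the error message later in this post:
# Temporarily allow disruptions on the blocking PDB
kubectl patch pdb postgresql-primary -n dev --type merge -p '{"spec":{"minAvailable":0}}'
# Restore the original value once the upgrade is done
kubectl patch pdb postgresql-primary -n dev --type merge -p '{"spec":{"minAvailable":1}}'
Keep in mind that a PDB managed by a Helm chart or an operator may be reconciled back to its original value, in which case forcing the upgrade (see below) is the easier route.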
As a last check, we will note the current node status:
azadmin@vm-jumpbox:~$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-system-34974014-vmss000000 Ready agent 271d v1.27.7 172.16.48.103 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-system-34974014-vmss000001 Ready agent 271d v1.27.7 172.16.48.5 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-system-34974014-vmss000002 Ready agent 270d v1.27.7 172.16.48.54 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss000001 Ready agent 270d v1.27.7 172.16.64.4 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss000002 Ready agent 270d v1.27.7 172.16.64.102 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss00001r Ready agent 200d v1.27.7 172.16.64.53 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss00001t Ready agent 179d v1.27.7 172.16.64.249 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss00001u Ready agent 176d v1.27.7 172.16.65.42 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss00001v Ready agent 172d v1.27.7 172.16.64.151 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss00001x Ready agent 136d v1.27.7 172.16.65.91 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss00001y Ready agent 31d v1.27.7 172.16.65.189 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss00001z Ready agent 31d v1.27.7 172.16.65.140 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss000020 Ready agent 8d v1.27.7 172.16.64.200 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
As you can see, it's been a while since the last upgrade. Now we can start with the actual upgrade.
== Upgrade Using Terraform ==
We have some modules in place that manage the AKS cluster. Both the cluster and the node pools use the same variable for the version, so we only need to change it in one place.
# Variables file
# Azure Kubernetes Service (AKS)
kubernetes_version = "1.27.7"
# Module aks, shortened for readability
resource "azurerm_kubernetes_cluster" "aks_cluster" {
name = var.name
location = var.location
resource_group_name = var.resource_group_name
node_resource_group = var.node_resource_group_name
kubernetes_version = var.kubernetes_version
default_node_pool {
name = var.default_node_pool_name
orchestrator_version = var.kubernetes_version
}
}
# Module node_pool, shortened for readability
resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
kubernetes_cluster_id = var.kubernetes_cluster_id
name = var.name
orchestrator_version = var.orchestrator_version
}
Now we can change the version in the variables file:
# Variables file
# Azure Kubernetes Service (AKS)
kubernetes_version = "1.28.15"
== Timeout Considerations ==
Upgrading an AKS cluster can be time-consuming, especially on larger clusters. When doing all upgrades at once as shown above, an upgrade of a cluster with about 10 nodes can take anywhere from one to several hours. In our case we were running the Terraform upgrade from an Azure DevOps pipeline, so we had to change multiple timeouts to ensure a smooth upgrade.
=== Azure DevOps Pipeline Timeouts ===
> Note that changing the timeouts in an Azure DevOps pipeline either requires a paid offering or a self-hosted agent.
There are two timeouts to set: the timeout of the job and the timeout of the task. I prefer a timeout of 0 (unlimited) for the job, and an appropriate value for the task. In the example below I've set the task timeout to 3 hours (180 minutes), which should be enough for most upgrades.
- stage: terraform_plan_apply
  displayName: 'Terraform Plan or Apply'
  jobs:
    - job: terraform_plan_apply
      displayName: 'Terraform Plan or Apply'
      timeoutInMinutes: 0
      steps:
        - task: AzureCLI@2
          displayName: 'Terraform Apply'
          timeoutInMinutes: 180
          inputs:
            azureSubscription: '$(backendServiceArm)'
            scriptType: 'bash'
            scriptLocation: 'inlineScript'
            inlineScript: |
              terraform apply \
                -var-file=env/dev.tfvars \
                -compact-warnings \
                -input=false \
                -auto-approve
            workingDirectory: $(workingDirectory)
> Note that the pipeline example has tasks removed for readability. See [[terraformazuredevops]] for various working examples of Azure DevOps pipelines for Terraform.
=== Terraform Timeout ===
Terraform also has timeouts, which can be changed for some resources. The [[https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster_node_pool#timeouts |Terraform registry]] shows whether a resource's timeouts can be configured. In our case, the user node pool could take a long time, so we've set the timeouts here as well.
# Module node_pool, shortened for readability
resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
kubernetes_cluster_id = var.kubernetes_cluster_id
name = var.name
orchestrator_version = var.orchestrator_version
timeouts {
create = "2h"
update = "2h"
}
}
> This prevents the Terraform error {{{polling after CreateOrUpdate: context deadline exceeded}}}.
== Pod Disruption Budgets ==
As mentioned before, pod disruption budgets can block the upgrade process. If you have a PDB that blocks the upgrade, you can either delete it or change its settings. In our case, a PDB blocked the upgrade and we got the following Terraform error:
"message": "Upgrade is blocked due to invalid Pod Disruption Budgets (PDBs). Please review the PDB spec to allow disruptions during upgrades. To bypass this error, set forceUpgrade in upgradeSettings.overrideSettings. Bypassing this error without updating the PDB may result in drain failures during upgrade process. Invalid PDBs details: 1 error occurred:\n\t* PDB dev/postgresql-primary has minAvailable(1) \u003e= expectedPods(1) can't proceed with put operation\n\n",
In our case, we decided to force the upgrade. This is done by setting a temporary upgrade override using the Azure CLI:
az aks update --name aks-privatecluster --resource-group rg-privatecluster --enable-force-upgrade --upgrade-override-until 2025-02-24T18:00:00Z
You can check these settings by querying the cluster using the Azure CLI:
azadmin@vm-jumpbox:~$ az aks show --resource-group rg-privatecluster --name aks-privatecluster --query upgradeSettings
{
  "overrideSettings": {
    "forceUpgrade": true,
    "until": "2025-02-24T18:00:00+00:00"
  }
}
> Note that the Microsoft documentation is a bit confusing. The [[https://learn.microsoft.com/en-us/azure/aks/stop-cluster-upgrade-api-breaking-changes |docs here]] got two commands mixed up. The [[https://docs.azure.cn/en-us/aks/stop-cluster-upgrade-api-breaking-changes |docs here]] show that you need to set the upgrade override using the az aks update command, after which you can continue with the upgrade, which we're doing with Terraform.
== Terraform Plan ==
The output of terraform plan will depend on your environment. Ideally, it will only show the upgrade of the cluster and the node pools. However, some Azure resources might depend on the cluster and need a refresh or update, which can cause a lot of changes in the plan:
Plan: 19 to add, 5 to change, 19 to destroy.
In our case, this was because we had enabled workload identity on the cluster, and the extra changes were due to the workload identities and their federated credentials. Normally this all goes well, so you can continue. We did have one issue with a federated credential, which is covered below under Troubleshooting.
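If the number of planned destroys makes you nervous, you can save the plan and list exactly which resources would be destroyed or replaced before approving anything. A sketch, assuming jq is available; the plan file name is just an example:
# Save the plan, then list every resource address the plan would destroy or replace (requires jq)
terraform plan -var-file=env/dev.tfvars -out=upgrade.tfplan
terraform show -json upgrade.tfplan | jq -r '.resource_changes[] | select(.change.actions | index("delete")) | .address'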
== Terraform Apply ==
During the terraform apply, the control plane is upgraded first, followed by the system node pool and then the user node pool. You can monitor the upgrade by checking the node status:
azadmin@vm-jumpbox:~$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-system-34974014-vmss000000 Ready agent 76m v1.28.15 172.16.48.103 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-system-34974014-vmss000001 Ready agent 73m v1.28.15 172.16.48.5 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-system-34974014-vmss000002 Ready agent 67m v1.28.15 172.16.48.54 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss000001 Ready agent 54m v1.28.15 172.16.64.4 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss000002 Ready agent 43m v1.28.15 172.16.64.102 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss00001r Ready agent 38m v1.28.15 172.16.64.53 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss00001t Ready agent 33m v1.28.15 172.16.64.249 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss00001u Ready agent 27m v1.28.15 172.16.65.42 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss00001v Ready agent 12m v1.28.15 172.16.64.151 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss00001x Ready agent 7m51s v1.28.15 172.16.65.91 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss00001y Ready agent 4m2s v1.28.15 172.16.65.189 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss00001z Ready agent 25s v1.28.15 172.16.65.140 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
aks-user-17016665-vmss000020 Ready,SchedulingDisabled agent 8d v1.27.7 172.16.64.200 Ubuntu 22.04.4 LTS 5.15.0-1061-azure containerd://1.7.15-1
aks-user-17016665-vmss000022 Ready agent 62m v1.28.15 172.16.65.238 Ubuntu 22.04.5 LTS 5.15.0-1079-azure containerd://1.7.25-1
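In the listing above, aks-user-17016665-vmss000020 is cordoned (Ready,SchedulingDisabled) while it is being drained. If you want to see which pods still have to move off that node, a quick check (node name taken from the output above):
# Show the pods still running on the node that is currently being drained
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=aks-user-17016665-vmss000020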
You can also check the status of the upgrade using the Azure CLI:
azadmin@vm-jumpbox:~$ az aks nodepool show --resource-group rg-privatecluster --cluster-name aks-privatecluster --name user --output table
Name OsType KubernetesVersion VmSize Count MaxPods ProvisioningState Mode
------ -------- ------------------- --------------- ------- --------- ------------------- ------
user Linux 1.28.15 Standard_D4s_v3 11 50 Upgrading User
Once the upgrade is finished, all nodes run the new version and the node pool's ProvisioningState is Succeeded:
azadmin@vm-jumpbox:~$ az aks nodepool show --resource-group rg-privatecluster --cluster-name aks-privatecluster --name user --output table
Name OsType KubernetesVersion VmSize Count MaxPods ProvisioningState Mode
------ -------- ------------------- --------------- ------- --------- ------------------- ------
user Linux 1.28.15 Standard_D4s_v3 10 50 Succeeded User
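Since the apply can run for a long time, it can be handy to poll the node pool state from the jumpbox instead of watching the pipeline logs. A small sketch, using the same resource names as above:
# Poll until the user node pool is no longer upgrading (ends on Succeeded or Failed)
state="Upgrading"
while [ "$state" = "Upgrading" ]; do
  sleep 60
  state=$(az aks nodepool show --resource-group rg-privatecluster --cluster-name aks-privatecluster \
    --name user --query provisioningState --output tsv)
done
echo "User node pool state: $state"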
== Troubleshooting ==
As mentioned before, we had an issue with a federated credential, probably caused by the timeouts we encountered. Once the upgrade was done, we verified everything by running a terraform plan, which reported that one of the federated credentials was missing; when running the apply, however, it said the resource already existed. We fixed this by importing the resource:
# Import federated credential
terraform import -var-file=env/dev.tfvars \
'module.federated_identity_credentials["fc-grafana"].azurerm_federated_identity_credential.federated_identity_credential' \
/subscriptions/30b3c71d-a123-a123-a123-abcd12345678/resourceGroups/rg-privatecluster/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-workload-grafana/federatedIdentityCredentials/fc-grafana
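The resource ID used in the import doesn't have to be typed by hand; you can look it up with the Azure CLI. A sketch, using the managed identity and resource group from the import command above:
# List the federated credential resource IDs on the managed identity
az identity federated-credential list \
  --identity-name id-workload-grafana \
  --resource-group rg-privatecluster \
  --query "[].id" --output tsv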