Wednesday, 22 April 2026

Handle Nginx Gateway Certificate Refresh While Inplace Upgrade in AKS

In the post "High Availability Deployment of Nginx Gateway Fabric Replacing Retired Ingress Nginx in AKS - Part 2 - Deploy Nginx-Gateway-Fabric" we discussed how to set up nginx gateway in AKS. That approach works fine for the first install, and when you use true blue-green deployment with a fresh AKS cluster. However, when we run components such as Elasticsearch on AKS (we will discuss how to set up Elasticsearch on AKS in future posts), we have to use in-place AKS upgrades, with new node pools in the same cluster, because we want to persist the data in Elasticsearch. With such in-place AKS upgrades we have to upgrade cert-manager and nginx gateway in place as well. When we do such upgrades to cert-manager and nginx gateway, we run into the issue described below.

The Issue

Immediately after the upgrade, or after some time, the data plane pods of nginx gateway run into a high CPU situation and try to create new pods. These pods cannot start properly because they cannot validate the generated certificates. Ideally, the control plane (operator) of nginx gateway should handle this situation. However, it does not do so properly. The data plane pods log errors like the ones below.




2026-04-22T12:45:12.795Z        warn    grpc@v1.79.3/clientconn.go:1525 [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.38.165:443", ServerName: "ngf-nginx-gateway-fabric.nginx-gateway.svc", }. Err: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"nginx-gateway\")"       {"resource": {"service.instance.id": "bec5c40e-a45a-4d3c-b8d0-81f61c60da99", "service.name": "otel-nginx-agent", "service.version": "v3.8.0"}, "grpc_log": true}
time=2026-04-22T12:45:24.582Z level=ERROR msg="Failed to create connection" error="rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \\\"crypto/rsa: verification error\\\" while trying to verify candidate authority certificate \\\"nginx-gateway\\\")\"" correlation_id=0c9ccc17-3e49-11f1-a1b1-6eee5e567dc2 server_type=command
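The "certificate signed by unknown authority" error suggests that the agent in the data plane trusts a different CA than the one that signed the control plane certificate. One way to check this is to compare the CA fingerprints stored in the secrets. The sketch below is not part of the original script: the helper names Get-Sha256Hex and Show-NginxGatewayCaFingerprints are hypothetical, and it assumes that each secret stores its CA under a ca.crt key, which may differ in your nginx gateway fabric version.

```powershell
# Hypothetical helper: SHA-256 fingerprint (hex) of base64-encoded secret data.
function Get-Sha256Hex {
    param([Parameter(Mandatory)][string] $Base64Text)
    $bytes = [System.Convert]::FromBase64String($Base64Text)
    $sha256 = [System.Security.Cryptography.SHA256]::Create()
    try {
        return -join ($sha256.ComputeHash($bytes) | ForEach-Object { $_.ToString('x2') })
    }
    finally {
        $sha256.Dispose()
    }
}

# Hypothetical helper: prints one CA fingerprint per secret so they can be compared.
function Show-NginxGatewayCaFingerprints {
    param(
        [string[]] $SecretNames = @('server-tls', 'agent-tls', 'selfhost-apps-gateway-nginx-agent-tls'),
        [string] $Namespace = 'nginx-gateway'
    )
    foreach ($secretName in $SecretNames) {
        # Assumption: the CA certificate is stored under the 'ca.crt' key.
        $caBase64 = kubectl get secret $secretName -n $Namespace -o jsonpath='{.data.ca\.crt}'
        if ([string]::IsNullOrWhiteSpace($caBase64)) {
            Write-Host ("{0}: no ca.crt found" -f $secretName)
        }
        else {
            Write-Host ("{0}: {1}" -f $secretName, (Get-Sha256Hex -Base64Text $caBase64))
        }
    }
}
```

Calling Show-NginxGatewayCaFingerprints while the issue is occurring would be expected to print differing fingerprints if a CA mismatch is the cause. The secret name selfhost-apps-gateway-nginx-agent-tls depends on the name you give to the gateway.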

The Solution

We have to ensure that, once the upgrade is done, the current certificates (secrets in AKS) used by nginx gateway are deleted. When we do this, the operator automatically creates new certificates. Then we should restart the control plane and the data plane, so that both use the new secrets. We can modify the script we used in the post "High Availability Deployment of Nginx Gateway Fabric Replacing Retired Ingress Nginx in AKS - Part 2 - Deploy Nginx-Gateway-Fabric" and add the part below to ensure that the certificates are deleted and recreated, and that the control plane and data plane pods are restarted after the certificates are recreated.

#region Nginx-Gateway with Nginx-Gateway-Fabric
$nginxGatewayFabricPrerequisites = -join($ManifestPath,'nginx_gateway/','nginx_gateway_fabric_prerequisites.yaml');
$nginxGatewayFabricHelmValuesManifest = -join($ManifestPath,'nginx_gateway/','nginx_gateway_fabric_helm_values.yaml');
$nginxGatewaySetupManifest = -join($ManifestPath,'nginx_gateway/','nginx_gateway_setup.yaml');

Write-Host (-join('Deploying Nginx-Gateway-Fabric prerequisites with: ',$nginxGatewayFabricPrerequisites, ' ...'));
kubectl apply -f $nginxGatewayFabricPrerequisites;
Write-Host ('Successfully deployed Nginx-Gateway-Fabric prerequisites.');
Write-Host ('=========================================================');

Write-Host ('Deploying Nginx-Gateway-Fabric with helm...');
helm upgrade ngf oci://ghcr.io/nginx/charts/nginx-gateway-fabric --install `
    --namespace nginx-gateway `
    --version 2.5.1 `
    -f $nginxGatewayFabricHelmValuesManifest `
    --set nginx.service.type="LoadBalancer" `
    --set nginx.service.loadBalancerIP=$nginxGatewayLoadBalancerIp `
    --set nginxGateway.replicas=3 `
    --set nginxGateway.snippets.enable=true;

Invoke-AKS-App-Health-Check -aksNamespace 'nginx-gateway' -apps @('nginx-gateway-fabric') -appReadyInitialWaitSeconds 20 -appHealthCheckMaxAttempts 60;
Write-Host ('Successfully deployed Nginx-Gateway-Fabric via helm.');
Write-Host ('=========================================================');

Write-Host (-join('Deploying Nginx-Gateway with: ',$nginxGatewaySetupManifest, ' ...'));
kubectl apply -f $nginxGatewaySetupManifest;

# New script section begins here
# Here we add the new code segment to ensure certs are deleted
# Note that the name selfhost-apps-gateway-nginx-agent-tls changes
# depending on the name you give to the gateway
Write-Host ('--------------------------------------------------------');
Write-Host ('Cleaning up existing secrets for nginx-gateway if any...');
$nginxSecretsToDelete = @(
    'selfhost-apps-gateway-nginx-agent-tls',
    'nginx-gateway-ca',
    'agent-tls',
    'server-tls'
);
Write-Host ('Existing secrets in nginx-gateway namespace before cleanup are:');
kubectl get secret -n nginx-gateway;
Write-Host ('--------------------------------------------------------');

$existingNginxSecrets = kubectl get secret -n nginx-gateway -o json | ConvertFrom-Json;
if (($null -eq $existingNginxSecrets) -or ($null -eq $existingNginxSecrets.items) -or ($existingNginxSecrets.items.Count -le 0)) {
    Write-Host ('No existing secrets found in nginx-gateway namespace. Skipping deletion.');
}
else {
    $existingNginxSecretNames = $existingNginxSecrets.items | ForEach-Object { $_.metadata.name };
    foreach ($nginxSecret in $nginxSecretsToDelete) {
        if ($existingNginxSecretNames -contains $nginxSecret) {
            Write-Host ("Deleting secret: $nginxSecret");
            kubectl delete secret $nginxSecret -n nginx-gateway;
        }
        else {
            Write-Host ("Secret not found, skipping: $nginxSecret");
        }
    }
}

Write-Host ('Waiting for secrets to be automatically recreated by nginx gateway fabric operator if they are deleted...');
Start-Sleep -Seconds 10;
Write-Host ('Secrets deleted and recreated automatically. Current secrets in nginx-gateway namespace are:');
Write-Host ('--------------------------------------------------------');
kubectl get secret -n nginx-gateway;
Write-Host ('--------------------------------------------------------');
Write-Host ('Successfully refreshed secrets for nginx-gateway.');
Write-Host ('=========================================================');

Write-Host ('Existing secrets are refreshed. Restarting nginx gateway fabric...');
kubectl rollout restart deployment/ngf-nginx-gateway-fabric -n nginx-gateway;
# Once we complete the cert refresh and restart of control plane and then data plane 
# we can check if the data plane pods are started and load balancer is working as expected
# New script section ends here

Invoke-AKS-App-Health-Check -aksNamespace 'nginx-gateway' -apps @('nginx-gateway-fabric') -appReadyInitialWaitSeconds 20 -appHealthCheckMaxAttempts 60;
Write-Host ('Successfully restarted Nginx-Gateway-Fabric.');
Write-Host ('=========================================================');

Write-Host ('Existing secrets are refreshed. Restarting nginx gateway...');
kubectl rollout restart deployment/selfhost-apps-gateway-nginx -n nginx-gateway;
Invoke-AKS-App-Health-Check -aksNamespace 'nginx-gateway' -apps @('selfhost-apps-gateway') -appLabelName 'gateway.networking.k8s.io/gateway-name' -appReadyInitialWaitSeconds 30 -appHealthCheckMaxAttempts 60;
Invoke-AKS-Load-Balancer-Health-Check -loadBalancerServiceName 'selfhost-apps-gateway-nginx' -loadBalancerIP $nginxGatewayLoadBalancerIp -aksNamespace 'nginx-gateway';
Write-Host ('Successfully deployed Nginx-Gateway.');
Write-Host ('=========================================================');
#endregion Nginx-Gateway with Nginx-Gateway-Fabric

Check the post "High Availability Deployment of Nginx Gateway Fabric Replacing Retired Ingress Nginx in AKS - Part 2 - Deploy Nginx-Gateway-Fabric" to understand the full setup of the script. The additional part is marked in the script section above.
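The script waits a fixed 10 seconds for the secrets to reappear. If recreation sometimes takes longer in your cluster, a polling loop is one alternative. The following is a minimal sketch and not part of the original script: Wait-NginxGatewaySecrets and its parameters are hypothetical names, and it assumes that all listed secrets are recreated before the restarts, as described in this post.

```powershell
# Hypothetical helper: polls until every expected secret exists again,
# instead of relying on a fixed sleep. Returns $true on success, $false on timeout.
function Wait-NginxGatewaySecrets {
    param(
        [string[]] $SecretNames,
        [string] $Namespace = 'nginx-gateway',
        [int] $MaxAttempts = 30,
        [int] $DelaySeconds = 2,
        # Injectable for testing; by default asks kubectl for the secret names.
        [scriptblock] $GetSecretNames = {
            param($ns)
            kubectl get secret -n $ns -o jsonpath='{.items[*].metadata.name}'
        }
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        # kubectl prints the names separated by spaces; drop empty entries.
        $existing = @((& $GetSecretNames $Namespace) -split '\s+' | Where-Object { $_ })
        $missing = @($SecretNames | Where-Object { $existing -notcontains $_ })
        if ($missing.Count -eq 0) {
            return $true
        }
        Write-Host ("Attempt $attempt of $MaxAttempts, still waiting for: " + ($missing -join ', '))
        Start-Sleep -Seconds $DelaySeconds
    }
    return $false
}
```

In the script above, a call such as Wait-NginxGatewaySecrets -SecretNames $nginxSecretsToDelete could stand in for the Start-Sleep line. A $false result means that some secrets were not recreated in time, so it is probably safer to stop there than to restart the pods.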

With this change, the upgrade of cert-manager and nginx gateway runs smoothly without any issues.
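To confirm that the handshake errors are gone after the restart, the recent data plane logs can be scanned for the x509 message from the issue. This is a rough sketch under assumptions: Select-TlsHandshakeErrors and Test-NginxGatewayTlsHealthy are hypothetical helper names, and the label selector reuses the gateway-name label and the selfhost-apps-gateway name from the health check in the script.

```powershell
# Hypothetical helper: keeps only the log lines that contain the x509 trust error.
function Select-TlsHandshakeErrors {
    param([string[]] $LogLines)
    $pattern = 'x509: certificate signed by unknown authority'
    return @($LogLines | Where-Object { $_ -like "*$pattern*" })
}

# Hypothetical helper: returns $true when no trust errors appear in recent data plane logs.
function Test-NginxGatewayTlsHealthy {
    param(
        [string] $Namespace = 'nginx-gateway',
        [string] $GatewayName = 'selfhost-apps-gateway',
        [string] $Since = '5m'
    )
    $selector = "gateway.networking.k8s.io/gateway-name=$GatewayName"
    # --tail=-1 lifts the default line limit that applies when a selector is used.
    $logs = kubectl logs -n $Namespace -l $selector --all-containers "--since=$Since" --tail=-1
    $tlsErrors = @(Select-TlsHandshakeErrors -LogLines $logs)
    foreach ($line in $tlsErrors) {
        Write-Host $line
    }
    return ($tlsErrors.Count -eq 0)
}
```

Test-NginxGatewayTlsHealthy could be called a few minutes after the data plane restart. Since the issue can also appear after a time interval, a single clean result is only an indication.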
