Wednesday, 5 March 2025

Setting Up Alert for AKS Pod Restarts Using Log Analytics Workspace and Grafana

Azure Kubernetes Service (AKS) pod restart counts are available in the KubePodInventory table of the connected Log Analytics workspace. This data can be charted in Grafana, as described in the post "Pod Restart Counts Grafana Chart with Azure Monitor for AKS". Let's explore how to use the same information to create a Grafana alert that notifies us when pods restart in the apps of a given Kubernetes namespace.

The goal is to fire alerts from Grafana as shown below. Alerts can be delivered by email, Slack, and other channels; configuring delivery is not discussed in this post.


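The query below identifies new pod restarts for each app running in a given namespace by comparing the latest restart counts with those from about two minutes earlier. The earlier snapshot is the most recent one recorded between five minutes and 150 seconds ago; the 150-second cutoff normally selects the data from two minutes ago and, very rarely, the data from a minute before that.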

KubePodInventory
| where Namespace == 'demo'
| extend pod_label = todynamic(PodLabel)
| extend app_name = todynamic(pod_label[0].app)
| summarize pod_restarts = sum(PodRestartCount) by TimeGenerated, tostring(app_name)
| project TimeGenerated, app_name = tostring(app_name), pod_restarts
| join kind=inner (
    KubePodInventory
    | where Namespace == 'demo'
    | summarize TimeGenerated = max(TimeGenerated)
    | project TimeGenerated 
    )
    on TimeGenerated
| project TimeGenerated, app_name, current_pod_restarts = pod_restarts
| join kind=inner (
    KubePodInventory
    | where Namespace == 'demo'
    | extend pod_label = todynamic(PodLabel)
    | extend app_name = todynamic(pod_label[0].app)
    | summarize pod_restarts = sum(PodRestartCount) by TimeGenerated, tostring(app_name)
    | project TimeGenerated, app_name = tostring(app_name), pod_restarts
    | join kind=inner (
        KubePodInventory
        | where Namespace == 'demo'
        | where TimeGenerated between (ago(5m) .. ago(150s))
        | summarize TimeGenerated = max(TimeGenerated)
        | project TimeGenerated 
        )
        on TimeGenerated
    | project previous_time =TimeGenerated, app_name, previous_pod_restarts = pod_restarts
    )
    on app_name
| extend new_pod_restarts = current_pod_restarts - previous_pod_restarts
| extend app_message = strcat('between utc ', tostring(previous_time), ' and ', tostring(TimeGenerated)) 
| order by TimeGenerated asc 
| project TimeGenerated, app_name, app_message, new_pod_restarts


The above query can be used in a Grafana alert rule, as shown below, with Time series as the output format.


The query outputs three values per app: app_message contains the period considered for evaluating restarts, app_name gives the name of the Kubernetes app, and new_pod_restarts shows the number of restarts between the previous snapshot (about two minutes earlier) and the latest one.
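To make the arithmetic behind new_pod_restarts concrete, here is a minimal Python sketch that mirrors the query's logic outside Log Analytics. The sample rows, the app names "web" and "api", and the function name are made up for illustration; only the five-minute and 150-second bounds come from the query above.

```python
from datetime import datetime, timedelta


def new_pod_restarts(rows, now,
                     window_start=timedelta(minutes=5),
                     window_end=timedelta(seconds=150)):
    """Latest snapshot minus the newest snapshot in [now - 5m, now - 150s].

    rows: iterable of (time_generated, app_name, pod_restart_count),
    one entry per pod per snapshot (hypothetical shape).
    """
    times = sorted({t for t, _, _ in rows})
    if not times:
        return {}
    current_time = times[-1]
    # Same bounds as "between (ago(5m) .. ago(150s))", both ends inclusive.
    candidates = [t for t in times if now - window_start <= t <= now - window_end]
    if not candidates:
        return {}
    previous_time = max(candidates)

    def totals(at):
        # Sum restart counts per app for one snapshot time.
        out = {}
        for t, app, count in rows:
            if t == at:
                out[app] = out.get(app, 0) + count
        return out

    current, previous = totals(current_time), totals(previous_time)
    # Inner join on app_name: only apps present in both snapshots are reported.
    return {app: current[app] - previous[app] for app in current if app in previous}


# Made-up sample data: one snapshot per minute, two "web" pods and one "api" pod.
now = datetime(2025, 3, 5, 10, 0, 30)
t58 = datetime(2025, 3, 5, 9, 58)
t59 = datetime(2025, 3, 5, 9, 59)
t00 = datetime(2025, 3, 5, 10, 0)
rows = [
    (t58, "web", 1), (t58, "web", 0), (t58, "api", 5),
    (t59, "web", 2), (t59, "web", 0), (t59, "api", 5),
    (t00, "web", 3), (t00, "web", 0), (t00, "api", 5),
]

result = new_pod_restarts(rows, now)
print(result)
```

In this sample the 09:59 snapshot falls inside the last 150 seconds, so it is skipped and the latest snapshot is compared with the one from 09:58, two minutes earlier.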


Once the query is pasted and Time series is set as the output format, Grafana automatically sets up the correct expression rules.


The second expression is set as the alert condition.


The evaluation group defines when to alert. Here the group is set to evaluate every minute, and the pending period before the alert fires is set to one minute. These values, together with the two-minute interval considered in the query, help to effectively identify pod restart behaviour that should be alerted to the monitoring/support team. The values in the query and here can be altered and fine-tuned as needed to increase effectiveness.


With the current settings, when a new pod restart is detected in any app during the first minute, compared to the pod restart count two minutes ago, the alert goes to the pending state. At the next evaluation, a minute later, the pod restart count is still higher than the count two minutes earlier (which is one minute before the original detection time), so the alert fires. In the following minute, if there are no new restarts, the two counts are equal and the alert condition no longer holds. This can, rarely, trigger an alert twice for the same restart. However, this setup detects any new restart effectively.
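The sequence described above can be traced with a small Python simulation. This is a simplified model written for this post, not Grafana's actual implementation: it assumes the alert condition is new_pod_restarts greater than zero, one evaluation per minute, and a pending period equal to one evaluation. The function names are hypothetical.

```python
def two_minute_deltas(cumulative, lag=2):
    """Restart count at each minute minus the count `lag` minutes earlier."""
    return [count - cumulative[max(i - lag, 0)] for i, count in enumerate(cumulative)]


def simulate(deltas, pending_evaluations=1):
    """Return the alert state after each one-minute evaluation."""
    states = []
    breaching = 0  # consecutive evaluations where the condition held
    for delta in deltas:
        if delta > 0:  # assumed alert condition: any new restart
            breaching += 1
            # The alert fires only after the condition has held through the pending period.
            states.append("Alerting" if breaching > pending_evaluations else "Pending")
        else:
            breaching = 0
            states.append("Normal")
    return states


# Made-up scenario: a single restart happens just before minute 1.
cumulative_restarts = [0, 1, 1, 1, 1]
deltas = two_minute_deltas(cumulative_restarts)
states = simulate(deltas)
for minute, (delta, state) in enumerate(zip(deltas, states)):
    print(minute, delta, state)
```

Under these assumptions a single restart keeps the two-minute difference above zero for two consecutive evaluations, which is just long enough to pass the one-minute pending period and fire once before the alert returns to normal.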



Group settings are below.


Labels can be configured to route alert delivery based on label evaluation, which is not discussed in this post.


A summary with information about the alert can be added, as shown below.



If the alert is delivered to Slack, the details in the summary appear as shown below. Note that the time gap was set to many days here for testing the alert; it should be set as explained above to be effective.



