Key alert rules in Prometheus

kmaster 2022. 5. 6. 19:20

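To notice failures in a Kubernetes cluster quickly, most teams use Prometheus together with Alertmanager and configure them to send alerts.

This post looks at the main rules for detecting failures.

For the alert rule syntax itself, see https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/. Here we focus only on the most useful expr values.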

 

Registering a PrometheusRule and configuring Alertmanager

 

PrometheusRule format

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: prometheus
    role: alert-rules
  name: node-rule
spec:
  groups:
  - name: alerting_rules
    rules:
    - alert: node_alert
      annotations:
        summary: Kubernetes node not ready (instance {{ $labels.instance }})
        description: "Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      # The metric is 0 or 1, so "> 1" would never fire; alert when Ready/true is 0.
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 10m
      labels:
        severity: critical
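
The labels on the PrometheusRule matter because the Prometheus Operator only loads rule objects that match the ruleSelector of the Prometheus resource. The fragment below is a sketch of that selector, assuming the labels used above; the resource name k8s is a hypothetical example, and the fragment is meant to be merged into your existing Prometheus object rather than applied as is.

```yaml
# Fragment of a Prometheus custom resource (merge into the existing object).
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s            # hypothetical name; use the name of your Prometheus object
spec:
  # Only PrometheusRule objects carrying both labels are loaded.
  ruleSelector:
    matchLabels:
      app: prometheus
      role: alert-rules
```

If the selector in your cluster uses different labels, change the labels on the PrometheusRule instead of the selector.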

 

Alertmanager configuration

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config
  labels:
    alertmanagerConfig: alert
spec:
  route:
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 1h
    receiver: 'email-notifications'
  receivers:
  - name: email-notifications
    emailConfigs:
    - sendResolved: false
      to: <email>
      from: <email>
      hello: localhost
      smarthost: smtp.gmail.com:587
      authUsername: <email>
      authPassword:
        name: gmail-secret
        key: password
# kubectl create secret generic gmail-secret --from-literal=password=APP_PASSWORD
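
The kubectl command in the comment above can also be written declaratively. The manifest below is a minimal sketch: APP_PASSWORD is a placeholder, and the Secret is assumed to live in the same namespace as the AlertmanagerConfig that references it. The second document is a fragment showing how an Alertmanager resource might select the config by its label; the name main is an assumption based on the alertmanager-main secret name used later in this post.

```yaml
# Secret referenced by authPassword in the AlertmanagerConfig.
apiVersion: v1
kind: Secret
metadata:
  name: gmail-secret
type: Opaque
stringData:
  password: APP_PASSWORD   # placeholder; replace with the real app password
---
# Fragment of an Alertmanager custom resource (merge into the existing object).
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main               # assumed name
spec:
  # Picks up AlertmanagerConfig objects labelled alertmanagerConfig: alert.
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: alert
```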

 

Alternatively, dump the Alertmanager secret to a YAML file and edit it directly, as shown below.

Always back up the secret before modifying it.

 

# kubectl get secret -n kube-system alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' |base64 -d > alertmanager.yaml

 

After editing the receivers and any other settings, replace the secret:

# kubectl create secret generic alertmanager-main --from-literal=alertmanager.yaml="$(< alertmanager.yaml)" --dry-run=client -o=yaml | kubectl -n kube-system replace secret --filename=-
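
For orientation, the file stored in the secret uses Alertmanager's native snake_case format, not the camelCase fields of the AlertmanagerConfig resource. The sketch below mirrors the email settings shown earlier; the addresses and password are placeholders, and a real file will usually contain more routes and receivers.

```yaml
# alertmanager.yaml (native Alertmanager format), a minimal sketch.
global:
  smtp_smarthost: smtp.gmail.com:587
  smtp_from: '<email>'                 # placeholder
  smtp_hello: localhost
  smtp_auth_username: '<email>'        # placeholder
  smtp_auth_password: 'APP_PASSWORD'   # placeholder
route:
  receiver: email-notifications
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
receivers:
  - name: email-notifications
    email_configs:
      - to: '<email>'                  # placeholder
        send_resolved: false
```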

 

Expression examples

 

Node ready

expr: kube_node_status_condition{condition="Ready",status="true"} == 0

 

Memory pressure

expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1

 

Disk pressure

expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1

 

Out of disk

expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1

 

Out of capacity

expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90

 

Container oom killer

expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1

 

Job failed

expr: kube_job_status_failed > 0

 

Cronjob suspended

expr: kube_cronjob_spec_suspend != 0

 

PersistentVolumeClaim pending

expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1

 

PersistentVolume error

expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0

 

StatefulSet down

expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1

 

Readiness probe failure

expr: sum by (pod) (kube_pod_info{created_by_kind!="Job"} and on (pod, namespace) kube_pod_status_ready{condition="false"} == 1) > 0

 

HPA scaling ability

expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="AbleToScale"} == 1

 

Pod not healthy

expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0

 

Pod crash looping

expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
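
Each expression in this section is only the expr field of a rule; to use one, wrap it in a rule entry of a PrometheusRule. The manifest below is a sketch using the crash looping expression. The object name, alert name, for duration, severity, and annotation text are illustrative assumptions, not values from a reference configuration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-rule                 # hypothetical name
  labels:
    app: prometheus
    role: alert-rules
spec:
  groups:
  - name: alerting_rules
    rules:
    - alert: pod_crash_looping   # hypothetical alert name
      expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
      for: 2m                    # assumed; tune to how quickly you want to be paged
      labels:
        severity: warning        # assumed severity
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        description: "Container {{ $labels.container }} restarted more than 3 times in the last minute\n  VALUE = {{ $value }}"
```

The same skeleton works for the other expressions; only alert, expr, and the annotation text need to change.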

 

References

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/ 

https://github.com/prometheus-operator/prometheus-operator/tree/main/example/user-guides/alerting

https://awesome-prometheus-alerts.grep.to/rules.html

 

 

 
