Kubernetes Story
Key alert rules in Prometheus
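To notice failures in Kubernetes quickly, most clusters are set up to send alerts through Prometheus and Alertmanager.
This post looks at the key rules for detecting those failures.
For the alert rule syntax itself, see https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/ . Here we only go through the most useful expr values.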
Registering Prometheus rules and configuring AlertManager
PrometheusRule format
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: prometheus
    role: alert-rules
  name: node-rule
spec:
  groups:
    - name: alerting_rules
      rules:
        - alert: node_alert
          annotations:
            summary: "Kubernetes node not ready (instance {{ $labels.instance }})"
            description: "Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
          # The Ready/true series is 1 while the node is healthy and 0 otherwise, so alert on == 0.
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 10m
          labels:
            severity: critical
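The labels on the PrometheusRule matter because the operator only loads rules that the Prometheus resource selects. A minimal sketch of that selection is shown here; the object name `k8s` and the namespace `monitoring` are assumptions for illustration, not values from this post.

```yaml
# Sketch only: metadata.name and metadata.namespace are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  # Load every PrometheusRule that carries both labels used above.
  ruleSelector:
    matchLabels:
      app: prometheus
      role: alert-rules
  # An empty selector matches all namespaces; if the field is omitted,
  # only the namespace of this Prometheus object is searched.
  ruleNamespaceSelector: {}
```

If a rule never shows up on the Prometheus "Rules" page, a mismatch between these selectors and the rule's labels or namespace is a likely cause.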
AlertManager configuration
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config
  labels:
    alertmanagerConfig: alert
spec:
  route:
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 1h
    receiver: 'email-notifications'
  receivers:
    - name: email-notifications
      emailConfigs:
        - sendResolved: false
          to: <email>
          from: <email>
          hello: localhost
          smarthost: smtp.gmail.com:587
          authUsername: <email>
          authPassword:
            name: gmail-secret
            key: password
# kubectl create secret generic gmail-secret --from-literal=password=APP_PASSWORD
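As with rules, an AlertmanagerConfig is only picked up when the Alertmanager resource selects it by label. The sketch below assumes an Alertmanager named `main` in `kube-system` (inferred from the `alertmanager-main` secret used later in this post); adjust both to your installation.

```yaml
# Sketch only: name, namespace, and replicas are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: kube-system
spec:
  replicas: 1
  # Select AlertmanagerConfig objects labelled alertmanagerConfig: alert.
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: alert
```

Note that, by default, the operator is expected to scope the routes of an AlertmanagerConfig to alerts whose `namespace` label equals the namespace of that AlertmanagerConfig, so cluster-wide alerts may need the configuration to live in the matching namespace.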
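Alternatively, keep the Alertmanager configuration as YAML in a secret, as shown below.
Always back up the secret before modifying it.
# kubectl get secret -n kube-system alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' | base64 -d > alert.yaml
Copy the backup to a working file, edit the receivers and other settings there, then replace the secret.
# cp alert.yaml alertmanager.yaml
# kubectl create secret generic alertmanager-main --from-literal=alertmanager.yaml="$(< alertmanager.yaml)" --dry-run=client -o=yaml | kubectl -n kube-system replace secret --filename=-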
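The alertmanager.yaml stored in that secret uses Alertmanager's native configuration format (snake_case keys), not the camelCase fields of the AlertmanagerConfig resource. As a rough sketch, an email setup equivalent to the one above might look like this; the `<email>` and `<app-password>` placeholders must be filled in, and in this format the password is written inline rather than referenced from a secret.

```yaml
# Sketch of a native alertmanager.yaml; placeholders are not real values.
global:
  smtp_smarthost: smtp.gmail.com:587
  smtp_hello: localhost
  smtp_from: <email>
  smtp_auth_username: <email>
  smtp_auth_password: <app-password>
route:
  receiver: email-notifications
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
receivers:
  - name: email-notifications
    email_configs:
      - to: <email>
        send_resolved: false
```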
Expression examples
Node ready
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
Memory pressure
expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
Disk pressure
expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
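Out of disk (recent Kubernetes versions no longer report the OutOfDisk condition, so this rule is only useful on older clusters)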
expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1
Out of capacity
expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
Container oom killer
expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
Job failed
expr: kube_job_status_failed > 0
Cronjob suspended
expr: kube_cronjob_spec_suspend != 0
PersistentVolumeClaim pending
expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
PersistentVolume error
expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0
StatefulSet down
expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1
Readiness probe failure
expr: sum by (pod) (kube_pod_info{created_by_kind!="Job"} and on (pod, namespace) kube_pod_status_ready{condition="false"} == 1) > 0
HPA scaling ability
expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="AbleToScale"} == 1
Pod not healthy
expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
Pod crash looping
expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
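Each expression above still has to be wrapped in a rule before it can fire. The following sketch puts two of them into a PrometheusRule using the same labels as the node rule earlier; the object name, group name, alert names, `for` durations, severities, and annotation texts are all assumptions to be tuned for your environment.

```yaml
# Sketch only: names, durations, and severities are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-rule
  labels:
    app: prometheus
    role: alert-rules
spec:
  groups:
    - name: workload_alerts
      rules:
        - alert: pod_crash_looping
          expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Pod crash looping ({{ $labels.namespace }}/{{ $labels.pod }})"
            description: "Container {{ $labels.container }} keeps restarting\n VALUE = {{ $value }}"
        - alert: job_failed
          expr: kube_job_status_failed > 0
          labels:
            severity: warning
          annotations:
            summary: "Job failed ({{ $labels.namespace }}/{{ $labels.job_name }})"
            description: "Job {{ $labels.job_name }} has failed pods\n VALUE = {{ $value }}"
```

Leaving out `for` on the job rule makes it fire on the first evaluation that sees a failed job, while the `for: 2m` on the crash-loop rule is there to filter out a single burst of restarts.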
References
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://github.com/prometheus-operator/prometheus-operator/tree/main/example/user-guides/alerting
https://awesome-prometheus-alerts.grep.to/rules.html