pod/ liveness-probe(self-healing)

부엉이사장 2023. 11. 27. 16:04

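# Basic concept

A liveness probe is a health check that the kubelet runs against a container inside a pod. If the check shows that the container is no longer working as intended, the container is restarted.

For example, if you start a web server and requests to port 80 come back with a 5xx response, the container gets restarted.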

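# Types

- httpGet

Keeps sending HTTP GET requests to the given port and path (port 80 for a typical website) and checks whether it responds.

If the status code is not a success code (200 up to 399), the container is restarted.

- tcpSocket

Tries to open a TCP connection to the specified port; if the connection cannot be made, the container is restarted.

- exec

Keeps running a command inside the container; if the exit code is not 0, the container is restarted.

In other words, the command itself is the health check: a non-zero exit, or no result within the timeout, counts as a failure.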
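Of the three types, tcpSocket is the only one without a manifest in this post, so here is a minimal sketch of one. The pod name and the timing values are made up for illustration; only the `tcpSocket` / `port` fields are the point.

```yaml
# Hypothetical manifest: the pod name and timing values are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: tcp-probe-pod
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    livenessProbe:
      tcpSocket:
        port: 80            # the check passes if a TCP connection can be opened
      periodSeconds: 10     # try to connect every 10 seconds
      failureThreshold: 3   # restart after 3 consecutive failed attempts
```

Since this only checks that the port accepts connections, it would not catch a server that is up but answering with 500s; httpGet is the better fit for that case.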

# Let's look at an HTTP example

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - image: nginx
    name: test-pod
    ports:
    - containerPort: 80
      protocol: TCP
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 0
      periodSeconds: 1
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 1

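The livenessProbe section goes inside an entry of containers, at the same level as image and name.

The check method here is httpGet: an HTTP GET request is sent to path / on port 80, and if no healthy response comes back, the container is restarted.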

kubectl create -f nginx-pod.yaml

Creating the pod this way puts the probe into effect.

# What are those initialDelaySeconds-style fields at the bottom?

Configure Probes

initialDelaySeconds: Number of seconds after the container has started before startup, liveness or readiness probes are initiated. If a startup probe is defined, liveness and readiness probe delays do not begin until the startup probe has succeeded. If the value of periodSeconds is greater than initialDelaySeconds then the initialDelaySeconds would be ignored. Defaults to 0 seconds. Minimum value is 0.
periodSeconds: How often (in seconds) to perform the probe. Default to 10 seconds. The minimum value is 1.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup Probes. Minimum value is 1.
failureThreshold: After a probe fails failureThreshold times in a row, Kubernetes considers that the overall check has failed: the container is not ready/healthy/live. For the case of a startup or liveness probe, if at least failureThreshold probes have failed, Kubernetes treats the container as unhealthy and triggers a restart for that specific container. The kubelet honors the setting of terminationGracePeriodSeconds for that container. For a failed readiness probe, the kubelet continues running the container that failed checks, and also continues to run more probes; because the check failed, the kubelet sets the Ready condition on the Pod to false.
terminationGracePeriodSeconds: configure a grace period for the kubelet to wait between triggering a shut down of the failed container, and then forcing the container runtime to stop that container. The default is to inherit the Pod-level value for terminationGracePeriodSeconds (30 seconds if not specified), and the minimum value is 1. See probe-level terminationGracePeriodSeconds for more detail.

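This is what the official documentation says.

Roughly, in plain terms:

initialDelaySeconds : how many seconds after the container starts should the first check run?
timeoutSeconds : the maximum time to wait for a response. If the expected response does not arrive within this time, the attempt counts as a failure.
periodSeconds : how often (in seconds) should the check run?
successThreshold : how many consecutive successful checks, after a failure, count as success. It must be 1 for a liveness probe.
failureThreshold : how many consecutive failed checks count as a failure and trigger a restart.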
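To make the five fields concrete, here is a sketch of a less aggressive probe than the 1-second settings used in this post. The pod name and every number are assumptions picked for illustration, not recommended values.

```yaml
# Hypothetical manifest: the name and all timing values are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-probe-relaxed
spec:
  containers:
  - name: nginx
    image: nginx
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5   # first check about 5 s after the container starts
      periodSeconds: 10        # then one check every 10 s
      timeoutSeconds: 2        # each check waits at most 2 s for a response
      successThreshold: 1      # must be 1 for a liveness probe
      failureThreshold: 3      # restart only after 3 consecutive failures
```

With these numbers, a container that stops answering would be flagged after roughly periodSeconds x failureThreshold = 30 seconds of continuous failures, so a single slow response does not cause a restart.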

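With a plain nginx pod, though, you can't tell whether the probe actually works, because nginx just keeps answering 200.

So there is an image made for practicing exactly this.

It is called smlinux/unhealthy.

It answers the first five HTTP requests with 200 and returns a 500 error after that.

So the pod should be Running at first and turn unhealthy once it has received five requests.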

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: test-pod
  name: test-pod
spec:
  containers:
  - image: smlinux/unhealthy
    name: test-pod
    ports:
    - containerPort: 8080
      protocol: TCP
    livenessProbe:
      httpGet:
        path: /
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 1
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 1

Trying it out:

kubectl create -f test-pod.yaml

kubectl get pod -o wide

The restart count shows the container has already been restarted 6 times.

This happens because the container is restarted automatically once it starts returning 500.

kubectl describe pod test-pod

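The events in the describe output for test-pod show the same cycle repeating: the container starts successfully, gets killed (500 response), then the image is pulled again, and so on.

The image is supposed to return 200 for only the first five HTTP requests, and I had set periodSeconds to 1 second here.

Yet the restarts happen around the 50-second mark, so it looks as if the image starts returning 500 after about 50 seconds rather than after five requests.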

# Let's test exec too

apiVersion: v1
kind: Pod
metadata:
  name: liveness-exam
spec:
  containers:
  - name: busybox-container
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 2

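Looking at this YAML file:

The container uses the busybox image, and a command is passed to it through args when it is created.

touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600

This first creates a file called /tmp/healthy, sleeps for 30 seconds, and then deletes the file.

The livenessProbe checks for this file with the command in its command field (cat /tmp/healthy).

If the file does not exist, cat exits with a non-zero code and the check counts as a failure.

After an initial delay of 10 seconds, the command runs every 5 seconds.

For the first 30 seconds the file exists, so the checks succeed; after that rm -rf has deleted it, so they fail.

Testing it for real:

At first the container is healthy.

I missed the screenshot, but the restart count went up by one at around 70 seconds.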

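I expected the restart at around 40 seconds, and the probe does reach failureThreshold (2 consecutive failures at a 5-second period) about 35 to 40 seconds in. The restart only shows up once the old container has actually stopped, though. The likely cause of the extra delay is the termination grace period: the shell running as PID 1 does not exit on SIGTERM, so the kubelet waits the default terminationGracePeriodSeconds of 30 seconds before force-stopping it, which lands at roughly 70 seconds.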
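One way to check whether the termination grace period accounts for the gap is to shorten it for liveness failures only. The sketch below is the same pod with a probe-level terminationGracePeriodSeconds, the field mentioned in the documentation quoted earlier. The pod name and the value 5 are assumptions, and the field is only available on cluster versions that support probe-level grace periods.

```yaml
# Hypothetical variant of the exec example; name and grace period are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: liveness-exam-fast
spec:
  containers:
  - name: busybox-container
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 2
      # Assumed value: after a failed liveness check, wait at most 5 s
      # before force-stopping the container (instead of the pod-level 30 s).
      terminationGracePeriodSeconds: 5
```

If the grace period really is the cause, the restart should move much closer to the expected 40-second mark; if it stays near 70 seconds, the delay comes from somewhere else.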

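# A probe can have only one check method

In the YAML, a container's livenessProbe can hold only one check method.

For example, you cannot put httpGet and exec together in the same livenessProbe.

When I tried it, the container was not created.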
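The limit applies to a single livenessProbe, not to the whole pod: every container in a pod can carry its own probe with its own method. Below is a minimal sketch of that, with a made-up pod name and container names; the busybox command simply keeps the second container alive.

```yaml
# Hypothetical manifest: pod and container names are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: two-probes-pod
spec:
  containers:
  - name: web
    image: nginx
    livenessProbe:
      httpGet:              # the single check method for this container's probe
        path: /
        port: 80
  - name: sidecar
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 3600
    livenessProbe:
      exec:                 # a different method, but in a different container
        command:
        - cat
        - /tmp/healthy
```

Each container is checked and restarted independently, so a failing exec check on the sidecar would restart only the sidecar.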