fix: readiness needs to be like liveness (#9941)

Readiness has no reason to be cluster-scoped,
because that is not how Kubernetes networking
works for pods: the pods of a deployment do not
share a single network namespace. Each pod runs
in a network scope local to itself, and when its
readiness probe fails the pod is potentially
taken out of the Service endpoints and is no
longer resolvable - this affects a distributed
setup in myriad different ways.

Instead, readiness should behave like liveness,
with purely local scope, and should be a dummy
implementation that always succeeds.
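
Conceptually, the dummy local-scope handlers
amount to the following sketch (illustrative
only, not the actual MinIO handler code):

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Both probes are purely local, always-succeeding checks, so a
	// pod's Service membership never depends on cluster-wide state.
	ok := func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}
	http.HandleFunc("/minio/health/live", ok)
	http.HandleFunc("/minio/health/ready", ok)
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```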

With this PR, individual pod startup times and
the overall Kubernetes startup time improve
dramatically.

Added another handler at `/minio/health/cluster`
to report cluster-scoped health.
Author: Harshavardhana
Date: 2020-06-30 11:28:27 -07:00
Committed by: GitHub
Parent: 27a1f3ed2b
Commit: c0ac25bfff
6 changed files with 46 additions and 39 deletions

@@ -1,35 +1,42 @@
 ## MinIO Healthcheck
-MinIO server exposes two un-authenticated healthcheck endpoints - a liveness probe and a readiness probe - at `/minio/health/live` and `/minio/health/ready` respectively.
+MinIO server exposes three un-authenticated healthcheck endpoints - a liveness probe, a readiness probe and a cluster probe - at `/minio/health/live`, `/minio/health/ready` and `/minio/health/cluster` respectively.
 ### Liveness probe
-This probe is used to identify situations where the server is running but may not behave optimally, i.e. sluggish response or a corrupt back-end. Such problems can be fixed *only* by a restart.
-Internally, the MinIO liveness probe handler checks that the backend is alive and in read quorum to take requests.
-When the liveness probe fails, Kubernetes-like platforms restart the container.
+This probe always responds with '200 OK'. When the liveness probe fails, Kubernetes-like platforms restart the container.
+```
+livenessProbe:
+  httpGet:
+    path: /minio/health/live
+    port: 9000
+    scheme: HTTP
+  initialDelaySeconds: 3
+  periodSeconds: 1
+  timeoutSeconds: 1
+  successThreshold: 1
+  failureThreshold: 3
+```
 ### Readiness probe
-This probe is used to identify situations where the server is not yet ready to accept requests. In most cases such conditions recover in some time, for example quorum not being available on drives due to load.
-Internally, the MinIO readiness probe handler checks that the backend is alive and in read quorum; if so the server returns '200 OK', otherwise '503 Service Unavailable'.
-Platforms like Kubernetes *do not* forward traffic to a pod until its readiness probe is successful.
-### Configuration example
-Sample `liveness` and `readiness` probe configuration in a Kubernetes `yaml` file can be found [here](https://github.com/minio/minio/blob/master/docs/orchestration/kubernetes/minio-standalone-deployment.yaml).
-### Configure readiness deadline
-Readiness checks need to respond faster in orchestrated environments; to facilitate this you can set the following environment variable before starting MinIO:
-```
-MINIO_API_READY_DEADLINE (duration) sets the deadline for the health check API /minio/health/ready, e.g. "1m"
-```
-Set a *5s* deadline to ensure the MinIO readiness handler responds within 5 seconds.
-```
-export MINIO_API_READY_DEADLINE=5s
-```
+This probe always responds with '200 OK'. When the readiness probe fails, Kubernetes-like platforms *do not* forward traffic to the pod.
+```
+readinessProbe:
+  httpGet:
+    path: /minio/health/ready
+    port: 9000
+    scheme: HTTP
+  initialDelaySeconds: 3
+  periodSeconds: 1
+  timeoutSeconds: 1
+  successThreshold: 1
+  failureThreshold: 3
+```
+### Cluster probe
+This probe is rarely useful in normal operation; it is meant for administrators to see whether quorum is available in a given cluster. The reply is '200 OK' if the cluster has quorum; otherwise it returns '503 Service Unavailable'.
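
As a usage sketch outside this commit, an administrator could check the new cluster endpoint from Go (the address is hypothetical):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical address; '200 OK' means the cluster has quorum,
	// '503 Service Unavailable' means it does not.
	resp, err := http.Get("http://localhost:9000/minio/health/cluster")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Println("cluster quorum not available:", resp.Status)
		os.Exit(1)
	}
	fmt.Println("cluster quorum OK")
}
```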