Fix delays and interim fix for the partial fix in #3502 (#3511)

This patch makes retryable storage use a higher delay before
object layer initialization: it waits for a longer period,
retrying up to 4 times with a time unit of seconds.

After the disks have been formatted it switches to another
configuration, i.e. a lower retry backoff rate, retrying
only once per 5 milliseconds.

The network IO error count allowed before we reject the disk
completely is reduced to a lower value, i.e. 256. This is done
so that the combination of retry logic and total error count
comes to roughly 2.5 seconds, which is when we take the disk
offline completely.

NOTE: This patch does not handle the case where a disk is
completely dead and comes back again after initialization.
Such a mutating state requires a change in our startup sequence,
which will be done subsequently. This is an interim fix to relieve
users of these issues.
Harshavardhana
2016-12-30 17:08:02 -08:00
committed by GitHub
parent dd68cdd802
commit 8562b22823
9 changed files with 191 additions and 68 deletions

@@ -89,14 +89,6 @@ var (
// List of admin peers.
globalAdminPeers = adminPeers{}
// Attempt to retry only this many number of times before
// giving up on the remote disk entirely.
globalMaxStorageRetryThreshold = 3
// Attempt to retry only this many number of times before
// giving up on the remote RPC entirely.
globalMaxAuthRPCRetryThreshold = 1
// Add new variable global values here.
)