Commit Graph

167 Commits

Author SHA1 Message Date
Harshavardhana f527c708f2
run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
Harshavardhana 4545ecad58
ignore swapped drives instead of throwing errors (#13655)
- add checks such that swapped disks are detected
  and ignored - never used for normal operations.

- implement `unrecognizedDisk` to be ignored with
  all operations returning `errDiskNotFound`.

- also add checks such that we do not load unexpected
  disks while connecting automatically.

- additionally humanize the values when printing the errors.

Bonus: fixes handling of non-quorum situations in
getLatestFileInfo(), that does not work when 2 drives
are down, currently this function would return errors
incorrectly.
2021-11-15 09:46:55 -08:00
Harshavardhana 8bb52c9c2a
fix: ignore disks that are available but not writable (#13585)
This is to allow replacing drives while some drives
while available are not writable.
2021-11-04 16:42:49 -07:00
Klaus Post d9c1d79e30
Protect logger targets (#13529)
Logger targets were not race protected against concurrent updates from for example `HTTPConsoleLoggerSys`.

Restrict direct access to targets and make slices immutable so a returned slice can be processed safely without locks.
2021-10-28 07:35:28 -07:00
Harshavardhana 0c48b1d993 fix: benchmarking test initialization
> go test -run=none -bench=Benchmark github.com/minio/minio/cmd

Runs now without any crashes.

fixes #13380
2021-10-08 11:38:30 -07:00
Klaus Post 421160631a
MakeBucket: Delete leftover buckets on error (#13368)
In (erasureServerPools).MakeBucketWithLocation deletes the created 
buckets if any set returns an error.

Add `NoRecreate` option, which will not recreate the bucket 
in `DeleteBucket`, if the operation fails.

Additionally use context.Background() for operations we always want to be performed.
2021-10-06 10:24:40 -07:00
Harshavardhana fabf60bc4c
fix: allow configuring cleanup of stale multipart uploads (#13354)
allow dynamically changing cleanup of stale multipart
uploads, their expiry and how frequently its checked.

Improves #13270
2021-10-04 10:52:28 -07:00
Anis Elleuch 1d9e91e00f
Fix wrong reporting of total disks after restart (#13326)
A restart of the cluster and a failed disk will wrongly count 
the number of total disks.
2021-09-29 11:36:19 -07:00
Anis Elleuch 68a2d6fc40
xl: Avoid empty endpoints (#13299)
An endpoint can be empty when a disk is offline or something 
wrong with it. Avoid it by filling erasureSets.endpointStrings 
with values from arguments.
2021-09-25 10:51:03 -07:00
Harshavardhana 1a884cd8e1
fix: deleting objects was not working after upgrades (#13242)
DeleteObject() on existing objects before `xl.json` to
`xl.meta` change were not working, not sure when this
regression was added. This PR fixes this properly.

Also this PR ensures that we perform rename of xl.json
to xl.meta only during "write" phase of the call i.e
either during Healing or PutObject() overwrites.

Also handles few other scenarios during migration where
`backendEncryptedFile` was missing deleteConfig() will
fail with `configNotFound` this case was not ignored,
which can lead to failure during upgrades.
2021-09-17 19:34:48 -07:00
Harshavardhana 6d42569ade
remove ListBucketsMetadata instead add them to AccountInfo() (#13241) 2021-09-17 15:02:21 -07:00
Harshavardhana 45bcf73185
feat: Add ListBucketsWithMetadata extension API (#13219) 2021-09-16 09:52:41 -07:00
Harshavardhana 787a72a993
make sure to ignore the rootDisk when healing drives (#13209)
fixes #13208
2021-09-14 15:10:00 -07:00
Harshavardhana 035882d292
fix: remove parentIsObject() check (#12851)
we will allow situations such as

```
a/b/1.txt
a/b
```

and

```
a/b
a/b/1.txt
```

we are going to document that this usecase is
not supported and we will never support it, if
any application does this users have to delete
the top level parent to make sure namespace is
accessible at lower level.

rest of the situations where the prefixes get
created across sets are supported as is.
2021-08-03 13:26:57 -07:00
Anis Elleuch b0b4696a64
heal: Add MRF metrics to background heal API response (#12398)
This commit gathers MRF metrics from 
all nodes in a cluster and return it to the caller. This will show information about the 
number of objects in the MRF queues 
waiting to be healed.
2021-07-15 22:32:06 -07:00
Harshavardhana 4669d19f2a
fix: simplify diskMap usage to keep certain checks predictable (#12519)
Bonus: also make sure that we Sanitize() the drives only during
startup of the server, but not during disk reconnects.
2021-06-16 14:26:26 -07:00
Anis Elleuch 7722b91e1d
s3: Force a prefix removal using a special header (#12504)
An S3 client can send `x-minio-force-delete: true` to remove a prefix.
2021-06-15 18:43:14 -07:00
Harshavardhana a93aa2eac1
fix: upon failure attempt an undo for all calls in DeleteBucket() (#12480)
its possible that, version might exist on second pool such that
upon deleteBucket() might have deleted the bucket on pool1 successfully
since it doesn't have any objects, undo such operations properly in
all any error scenario.

Also delete bucket metadata from pool layer rather than sets layer.
2021-06-09 17:13:00 -07:00
Anis Elleuch 8e9e028c0c
fix: safe update of the audit objectErasureMap (#12477)
objectErasureMap in the audit holds information about the objects
involved in the current S3 operation such as pool index, set an index,
and disk endpoints. One user saw a crash due to a concurrent update of
objectErasureMap information. Use sync.Map to prevent a crash.
2021-06-09 10:51:19 -07:00
Harshavardhana 542fe4ea2e
fix: legacy objects with 10MiB blockSize should use right buffers (#12459)
healing code was using incorrect buffers to heal older
objects with 10MiB erasure blockSize, incorrect calculation
of such buffers can lead to incorrect premature closure of
io.Pipe() during healing.

fixes #12410
2021-06-07 10:06:06 -07:00
Harshavardhana 1f262daf6f
rename all remaining packages to internal/ (#12418)
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
2021-06-01 14:59:40 -07:00
Harshavardhana 81d5688d56
move the dependency to minio/pkg for common libraries (#12397) 2021-05-28 15:17:01 -07:00
Anis Elleuch 56d4d7b8b1
MRF: Better detection of non stable disks (#12252)
MRF does not detect when a node is disconnected and reconnected quickly
this change will ensure that MRF is alerted by comparing the last disk
reconnection timestamp with the last MRF check time.

Signed-off-by: Anis Elleuch <anis@min.io>

Co-authored-by: Klaus Post <klauspost@gmail.com>
2021-05-11 09:19:15 -07:00
Harshavardhana 1aa5858543
move madmin to github.com/minio/madmin-go (#12239) 2021-05-06 08:52:02 -07:00
Krishnan Parthasarathi c829e3a13b Support for remote tier management (#12090)
With this change, MinIO's ILM supports transitioning objects to a remote tier.
This change includes support for Azure Blob Storage, AWS S3 compatible object
storage incl. MinIO and Google Cloud Storage as remote tier storage backends.

Some new additions include:

 - Admin APIs remote tier configuration management

 - Simple journal to track remote objects to be 'collected'
   This is used by object API handlers which 'mutate' object versions by
   overwriting/replacing content (Put/CopyObject) or removing the version
   itself (e.g DeleteObjectVersion).

 - Rework of previous ILM transition to fit the new model
   In the new model, a storage class (a.k.a remote tier) is defined by the
   'remote' object storage type (one of s3, azure, GCS), bucket name and a
   prefix.

* Fixed bugs, review comments, and more unit-tests

- Leverage inline small object feature
- Migrate legacy objects to the latest object format before transitioning
- Fix restore to particular version if specified
- Extend SharedDataDirCount to handle transitioned and restored objects
- Restore-object should accept version-id for version-suspended bucket (#12091)
- Check if remote tier creds have sufficient permissions
- Bonus minor fixes to existing error messages

Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Harshavardhana 069432566f update license change for MinIO
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Harshavardhana e85b28398b
fix: pre-allocate certain slices with expected capacity (#12044)
Avoids append() based tiny allocations on known
allocated slices repeated access.
2021-04-12 13:45:06 -07:00
Klaus Post 111c02770e
Fix data race when connecting disks (#11983)
Multiple disks from the same set would be writing concurrently.

```
WARNING: DATA RACE
Write at 0x00c002100ce0 by goroutine 166:
  github.com/minio/minio/cmd.(*erasureSets).connectDisks.func1()
      d:/minio/minio/cmd/erasure-sets.go:254 +0x82f

Previous write at 0x00c002100ce0 by goroutine 129:
  github.com/minio/minio/cmd.(*erasureSets).connectDisks.func1()
      d:/minio/minio/cmd/erasure-sets.go:254 +0x82f

Goroutine 166 (running) created at:
  github.com/minio/minio/cmd.(*erasureSets).connectDisks()
      d:/minio/minio/cmd/erasure-sets.go:210 +0x324
  github.com/minio/minio/cmd.(*erasureSets).monitorAndConnectEndpoints()
      d:/minio/minio/cmd/erasure-sets.go:288 +0x244

Goroutine 129 (finished) created at:
  github.com/minio/minio/cmd.(*erasureSets).connectDisks()
      d:/minio/minio/cmd/erasure-sets.go:210 +0x324
  github.com/minio/minio/cmd.(*erasureSets).monitorAndConnectEndpoints()
      d:/minio/minio/cmd/erasure-sets.go:288 +0x244
```
2021-04-06 11:33:10 -07:00
Harshavardhana d46386246f
api: Introduce metadata update APIs to update only metadata (#11962)
Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.

This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.

This helps reduce the chattiness for following calls..

- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
2021-04-04 13:32:31 -07:00
Anis Elleuch 2c296652f7
Simplify access to local node name (#11907)
The local node name is heavily used in tracing, create a new global 
variable to store it. Multiple goroutines can access it since it won't be
changed later.
2021-03-26 11:37:58 -07:00
Harshavardhana 90d8ec6310
fix: reject duplicate keys in PostPolicyJSON document (#11902)
fixes #11894
2021-03-25 13:57:57 -07:00
Anis Elleuch 14d89eaae4
mrf: Enhance behavior for better results (#11788)
MRF was starting to heal when it receives a disk connection event, which
is not good when a node having multiple disks reconnects to the cluster.

Besides, MRF needs Remove healing option to remove stale files.
2021-03-18 11:19:02 -07:00
Harshavardhana add3cd4e44
allow configuring delete cleanup interval from default 10minutes (#11818) 2021-03-17 15:15:58 -07:00
Anis Elleuch 57f3ed22d4
erasure: Reduce the interval of cleaning up .trash folder (#11741)
Reduce from 30 to 10 minutes.
2021-03-09 09:45:38 -08:00
Harshavardhana 9ccc483df6
[feat]: change erasure coding default block size from 10MiB to 1MiB (#11721)
major performance improvements in range GETs to avoid large
read amplification when ranges are tiny and random

```
-------------------
Operation: GET
Operations: 142014 -> 339421
Duration: 4m50s -> 4m56s
* Average: +139.41% (+1177.3 MiB/s) throughput, +139.11% (+658.4) obj/s
* Fastest: +125.24% (+1207.4 MiB/s) throughput, +132.32% (+612.9) obj/s
* 50% Median: +139.06% (+1175.7 MiB/s) throughput, +133.46% (+660.9) obj/s
* Slowest: +203.40% (+1267.9 MiB/s) throughput, +198.59% (+753.5) obj/s
```

TTFB from 10MiB BlockSize
```
* First Access TTFB: Avg: 81ms, Median: 61ms, Best: 20ms, Worst: 2.056s
```

TTFB from 1MiB BlockSize
```
* First Access TTFB: Avg: 22ms, Median: 21ms, Best: 8ms, Worst: 91ms
```

Full object reads however do see a slight change which won't be
noticeable in real world, so not doing any comparisons

TTFB still had improvements with full object reads with 1MiB

```
* First Access TTFB: Avg: 68ms, Median: 35ms, Best: 11ms, Worst: 1.16s
```

v/s

TTFB with 10MiB
```
* First Access TTFB: Avg: 388ms, Median: 98ms, Best: 20ms, Worst: 4.156s
```

This change should affect all new uploads, previous uploads should
continue to work with business as usual. But dramatic improvements can
be seen with these changes.
2021-03-06 14:09:34 -08:00
Harshavardhana d971061305
use listPathRaw for HealObjects() instead of expensive WalkVersions() (#11675) 2021-03-06 09:25:48 -08:00
Klaus Post fa9cf1251b
Imporve healing and reporting (#11312)
* Provide information on *actively* healing, buckets healed/queued, objects healed/failed.
* Add concurrent healing of multiple sets (typically on startup).
* Add bucket level resume, so restarts will only heal non-healed buckets.
* Print summary after healing a disk is done.
2021-03-04 14:36:23 -08:00
Harshavardhana c6a120df0e
fix: Prometheus metrics to re-use storage disks (#11647)
also re-use storage disks for all `mc admin server info`
calls as well, implement a new LocalStorageInfo() API
call at ObjectLayer to lookup local disks storageInfo

also fixes bugs where there were double calls to StorageInfo()
2021-03-02 17:28:04 -08:00
Harshavardhana b690304eed
use faster way for siphash (#11640) 2021-02-26 16:53:06 -08:00
Harshavardhana 6386b45c08
[feat] use rename instead of recursive deletes (#11641)
most of the delete calls today spend time in
a blocking operation where multiple calls need
to be recursively sent to delete the objects,
instead we can use rename operation to atomically
move the objects from the namespace to `tmp/.trash`

we can schedule deletion of objects at this
location once in 15, 30mins and we can also add
wait times between each delete operation.

this allows us to make delete's faster as well
less chattier on the drives, each server runs locally
a groutine which would clean this up regularly.
2021-02-26 09:52:27 -08:00
Andreas Auernhammer 1f659204a2
remove GetObject from ObjectLayer interface (#11635)
This commit removes the `GetObject` method
from the `ObjectLayer` interface.

The `GetObject` method is not longer used by
the HTTP handlers implementing the high-level
S3 semantics. Instead, they use the `GetObjectNInfo`
method which returns both, an object handle as well
as the object metadata.

Therefore, it is no longer necessary that a concrete
`ObjectLayer` implements `GetObject`.
2021-02-26 09:52:02 -08:00
Harshavardhana a8e4f64ff3 Revert "fix: remove persistence layer for metacache store in memory (#11538)"
This reverts commit b23659927c.
2021-02-24 22:24:51 -08:00
Harshavardhana b23659927c
fix: remove persistence layer for metacache store in memory (#11538)
store the cache in-memory instead of disks to avoid large
write amplifications for list heavy workloads, store in
memory instead and let it auto expire.
2021-02-24 15:51:41 -08:00
Harshavardhana aa7244a9a4
fix: make sure to convert the error properly in HealBucket() (#11610)
server startup code expects the object layer to properly
convert error into a proper type, so that in situations when
servers are coming up and quorum is not available servers
wait on each other.
2021-02-23 09:23:11 -08:00
Klaus Post 8a6b13c239
Avoid synchronizing usage writes (#11560)
If the periodic `case <-t.C:` save gets held up for a long time it will end up 
synchronize all disk writes for saving the caches.

We add jitter to per set writes so they don't sync up and don't hold a 
lock for the write, since it isn't needed anyway.

If an outage prevents writes for a long while we also add individual 
waits for each disk in case there was a queue.

Furthermore limit the number of buffers kept to 2GiB, since this could get 
huge in large clusters. This will not act as a hard limit but should be enough 
for normal operation.
2021-02-18 00:38:37 -08:00
Krishnan Parthasarathi b87fae0049
Simplify PutObjReader for plain-text reader usage (#11470)
This change moves away from a unified constructor for plaintext and encrypted
usage. NewPutObjReader is simplified for the plain-text reader use. For
encrypted reader use, WithEncryption should be called on an initialized PutObjReader.

Plaintext:
func NewPutObjReader(rawReader *hash.Reader) *PutObjReader

The hash.Reader is used to provide payload size and md5sum to the downstream
consumers. This is different from the previous version in that there is no need
to pass nil values for unused parameters.

Encrypted:
func WithEncryption(encReader *hash.Reader,
key *crypto.ObjectKey) (*PutObjReader, error)

This method sets up encrypted reader along with the key to seal the md5sum
produced by the plain-text reader (already setup when NewPutObjReader was
called).

Usage:
```
  pReader := NewPutObjReader(rawReader)
  // ... other object handler code goes here

  // Prepare the encrypted hashed reader
  pReader, err = pReader.WithEncryption(encReader, objEncKey)

```
2021-02-10 08:52:50 -08:00
Harshavardhana 88c1bb0720
fix: improper ticker usage in goroutines (#11468)
- lock maintenance loop was incorrectly sleeping
  as well as using ticker badly, leading to
  extra expiration routines getting triggered
  that could flood the network.

- multipart upload cleanup should be based on
  timer instead of ticker, to ensure that long
  running jobs don't get triggered twice.

- make sure to get right lockers for object name
2021-02-05 19:23:48 -08:00
Anis Elleuch e96fdcd5ec
tagging: Add event notif for PUT object tagging (#11366)
An optimization to avoid double calling for during PutObject tagging
2021-02-01 13:52:51 -08:00
Harshavardhana 1e53bf2789
fix: allow expansion with newer constraints for older setups (#11372)
currently we had a restriction where older setups would
need to follow previous style of "stripe" count being same
expansion, we can relax that instead newer pools can be
expanded for older setups with newer constraints of
common parity ratio.
2021-01-29 11:40:55 -08:00
Harshavardhana 6717295e18 fix: rename audit log docs and datastructure 2021-01-26 13:39:55 -08:00
Anis Elleuch 00cff1aac5
audit: per object send pool number, set number and servers per operation (#11233) 2021-01-26 13:21:51 -08:00
Harshavardhana 9cdd981ce7
fix: expire locks only on participating lockers (#11335)
additionally also add a new ForceUnlock API, to
allow forcibly unlocking locks if possible.
2021-01-25 10:01:27 -08:00
Klaus Post dac19d7272
Clarify root disk error (#11314)
Make it clearer what the problem is and how to resolve it.
2021-01-20 13:11:42 -08:00
Harshavardhana 3ca6330661
fix: optimize parentDirIsObject by moving isObject to storage layer (#11291)
For objects with `N` prefix depth, this PR reduces `N` such network
operations by converting `CheckFile` into a single bulk operation.

Reduction in chattiness here would allow disks to be utilized more
cleanly, while maintaining the same functionality along with one
extra volume check stat() call is removed.

Update tests to test multiple sets scenario
2021-01-18 12:25:22 -08:00
Harshavardhana 4315f93421
fix: make sure parentDirIsObject is used at set level (#11280)
parentDirIsObject is not using set level understanding
to check for parent objects, without this it can lead to
objects that can actually reside on a separate set as
objects and would conflict.
2021-01-17 01:11:48 -08:00
Harshavardhana f903cae6ff
Support variable server pools (#11256)
Current implementation requires server pools to have
same erasure stripe sizes, to facilitate same SLA
and expectations.

This PR allows server pools to be variadic, i.e they
do not have to be same erasure stripe sizes - instead
they should have SLA for parity ratio.

If the parity ratio cannot be guaranteed by the new
server pool, the deployment is rejected i.e server
pool expansion is not allowed.
2021-01-16 12:08:02 -08:00
Harshavardhana e7ae49f9c9
fix: calculate prometheus disks_offline/disks_total correctly (#11215)
fixes #11196
2021-01-04 09:42:09 -08:00
Anis Elleuch 2ecaab55a6
admin: ServerInfo returns info without object layer initialized (#11142) 2020-12-21 09:35:19 -08:00
Harshavardhana f714840da7
add _MINIO_SERVER_DEBUG env for enabling debug messages (#11128) 2020-12-17 16:52:47 -08:00
Harshavardhana 7c9ef76f66
fix: timer deadlock on expired timers (#11124)
issue was introduced in #11106 the following
pattern

<-t.C // timer fired

if !t.Stop() {
   <-t.C // timer hangs
}

Seems to hang at the last `t.C` line, this
issue happens because a fired timer cannot be
Stopped() anymore and t.Stop() returns `false`
leading to confusing state of usage.

Refactor the code such that use timers appropriately
with exact requirements in place.
2020-12-17 12:35:02 -08:00
Harshavardhana b390a2a0b9
fix: reuser timers in erasure set hotpaths (#11106)
reuser timers in

 - connectDisks() monitoring
 - healMRFRoutine() channel timeouts
2020-12-16 14:33:05 -08:00
Harshavardhana c606c76323
fix: prioritized latest buckets for crawler to finish the scans faster (#11115)
crawler should only ListBuckets once not for each serverPool,
buckets are same across all pools, across sets and ListBuckets
always returns an unified view, once list buckets returns
sort it by create time to scan the latest buckets earlier
with the assumption that latest buckets would have lesser
content than older buckets allowing them to be scanned faster
and also to be able to provide more closer to latest view.
2020-12-15 17:34:54 -08:00
Harshavardhana 8368ab76aa
fix: remove the requirement for healing buckets in ListBucketsHeal (#11098)
With new refactor of bucket healing, healing bucket happens
automatically including its metadata, there is no need to
redundant heal buckets also in ListBucketsHeal remove
it.
2020-12-14 12:07:07 -08:00
Harshavardhana 2eb52ca5f4
fix: heal bucket metadata right before healing bucket (#11097)
optimization mainly to avoid listing the entire
`.minio.sys/buckets/.minio.sys` directory, this
can get really huge and comes in the way of startup
routines, contents inside `.minio.sys/buckets/.minio.sys`
are rather transient and not necessary to be healed.
2020-12-13 11:57:08 -08:00
Harshavardhana 4550ac6fff
fix: refactor locks to apply them uniquely per node (#11052)
This refactor is done for few reasons below

- to avoid deadlocks in scenarios when number
  of nodes are smaller < actual erasure stripe
  count where in N participating local lockers
  can lead to deadlocks across systems.

- avoids expiry routines to run 1000 of separate
  network operations and routes per disk where
  as each of them are still accessing one single
  local entity.

- it is ideal to have since globalLockServer
  per instance.

- In a 32node deployment however, each server
  group is still concentrated towards the
  same set of lockers that partipicate during
  the write/read phase, unlike previous minio/dsync
  implementation - this potentially avoids send
  32 requests instead we will still send at max
  requests of unique nodes participating in a
  write/read phase.

- reduces overall chattiness on smaller setups.
2020-12-10 07:28:37 -08:00
Harshavardhana ce93b2681b
fix: re-use er.getDisks() properly in certain calls (#11043) 2020-12-07 10:04:07 -08:00
Harshavardhana 9c53cc1b83
fix: heal multiple buckets in bulk (#11029)
makes server startup, orders of magnitude
faster with large number of buckets
2020-12-05 13:00:44 -08:00
Harshavardhana 4ec45753e6 rename server sets to server pools 2020-12-01 13:50:33 -08:00
Harshavardhana bdd094bc39
fix: avoid sending errors on missing objects on locked buckets (#10994)
make sure multi-object delete returned errors that are AWS S3 compatible
2020-11-28 21:15:45 -08:00
Poorna Krishnamoorthy 251c1ef6da Add support for replication of object tags, retention metadata (#10880) 2020-11-19 18:56:09 -08:00
Harshavardhana 1a1f00fa15
fix: use internode data for DisksInfo, VolsInfo in message pack (#10821)
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.

Additionally, remove all unused walkers from the storage layer
2020-11-04 10:10:54 -08:00
Klaus Post 2294e53a0b
Don't retain context in locker (#10515)
Use the context for internal timeouts, but disconnect it from outgoing 
calls so we always receive the results and cancel it remotely.
2020-11-04 08:25:42 -08:00
Harshavardhana 4ea31da889
fix: move list quorum ENV to config (#10804) 2020-11-02 17:21:56 -08:00
Harshavardhana 5412d730c1
simplify monitoring doesn't need to be canceled (#10803)
connect disks monitoring doesn't need to be canceled
upon drive replacement, since we only need to replace
the newly replaced drive.
2020-10-31 14:10:12 -07:00
Harshavardhana b686bb9c83
fix: replaced drive properly by healing the entire drive (#10799)
Bonus fixes, we do not need reload format anymore
as the replaced drive is healed locally we only need
to ensure that drive heal reloads the drive properly.

We preserve the UUID of the original order, this means
that the replacement in `format.json` doesn't mean that
the drive needs to be reloaded into memory anymore.

fixes #10791
2020-10-31 01:34:48 -07:00
Harshavardhana 5b30bbda92
fix: add more protection distribution to match EcIndex (#10772)
allows for more stricter validation in picking up the right
set of disks for reconstruction.
2020-10-28 00:09:15 -07:00
Harshavardhana 029758cb20
fix: retain the previous UUID for newly replaced drives (#10759)
only newly replaced drives get the new `format.json`,
this avoids disks reloading their in-memory reference
format, ensures that drives are online without
reloading the in-memory reference format.

keeping reference format in-tact means UUIDs
never change once they are formatted.
2020-10-26 10:29:29 -07:00
Harshavardhana 6a8c62f9fd
make sure to preserve UUID from reference format (#10748)
reference format should be source of truth
for inconsistent drives which reconnect,
add them back to their original position

remove automatic fix for existing offline
disk uuids
2020-10-24 13:23:08 -07:00
Harshavardhana 734f258878
fix: slow down auto healing more aggressively (#10730)
Bonus fixes

- logging improvements to ensure that we don't use
  `go logger.LogIf` to avoid runtime.Caller missing
  the function name. log where necessary.
- remove unused code at erasure sets
2020-10-22 13:36:24 -07:00
Harshavardhana ad726b49b4
rename zones to serverSets to avoid terminology conflict (#10679)
we are bringing in availability zones, we should avoid
zones as per server expansion concept.
2020-10-15 14:28:50 -07:00
Harshavardhana f1cc16e788
fix: background heal rely on getOnlineDisks() (#10687) 2020-10-15 13:06:23 -07:00
Harshavardhana 71b97fd3ac
fix: connect disks pre-emptively during startup (#10669)
connect disks pre-emptively upon startup, to ensure we have
enough disks are connected at startup rather than wait
for them.

we need to do this to avoid long wait times for server to
be online when we have servers come up in rolling upgrade
fashion
2020-10-13 18:28:42 -07:00
Harshavardhana 2760fc86af
Bump default idleConnsPerHost to control conns in time_wait (#10653)
This PR fixes a hang which occurs quite commonly at higher concurrency
by allowing following changes

- allowing lower connections in time_wait allows faster socket open's
- lower idle connection timeout to ensure that we let kernel
  reclaim the time_wait connections quickly
- increase somaxconn to 4096 instead of 2048 to allow larger tcp
  syn backlogs.

fixes #10413
2020-10-12 14:19:46 -07:00
Harshavardhana 6484453fc6
optionally allow strict quorum listing (#10649)
```
export MINIO_API_LIST_STRICT_QUORUM=on
```

would enable listing in quorum if necessary
2020-10-09 15:40:46 -07:00
Harshavardhana 736e58dd68
fix: handle concurrent lockers with multiple optimizations (#10640)
- select lockers which are non-local and online to have
  affinity towards remote servers for lock contention

- optimize lock retry interval to avoid sending too many
  messages during lock contention, reduces average CPU
  usage as well

- if bucket is not set, when deleteObject fails make sure
  setPutObjHeaders() honors lifecycle only if bucket name
  is set.

- fix top locks to list out always the oldest lockers always,
  avoid getting bogged down into map's unordered nature.
2020-10-08 12:32:32 -07:00
Harshavardhana 66174692a2
add '.healing.bin' for tracking currently healing disk (#10573)
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.

There are more pending changes where we should move
all the disk formatting logic to backend drives, this
PR doesn't deal with this refactor instead makes it
easier to track healing in the future.
2020-09-28 19:39:32 -07:00
Harshavardhana eafa775952
fix: add lock ownership to expire locks (#10571)
- Add owner information for expiry, locking, unlocking a resource
- TopLocks returns now locks in quorum by default, provides
  a way to capture stale locks as well with `?stale=true`
- Simplify the quorum handling for locks to avoid from storage
  class, because there were challenges to make it consistent
  across all situations.
- And other tiny simplifications to reset locks.
2020-09-25 19:21:52 -07:00
Harshavardhana ca989eb0b3
avoid ListBuckets returning quorum errors when node is down (#10555)
Also, revamp the way ListBuckets work make few portions
of the healing logic parallel

- walk objects for healing disks in parallel
- collect the list of buckets in parallel across drives
- provide consistent view for listBuckets()
2020-09-24 09:53:38 -07:00
Harshavardhana e60834838f
fix: background disk heal, to reload format consistently (#10502)
It was observed in VMware vsphere environment during a
pod replacement, `mc admin info` might report incorrect
offline nodes for the replaced drive. This issue eventually
goes away but requires quite a lot of time for all servers
to be in sync.

This PR fixes this behavior properly.
2020-09-16 21:14:35 -07:00
Harshavardhana 0104af6bcc
delayed locks until we have started reading the body (#10474)
This is to ensure that Go contexts work properly, after some
interesting experiments I found that Go net/http doesn't
cancel the context when Body is non-zero and hasn't been
read till EOF.

The following gist explains this, this can lead to pile up
of go-routines on the server which will never be canceled
and will die at a really later point in time, which can
simply overwhelm the server.

https://gist.github.com/harshavardhana/c51dcfd055780eaeb71db54f9c589150

To avoid this refactor the locking such that we take locks after we
have started reading from the body and only take locks when needed.

Also, remove contextReader as it's not useful, doesn't work as expected
context is not canceled until the body reaches EOF so there is no point
in wrapping it with context and putting a `select {` on it which
can unnecessarily increase the CPU overhead.

We will still use the context to cancel the lockers etc.
Additional simplification in the locker code to avoid timers
as re-using them is a complicated ordeal avoid them in
the hot path, since locking is very common this may avoid
lots of allocations.
2020-09-14 15:57:13 -07:00
Klaus Post 493c714663
Remove erasureSets and erasureObjects from ObjectLayer (#10442) 2020-09-10 09:18:19 -07:00
Harshavardhana 6a0372be6c
cleanup tmpDir any older entries automatically just like multipart (#10439)
also consider multipart uploads, temporary files in `.minio.sys/tmp`
as stale beyond 24hrs and clean them up automatically
2020-09-08 15:55:40 -07:00
Harshavardhana b0e1d4ce78
re-attach offline drive after new drive replacement (#10416)
inconsistent drive healing when one of the drive is offline
while a new drive was replaced, this change is to ensure
that we can add the offline drive back into the mix by
healing it again.
2020-09-04 17:09:02 -07:00
Harshavardhana eb19c8af40
Bump response header timeout for proxying list request (#10420) 2020-09-04 16:07:40 -07:00
Klaus Post 2d58a8d861
Add storage layer contexts (#10321)
Add context to all (non-trivial) calls to the storage layer. 

Contexts are propagated through the REST client.

- `context.TODO()` is left in place for the places where it needs to be added to the caller.
- `endWalkCh` could probably be removed from the walkers, but no changes so far.

The "dangerous" part is that now a caller disconnecting *will* propagate down,  so a 
"delete" operation will now be interrupted. In some cases we might want to disconnect 
this functionality so the operation completes if it has started, leaving the system in a cleaner state.
2020-09-04 09:45:06 -07:00
Harshavardhana a359e36e35
tolerate listing with only readQuorum disks (#10357)
We can reduce this further in the future, but this is a good
value to keep around. With the advent of continuous healing,
we can be assured that namespace will eventually be
consistent so we are okay to avoid the necessity to
a list across all drives on all sets.

Bonus Pop()'s in parallel seem to have the potential to
wait too on large drive setups and cause more slowness
instead of gaining any performance remove it for now.

Also, implement load balanced reply for local disks,
ensuring that local disks have an affinity for

- cleanupStaleMultipartUploads()
2020-08-26 19:29:35 -07:00
Harshavardhana d19b434ffc
fix: bring back delayed leaf detection in listing (#10346) 2020-08-25 12:26:48 -07:00
Klaus Post c097ce9c32
continous healing based on crawler (#10103)
Design: https://gist.github.com/klauspost/792fe25c315caf1dd15c8e79df124914
2020-08-24 13:47:01 -07:00
Harshavardhana 95411228db
add missing cleanupStaleMultipartUploads (#10325)
fixes #10319
2020-08-21 21:39:54 -07:00
Harshavardhana 74116204ce
handle fresh setup with mixed drives (#10273)
fresh drive setups when one of the drive is
a root drive, we should ignore such a root
drive and not proceed to format.

This PR handles this properly by marking
the disks which are root disk and they are
taken offline.
2020-08-18 14:37:26 -07:00