minio

Commit Graph

Author	SHA1	Message	Date
yinhen	d300e775a6	Avoid reconnect of disk during startup sequence (#14070 )	2022-01-10 23:33:58 -08:00
Harshavardhana	76b21de0c6	feat: decommission feature for pools (#14012 ) ``` λ mc admin decommission start alias/ http://minio{1...2}/data{1...4} ``` ``` λ mc admin decommission status alias/ ┌─────┬─────────────────────────────────┬──────────────────────────────────┬────────┐ │ ID │ Pools │ Capacity │ Status │ │ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Active │ │ 2nd │ http://minio{3...4}/data{1...4} │ 329 GiB (used) / 421 GiB (total) │ Active │ └─────┴─────────────────────────────────┴──────────────────────────────────┴────────┘ ``` ``` λ mc admin decommission status alias/ http://minio{1...2}/data{1...4} Progress: ===================> [1GiB/sec] [15%] [4TiB/50TiB] Time Remaining: 4 hours (started 3 hours ago) ``` ``` λ mc admin decommission status alias/ http://minio{1...2}/data{1...4} ERROR: This pool is not scheduled for decommissioning currently. ``` ``` λ mc admin decommission cancel alias/ ┌─────┬─────────────────────────────────┬──────────────────────────────────┬──────────┐ │ ID │ Pools │ Capacity │ Status │ │ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Draining │ └─────┴─────────────────────────────────┴──────────────────────────────────┴──────────┘ ``` > NOTE: Canceled decommission will not make the pool active again, since we might have > Potentially partial duplicate content on the other pools, to avoid this scenario be > very sure to start decommissioning as a planned activity. ``` λ mc admin decommission cancel alias/ http://minio{1...2}/data{1...4} ┌─────┬─────────────────────────────────┬──────────────────────────────────┬────────────────────┐ │ ID │ Pools │ Capacity │ Status │ │ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Draining(Canceled) │ └─────┴─────────────────────────────────┴──────────────────────────────────┴────────────────────┘ ```	2022-01-10 09:07:49 -08:00
Klaus Post	0e31cff762	fix: DeleteMultipleObjects to finish even if cancelled + concurrent sets (#14038 ) * Process sets concurrently. * Disconnect context from request. * Insert context cancellation checks. * errFileNotFound and errFileVersionNotFound are ok, unless creating delete markers.	2022-01-06 10:47:49 -08:00
Harshavardhana	f527c708f2	run gofumpt cleanup across code-base (#14015 )	2022-01-02 09:15:06 -08:00
Harshavardhana	4545ecad58	ignore swapped drives instead of throwing errors (#13655 ) - add checks such that swapped disks are detected and ignored - never used for normal operations. - implement `unrecognizedDisk` to be ignored with all operations returning `errDiskNotFound`. - also add checks such that we do not load unexpected disks while connecting automatically. - additionally humanize the values when printing the errors. Bonus: fixes handling of non-quorum situations in getLatestFileInfo(), that does not work when 2 drives are down, currently this function would return errors incorrectly.	2021-11-15 09:46:55 -08:00
Harshavardhana	8bb52c9c2a	fix: ignore disks that are available but not writable (#13585 ) This is to allow replacing drives while some drives while available are not writable.	2021-11-04 16:42:49 -07:00
Klaus Post	d9c1d79e30	Protect logger targets (#13529 ) Logger targets were not race protected against concurrent updates from for example `HTTPConsoleLoggerSys`. Restrict direct access to targets and make slices immutable so a returned slice can be processed safely without locks.	2021-10-28 07:35:28 -07:00
Harshavardhana	0c48b1d993	fix: benchmarking test initialization > go test -run=none -bench=Benchmark github.com/minio/minio/cmd Runs now without any crashes. fixes #13380	2021-10-08 11:38:30 -07:00
Klaus Post	421160631a	MakeBucket: Delete leftover buckets on error (#13368 ) In (erasureServerPools).MakeBucketWithLocation deletes the created buckets if any set returns an error. Add `NoRecreate` option, which will not recreate the bucket in `DeleteBucket`, if the operation fails. Additionally use context.Background() for operations we always want to be performed.	2021-10-06 10:24:40 -07:00
Harshavardhana	fabf60bc4c	fix: allow configuring cleanup of stale multipart uploads (#13354) Allow dynamically changing the cleanup of stale multipart uploads, their expiry, and how frequently it is checked. Improves #13270	2021-10-04 10:52:28 -07:00
Anis Elleuch	1d9e91e00f	Fix wrong reporting of total disks after restart (#13326 ) A restart of the cluster and a failed disk will wrongly count the number of total disks.	2021-09-29 11:36:19 -07:00
Anis Elleuch	68a2d6fc40	xl: Avoid empty endpoints (#13299 ) An endpoint can be empty when a disk is offline or something wrong with it. Avoid it by filling erasureSets.endpointStrings with values from arguments.	2021-09-25 10:51:03 -07:00
Harshavardhana	1a884cd8e1	fix: deleting objects was not working after upgrades (#13242) DeleteObject() on objects created before the `xl.json` to `xl.meta` change was not working; it is not clear when this regression was added. This PR fixes this properly. Also this PR ensures that we perform the rename of xl.json to xl.meta only during the "write" phase of the call, i.e. either during Healing or PutObject() overwrites. Also handles a few other scenarios during migration: when `backendEncryptedFile` was missing, deleteConfig() would fail with `configNotFound`; this case was not ignored, which can lead to failure during upgrades.	2021-09-17 19:34:48 -07:00
Harshavardhana	6d42569ade	remove ListBucketsMetadata instead add them to AccountInfo() (#13241 )	2021-09-17 15:02:21 -07:00
Harshavardhana	45bcf73185	feat: Add ListBucketsWithMetadata extension API (#13219 )	2021-09-16 09:52:41 -07:00
Harshavardhana	787a72a993	make sure to ignore the rootDisk when healing drives (#13209 ) fixes #13208	2021-09-14 15:10:00 -07:00
Harshavardhana	035882d292	fix: remove parentIsObject() check (#12851 ) we will allow situations such as ``` a/b/1.txt a/b ``` and ``` a/b a/b/1.txt ``` we are going to document that this usecase is not supported and we will never support it, if any application does this users have to delete the top level parent to make sure namespace is accessible at lower level. rest of the situations where the prefixes get created across sets are supported as is.	2021-08-03 13:26:57 -07:00
Anis Elleuch	b0b4696a64	heal: Add MRF metrics to background heal API response (#12398 ) This commit gathers MRF metrics from all nodes in a cluster and return it to the caller. This will show information about the number of objects in the MRF queues waiting to be healed.	2021-07-15 22:32:06 -07:00
Harshavardhana	4669d19f2a	fix: simplify diskMap usage to keep certain checks predictable (#12519 ) Bonus: also make sure that we Sanitize() the drives only during startup of the server, but not during disk reconnects.	2021-06-16 14:26:26 -07:00
Anis Elleuch	7722b91e1d	s3: Force a prefix removal using a special header (#12504 ) An S3 client can send `x-minio-force-delete: true` to remove a prefix.	2021-06-15 18:43:14 -07:00
Harshavardhana	a93aa2eac1	fix: upon failure attempt an undo for all calls in DeleteBucket() (#12480) It is possible that a version exists on the second pool while deleteBucket() has already deleted the bucket on pool1 successfully, since pool1 doesn't have any objects; undo such operations properly in every error scenario. Also delete bucket metadata from the pool layer rather than the sets layer.	2021-06-09 17:13:00 -07:00
Anis Elleuch	8e9e028c0c	fix: safe update of the audit objectErasureMap (#12477) objectErasureMap in the audit holds information about the objects involved in the current S3 operation, such as pool index, set index, and disk endpoints. One user saw a crash due to a concurrent update of objectErasureMap information. Use sync.Map to prevent a crash.	2021-06-09 10:51:19 -07:00
Harshavardhana	542fe4ea2e	fix: legacy objects with 10MiB blockSize should use right buffers (#12459 ) healing code was using incorrect buffers to heal older objects with 10MiB erasure blockSize, incorrect calculation of such buffers can lead to incorrect premature closure of io.Pipe() during healing. fixes #12410	2021-06-07 10:06:06 -07:00
Harshavardhana	1f262daf6f	rename all remaining packages to internal/ (#12418 ) This is to ensure that there are no projects that try to import `minio/minio/pkg` into their own repo. Any such common packages should go to `https://github.com/minio/pkg`	2021-06-01 14:59:40 -07:00
Harshavardhana	81d5688d56	move the dependency to minio/pkg for common libraries (#12397 )	2021-05-28 15:17:01 -07:00
Anis Elleuch	56d4d7b8b1	MRF: Better detection of non stable disks (#12252 ) MRF does not detect when a node is disconnected and reconnected quickly this change will ensure that MRF is alerted by comparing the last disk reconnection timestamp with the last MRF check time. Signed-off-by: Anis Elleuch <anis@min.io> Co-authored-by: Klaus Post <klauspost@gmail.com>	2021-05-11 09:19:15 -07:00
Harshavardhana	1aa5858543	move madmin to github.com/minio/madmin-go (#12239 )	2021-05-06 08:52:02 -07:00
Krishnan Parthasarathi	c829e3a13b	Support for remote tier management (#12090 ) With this change, MinIO's ILM supports transitioning objects to a remote tier. This change includes support for Azure Blob Storage, AWS S3 compatible object storage incl. MinIO and Google Cloud Storage as remote tier storage backends. Some new additions include: - Admin APIs remote tier configuration management - Simple journal to track remote objects to be 'collected' This is used by object API handlers which 'mutate' object versions by overwriting/replacing content (Put/CopyObject) or removing the version itself (e.g DeleteObjectVersion). - Rework of previous ILM transition to fit the new model In the new model, a storage class (a.k.a remote tier) is defined by the 'remote' object storage type (one of s3, azure, GCS), bucket name and a prefix. * Fixed bugs, review comments, and more unit-tests - Leverage inline small object feature - Migrate legacy objects to the latest object format before transitioning - Fix restore to particular version if specified - Extend SharedDataDirCount to handle transitioned and restored objects - Restore-object should accept version-id for version-suspended bucket (#12091) - Check if remote tier creds have sufficient permissions - Bonus minor fixes to existing error messages Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io> Co-authored-by: Krishna Srinivas <krishna@minio.io> Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Harshavardhana	069432566f	update license change for MinIO Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Harshavardhana	e85b28398b	fix: pre-allocate certain slices with expected capacity (#12044 ) Avoids append() based tiny allocations on known allocated slices repeated access.	2021-04-12 13:45:06 -07:00
Klaus Post	111c02770e	Fix data race when connecting disks (#11983 ) Multiple disks from the same set would be writing concurrently. ``` WARNING: DATA RACE Write at 0x00c002100ce0 by goroutine 166: github.com/minio/minio/cmd.(erasureSets).connectDisks.func1() d:/minio/minio/cmd/erasure-sets.go:254 +0x82f Previous write at 0x00c002100ce0 by goroutine 129: github.com/minio/minio/cmd.(erasureSets).connectDisks.func1() d:/minio/minio/cmd/erasure-sets.go:254 +0x82f Goroutine 166 (running) created at: github.com/minio/minio/cmd.(erasureSets).connectDisks() d:/minio/minio/cmd/erasure-sets.go:210 +0x324 github.com/minio/minio/cmd.(erasureSets).monitorAndConnectEndpoints() d:/minio/minio/cmd/erasure-sets.go:288 +0x244 Goroutine 129 (finished) created at: github.com/minio/minio/cmd.(erasureSets).connectDisks() d:/minio/minio/cmd/erasure-sets.go:210 +0x324 github.com/minio/minio/cmd.(erasureSets).monitorAndConnectEndpoints() d:/minio/minio/cmd/erasure-sets.go:288 +0x244 ```	2021-04-06 11:33:10 -07:00
Harshavardhana	d46386246f	api: Introduce metadata update APIs to update only metadata (#11962) Current implementation heavily relies on readAllFileInfo, but with the advent of xl.meta inlined with data, we cannot easily avoid reading data when we are only interested in updating metadata; this invariably leads to write amplification during metadata updates, repeatedly reading data when we are only interested in updating metadata. This PR ensures that we implement a metadata-only update API at the storage layer that handles updates to metadata alone for any given version, given the version is valid and present. This helps reduce the chattiness for the following calls: - PutObjectTags - DeleteObjectTags - PutObjectLegalHold - PutObjectRetention - ReplicateObject (updates metadata on replication status)	2021-04-04 13:32:31 -07:00
Anis Elleuch	2c296652f7	Simplify access to local node name (#11907 ) The local node name is heavily used in tracing, create a new global variable to store it. Multiple goroutines can access it since it won't be changed later.	2021-03-26 11:37:58 -07:00
Harshavardhana	90d8ec6310	fix: reject duplicate keys in PostPolicyJSON document (#11902 ) fixes #11894	2021-03-25 13:57:57 -07:00
Anis Elleuch	14d89eaae4	mrf: Enhance behavior for better results (#11788 ) MRF was starting to heal when it receives a disk connection event, which is not good when a node having multiple disks reconnects to the cluster. Besides, MRF needs Remove healing option to remove stale files.	2021-03-18 11:19:02 -07:00
Harshavardhana	add3cd4e44	allow configuring delete cleanup interval from default 10minutes (#11818 )	2021-03-17 15:15:58 -07:00
Anis Elleuch	57f3ed22d4	erasure: Reduce the interval of cleaning up .trash folder (#11741 ) Reduce from 30 to 10 minutes.	2021-03-09 09:45:38 -08:00
Harshavardhana	9ccc483df6	[feat]: change erasure coding default block size from 10MiB to 1MiB (#11721 ) major performance improvements in range GETs to avoid large read amplification when ranges are tiny and random ``` ------------------- Operation: GET Operations: 142014 -> 339421 Duration: 4m50s -> 4m56s * Average: +139.41% (+1177.3 MiB/s) throughput, +139.11% (+658.4) obj/s * Fastest: +125.24% (+1207.4 MiB/s) throughput, +132.32% (+612.9) obj/s * 50% Median: +139.06% (+1175.7 MiB/s) throughput, +133.46% (+660.9) obj/s * Slowest: +203.40% (+1267.9 MiB/s) throughput, +198.59% (+753.5) obj/s ``` TTFB from 10MiB BlockSize ``` * First Access TTFB: Avg: 81ms, Median: 61ms, Best: 20ms, Worst: 2.056s ``` TTFB from 1MiB BlockSize ``` * First Access TTFB: Avg: 22ms, Median: 21ms, Best: 8ms, Worst: 91ms ``` Full object reads however do see a slight change which won't be noticeable in real world, so not doing any comparisons TTFB still had improvements with full object reads with 1MiB ``` * First Access TTFB: Avg: 68ms, Median: 35ms, Best: 11ms, Worst: 1.16s ``` v/s TTFB with 10MiB ``` * First Access TTFB: Avg: 388ms, Median: 98ms, Best: 20ms, Worst: 4.156s ``` This change should affect all new uploads, previous uploads should continue to work with business as usual. But dramatic improvements can be seen with these changes.	2021-03-06 14:09:34 -08:00
Harshavardhana	d971061305	use listPathRaw for HealObjects() instead of expensive WalkVersions() (#11675 )	2021-03-06 09:25:48 -08:00
Klaus Post	fa9cf1251b	Improve healing and reporting (#11312) * Provide information on actively healing, buckets healed/queued, objects healed/failed. * Add concurrent healing of multiple sets (typically on startup). * Add bucket level resume, so restarts will only heal non-healed buckets. * Print summary after healing a disk is done.	2021-03-04 14:36:23 -08:00
Harshavardhana	c6a120df0e	fix: Prometheus metrics to re-use storage disks (#11647 ) also re-use storage disks for all `mc admin server info` calls as well, implement a new LocalStorageInfo() API call at ObjectLayer to lookup local disks storageInfo also fixes bugs where there were double calls to StorageInfo()	2021-03-02 17:28:04 -08:00
Harshavardhana	b690304eed	use faster way for siphash (#11640 )	2021-02-26 16:53:06 -08:00
Harshavardhana	6386b45c08	[feat] use rename instead of recursive deletes (#11641) Most of the delete calls today spend time in a blocking operation where multiple calls need to be recursively sent to delete the objects. Instead, we can use a rename operation to atomically move the objects from the namespace to `tmp/.trash`; we can schedule deletion of objects at this location once in 15 or 30 minutes and also add wait times between each delete operation. This makes deletes faster as well as less chatty on the drives; each server locally runs a goroutine which cleans this up regularly.	2021-02-26 09:52:27 -08:00
Andreas Auernhammer	1f659204a2	remove GetObject from ObjectLayer interface (#11635) This commit removes the `GetObject` method from the `ObjectLayer` interface. The `GetObject` method is no longer used by the HTTP handlers implementing the high-level S3 semantics. Instead, they use the `GetObjectNInfo` method, which returns both an object handle and the object metadata. Therefore, it is no longer necessary that a concrete `ObjectLayer` implements `GetObject`.	2021-02-26 09:52:02 -08:00
Harshavardhana	a8e4f64ff3	Revert "fix: remove persistence layer for metacache store in memory (#11538 )" This reverts commit `b23659927c`.	2021-02-24 22:24:51 -08:00
Harshavardhana	b23659927c	fix: remove persistence layer for metacache store in memory (#11538 ) store the cache in-memory instead of disks to avoid large write amplifications for list heavy workloads, store in memory instead and let it auto expire.	2021-02-24 15:51:41 -08:00
Harshavardhana	aa7244a9a4	fix: make sure to convert the error properly in HealBucket() (#11610 ) server startup code expects the object layer to properly convert error into a proper type, so that in situations when servers are coming up and quorum is not available servers wait on each other.	2021-02-23 09:23:11 -08:00
Klaus Post	8a6b13c239	Avoid synchronizing usage writes (#11560) If the periodic `case <-t.C:` save gets held up for a long time, it will end up synchronizing all disk writes for saving the caches. We add jitter to per-set writes so they don't sync up, and don't hold a lock for the write, since it isn't needed anyway. If an outage prevents writes for a long while, we also add individual waits for each disk in case there was a queue. Furthermore, limit the number of buffers kept to 2GiB, since this could get huge in large clusters. This will not act as a hard limit but should be enough for normal operation.	2021-02-18 00:38:37 -08:00
Krishnan Parthasarathi	b87fae0049	Simplify PutObjReader for plain-text reader usage (#11470 ) This change moves away from a unified constructor for plaintext and encrypted usage. NewPutObjReader is simplified for the plain-text reader use. For encrypted reader use, WithEncryption should be called on an initialized PutObjReader. Plaintext: func NewPutObjReader(rawReader hash.Reader) PutObjReader The hash.Reader is used to provide payload size and md5sum to the downstream consumers. This is different from the previous version in that there is no need to pass nil values for unused parameters. Encrypted: func WithEncryption(encReader hash.Reader, key crypto.ObjectKey) (*PutObjReader, error) This method sets up encrypted reader along with the key to seal the md5sum produced by the plain-text reader (already setup when NewPutObjReader was called). Usage: ``` pReader := NewPutObjReader(rawReader) // ... other object handler code goes here // Prepare the encrypted hashed reader pReader, err = pReader.WithEncryption(encReader, objEncKey) ```	2021-02-10 08:52:50 -08:00
Harshavardhana	88c1bb0720	fix: improper ticker usage in goroutines (#11468 ) - lock maintenance loop was incorrectly sleeping as well as using ticker badly, leading to extra expiration routines getting triggered that could flood the network. - multipart upload cleanup should be based on timer instead of ticker, to ensure that long running jobs don't get triggered twice. - make sure to get right lockers for object name	2021-02-05 19:23:48 -08:00
Anis Elleuch	e96fdcd5ec	tagging: Add event notif for PUT object tagging (#11366 ) An optimization to avoid double calling for during PutObject tagging	2021-02-01 13:52:51 -08:00
Harshavardhana	1e53bf2789	fix: allow expansion with newer constraints for older setups (#11372 ) currently we had a restriction where older setups would need to follow previous style of "stripe" count being same expansion, we can relax that instead newer pools can be expanded for older setups with newer constraints of common parity ratio.	2021-01-29 11:40:55 -08:00
Harshavardhana	6717295e18	fix: rename audit log docs and datastructure	2021-01-26 13:39:55 -08:00
Anis Elleuch	00cff1aac5	audit: per object send pool number, set number and servers per operation (#11233 )	2021-01-26 13:21:51 -08:00
Harshavardhana	9cdd981ce7	fix: expire locks only on participating lockers (#11335 ) additionally also add a new ForceUnlock API, to allow forcibly unlocking locks if possible.	2021-01-25 10:01:27 -08:00
Klaus Post	dac19d7272	Clarify root disk error (#11314 ) Make it clearer what the problem is and how to resolve it.	2021-01-20 13:11:42 -08:00
Harshavardhana	3ca6330661	fix: optimize parentDirIsObject by moving isObject to storage layer (#11291 ) For objects with `N` prefix depth, this PR reduces `N` such network operations by converting `CheckFile` into a single bulk operation. Reduction in chattiness here would allow disks to be utilized more cleanly, while maintaining the same functionality along with one extra volume check stat() call is removed. Update tests to test multiple sets scenario	2021-01-18 12:25:22 -08:00
Harshavardhana	4315f93421	fix: make sure parentDirIsObject is used at set level (#11280 ) parentDirIsObject is not using set level understanding to check for parent objects, without this it can lead to objects that can actually reside on a separate set as objects and would conflict.	2021-01-17 01:11:48 -08:00
Harshavardhana	f903cae6ff	Support variable server pools (#11256 ) Current implementation requires server pools to have same erasure stripe sizes, to facilitate same SLA and expectations. This PR allows server pools to be variadic, i.e they do not have to be same erasure stripe sizes - instead they should have SLA for parity ratio. If the parity ratio cannot be guaranteed by the new server pool, the deployment is rejected i.e server pool expansion is not allowed.	2021-01-16 12:08:02 -08:00
Harshavardhana	e7ae49f9c9	fix: calculate prometheus disks_offline/disks_total correctly (#11215 ) fixes #11196	2021-01-04 09:42:09 -08:00
Anis Elleuch	2ecaab55a6	admin: ServerInfo returns info without object layer initialized (#11142 )	2020-12-21 09:35:19 -08:00
Harshavardhana	f714840da7	add _MINIO_SERVER_DEBUG env for enabling debug messages (#11128 )	2020-12-17 16:52:47 -08:00
Harshavardhana	7c9ef76f66	fix: timer deadlock on expired timers (#11124) The issue was introduced in #11106. The following pattern `<-t.C // timer fired` `if !t.Stop() { <-t.C // timer hangs }` hangs at the last `t.C` line; this happens because a fired timer cannot be Stopped() anymore and t.Stop() returns `false`, leading to a confusing state of usage. Refactor the code so that timers are used appropriately with exact requirements in place.	2020-12-17 12:35:02 -08:00
Harshavardhana	b390a2a0b9	fix: reuse timers in erasure set hotpaths (#11106) reuse timers in - connectDisks() monitoring - healMRFRoutine() channel timeouts	2020-12-16 14:33:05 -08:00
Harshavardhana	c606c76323	fix: prioritized latest buckets for crawler to finish the scans faster (#11115 ) crawler should only ListBuckets once not for each serverPool, buckets are same across all pools, across sets and ListBuckets always returns an unified view, once list buckets returns sort it by create time to scan the latest buckets earlier with the assumption that latest buckets would have lesser content than older buckets allowing them to be scanned faster and also to be able to provide more closer to latest view.	2020-12-15 17:34:54 -08:00
Harshavardhana	8368ab76aa	fix: remove the requirement for healing buckets in ListBucketsHeal (#11098) With the new refactor of bucket healing, healing a bucket happens automatically, including its metadata; there is no need to redundantly heal buckets in ListBucketsHeal as well, so remove it.	2020-12-14 12:07:07 -08:00
Harshavardhana	2eb52ca5f4	fix: heal bucket metadata right before healing bucket (#11097 ) optimization mainly to avoid listing the entire `.minio.sys/buckets/.minio.sys` directory, this can get really huge and comes in the way of startup routines, contents inside `.minio.sys/buckets/.minio.sys` are rather transient and not necessary to be healed.	2020-12-13 11:57:08 -08:00
Harshavardhana	4550ac6fff	fix: refactor locks to apply them uniquely per node (#11052) This refactor is done for the few reasons below - to avoid deadlocks in scenarios where the number of nodes is smaller than the actual erasure stripe count, wherein N participating local lockers can lead to deadlocks across systems. - avoids expiry routines running 1000s of separate network operations and routes per disk, whereas each of them is still accessing one single local entity. - it is ideal to have a single globalLockServer per instance. - In a 32-node deployment however, each server group is still concentrated towards the same set of lockers that participate during the write/read phase, unlike the previous minio/dsync implementation - this potentially avoids sending 32 requests; instead we will send at most as many requests as there are unique nodes participating in a write/read phase. - reduces overall chattiness on smaller setups.	2020-12-10 07:28:37 -08:00
Harshavardhana	ce93b2681b	fix: re-use er.getDisks() properly in certain calls (#11043 )	2020-12-07 10:04:07 -08:00
Harshavardhana	9c53cc1b83	fix: heal multiple buckets in bulk (#11029 ) makes server startup, orders of magnitude faster with large number of buckets	2020-12-05 13:00:44 -08:00
Harshavardhana	4ec45753e6	rename server sets to server pools	2020-12-01 13:50:33 -08:00
Harshavardhana	bdd094bc39	fix: avoid sending errors on missing objects on locked buckets (#10994 ) make sure multi-object delete returned errors that are AWS S3 compatible	2020-11-28 21:15:45 -08:00
Poorna Krishnamoorthy	251c1ef6da	Add support for replication of object tags, retention metadata (#10880 )	2020-11-19 18:56:09 -08:00
Harshavardhana	1a1f00fa15	fix: use internode data for DisksInfo, VolsInfo in message pack (#10821 ) Similar to #10775 for fewer memory allocations, since we use getOnlineDisks() extensively for listing we should optimize it further. Additionally, remove all unused walkers from the storage layer	2020-11-04 10:10:54 -08:00
Klaus Post	2294e53a0b	Don't retain context in locker (#10515 ) Use the context for internal timeouts, but disconnect it from outgoing calls so we always receive the results and cancel it remotely.	2020-11-04 08:25:42 -08:00
Harshavardhana	4ea31da889	fix: move list quorum ENV to config (#10804 )	2020-11-02 17:21:56 -08:00
Harshavardhana	5412d730c1	simplify monitoring doesn't need to be canceled (#10803 ) connect disks monitoring doesn't need to be canceled upon drive replacement, since we only need to replace the newly replaced drive.	2020-10-31 14:10:12 -07:00
Harshavardhana	b686bb9c83	fix: replaced drive properly by healing the entire drive (#10799 ) Bonus fixes, we do not need reload format anymore as the replaced drive is healed locally we only need to ensure that drive heal reloads the drive properly. We preserve the UUID of the original order, this means that the replacement in `format.json` doesn't mean that the drive needs to be reloaded into memory anymore. fixes #10791	2020-10-31 01:34:48 -07:00
Harshavardhana	5b30bbda92	fix: add more protection distribution to match EcIndex (#10772 ) allows for more stricter validation in picking up the right set of disks for reconstruction.	2020-10-28 00:09:15 -07:00
Harshavardhana	029758cb20	fix: retain the previous UUID for newly replaced drives (#10759 ) only newly replaced drives get the new `format.json`, this avoids disks reloading their in-memory reference format, ensures that drives are online without reloading the in-memory reference format. keeping reference format in-tact means UUIDs never change once they are formatted.	2020-10-26 10:29:29 -07:00
Harshavardhana	6a8c62f9fd	make sure to preserve UUID from reference format (#10748 ) reference format should be source of truth for inconsistent drives which reconnect, add them back to their original position remove automatic fix for existing offline disk uuids	2020-10-24 13:23:08 -07:00
Harshavardhana	734f258878	fix: slow down auto healing more aggressively (#10730 ) Bonus fixes - logging improvements to ensure that we don't use `go logger.LogIf` to avoid runtime.Caller missing the function name. log where necessary. - remove unused code at erasure sets	2020-10-22 13:36:24 -07:00
Harshavardhana	ad726b49b4	rename zones to serverSets to avoid terminology conflict (#10679 ) we are bringing in availability zones, we should avoid zones as per server expansion concept.	2020-10-15 14:28:50 -07:00
Harshavardhana	f1cc16e788	fix: background heal rely on getOnlineDisks() (#10687 )	2020-10-15 13:06:23 -07:00
Harshavardhana	71b97fd3ac	fix: connect disks pre-emptively during startup (#10669) Connect disks pre-emptively upon startup, to ensure that enough disks are connected at startup rather than waiting for them. We need to do this to avoid long wait times for the server to be online when servers come up in rolling upgrade fashion.	2020-10-13 18:28:42 -07:00
Harshavardhana	2760fc86af	Bump default idleConnsPerHost to control conns in time_wait (#10653 ) This PR fixes a hang which occurs quite commonly at higher concurrency by allowing following changes - allowing lower connections in time_wait allows faster socket open's - lower idle connection timeout to ensure that we let kernel reclaim the time_wait connections quickly - increase somaxconn to 4096 instead of 2048 to allow larger tcp syn backlogs. fixes #10413	2020-10-12 14:19:46 -07:00
Harshavardhana	6484453fc6	optionally allow strict quorum listing (#10649 ) ``` export MINIO_API_LIST_STRICT_QUORUM=on ``` would enable listing in quorum if necessary	2020-10-09 15:40:46 -07:00
Harshavardhana	736e58dd68	fix: handle concurrent lockers with multiple optimizations (#10640 ) - select lockers which are non-local and online to have affinity towards remote servers for lock contention - optimize lock retry interval to avoid sending too many messages during lock contention, reduces average CPU usage as well - if bucket is not set, when deleteObject fails make sure setPutObjHeaders() honors lifecycle only if bucket name is set. - fix top locks to list out always the oldest lockers always, avoid getting bogged down into map's unordered nature.	2020-10-08 12:32:32 -07:00
Harshavardhana	66174692a2	add '.healing.bin' for tracking currently healing disk (#10573 ) add a hint on the disk to allow for tracking fresh disk being healed, to allow for restartable heals, and also use this as a way to track and remove disks. There are more pending changes where we should move all the disk formatting logic to backend drives, this PR doesn't deal with this refactor instead makes it easier to track healing in the future.	2020-09-28 19:39:32 -07:00
Harshavardhana	eafa775952	fix: add lock ownership to expire locks (#10571 ) - Add owner information for expiry, locking, unlocking a resource - TopLocks returns now locks in quorum by default, provides a way to capture stale locks as well with `?stale=true` - Simplify the quorum handling for locks to avoid from storage class, because there were challenges to make it consistent across all situations. - And other tiny simplifications to reset locks.	2020-09-25 19:21:52 -07:00
Harshavardhana	ca989eb0b3	avoid ListBuckets returning quorum errors when node is down (#10555 ) Also, revamp the way ListBuckets work make few portions of the healing logic parallel - walk objects for healing disks in parallel - collect the list of buckets in parallel across drives - provide consistent view for listBuckets()	2020-09-24 09:53:38 -07:00
Harshavardhana	e60834838f	fix: background disk heal, to reload format consistently (#10502 ) It was observed in a VMware vSphere environment that during a pod replacement, `mc admin info` might report incorrect offline nodes for the replaced drive. This issue eventually goes away but requires quite a lot of time for all servers to be in sync. This PR fixes this behavior properly.	2020-09-16 21:14:35 -07:00
Harshavardhana	0104af6bcc	delayed locks until we have started reading the body (#10474 ) This is to ensure that Go contexts work properly. After some interesting experiments I found that Go net/http doesn't cancel the context when Body is non-zero and hasn't been read till EOF. The following gist explains this; it can lead to a pile-up of goroutines on the server which will never be canceled and will only die much later, which can simply overwhelm the server. https://gist.github.com/harshavardhana/c51dcfd055780eaeb71db54f9c589150 To avoid this, refactor the locking such that we take locks after we have started reading from the body and only take locks when needed. Also, remove contextReader as it's not useful: it doesn't work as expected because the context is not canceled until the body reaches EOF, so there is no point in wrapping it with context and putting a `select {` on it, which can unnecessarily increase the CPU overhead. We will still use the context to cancel the lockers etc. Additional simplification in the locker code to avoid timers, as re-using them is a complicated ordeal; avoid them in the hot path, since locking is very common and this may avoid lots of allocations.	2020-09-14 15:57:13 -07:00
Klaus Post	493c714663	Remove erasureSets and erasureObjects from ObjectLayer (#10442 )	2020-09-10 09:18:19 -07:00
Harshavardhana	6a0372be6c	cleanup tmpDir any older entries automatically just like multipart (#10439 ) also consider multipart uploads, temporary files in `.minio.sys/tmp` as stale beyond 24hrs and clean them up automatically	2020-09-08 15:55:40 -07:00
Harshavardhana	b0e1d4ce78	re-attach offline drive after new drive replacement (#10416 ) Drive healing was inconsistent when one of the drives was offline while a new drive was replaced; this change ensures that we can add the offline drive back into the mix by healing it again.	2020-09-04 17:09:02 -07:00
Harshavardhana	eb19c8af40	Bump response header timeout for proxying list request (#10420 )	2020-09-04 16:07:40 -07:00
Klaus Post	2d58a8d861	Add storage layer contexts (#10321 ) Add context to all (non-trivial) calls to the storage layer. Contexts are propagated through the REST client. - `context.TODO()` is left in place for the places where it needs to be added to the caller. - `endWalkCh` could probably be removed from the walkers, but no changes so far. The "dangerous" part is that now a caller disconnecting will propagate down, so a "delete" operation will now be interrupted. In some cases we might want to disconnect this functionality so the operation completes if it has started, leaving the system in a cleaner state.	2020-09-04 09:45:06 -07:00
Harshavardhana	a359e36e35	tolerate listing with only readQuorum disks (#10357 ) We can reduce this further in the future, but this is a good value to keep around. With the advent of continuous healing, we can be assured that the namespace will eventually be consistent, so we are okay to avoid the necessity to list across all drives on all sets. Bonus: Pop()'s in parallel seem to have the potential to wait too long on large drive setups and cause more slowness instead of gaining any performance, so remove it for now. Also, implement load balanced reply for local disks, ensuring that local disks have an affinity for - cleanupStaleMultipartUploads()	2020-08-26 19:29:35 -07:00
Harshavardhana	d19b434ffc	fix: bring back delayed leaf detection in listing (#10346 )	2020-08-25 12:26:48 -07:00


170 Commits