Commit Graph

79 Commits

Harshavardhana b690304eed
use faster way for siphash (#11640) 2021-02-26 16:53:06 -08:00
Harshavardhana 6386b45c08
[feat] use rename instead of recursive deletes (#11641)
most of the delete calls today spend time in
a blocking operation where multiple calls need
to be sent recursively to delete the objects;
instead we can use a rename operation to atomically
move the objects from the namespace to `tmp/.trash`

we can schedule deletion of objects at this
location once every 15 or 30 minutes, and we can
also add wait times between each delete operation.

this makes deletes faster as well as less
chatty on the drives; each server locally runs
a goroutine which cleans this up regularly.
2021-02-26 09:52:27 -08:00
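
A minimal sketch of the rename-to-trash pattern described in the commit above; `deleteObject`, `cleanupTrash`, and the path layout are illustrative, not MinIO's actual identifiers:

```
package trash

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// deleteObject moves an object directory into tmp/.trash with a single
// atomic rename, instead of recursively deleting its contents inline.
func deleteObject(diskPath, objectPath string) error {
	trashDir := filepath.Join(diskPath, "tmp", ".trash")
	if err := os.MkdirAll(trashDir, 0o755); err != nil {
		return err
	}
	// unique target name so repeated deletes never collide
	target := filepath.Join(trashDir, fmt.Sprintf("%d", time.Now().UnixNano()))
	return os.Rename(filepath.Join(diskPath, objectPath), target)
}

// cleanupTrash is the per-server goroutine: it empties the trash on a
// schedule, pausing between entries to stay gentle on the drives.
func cleanupTrash(diskPath string, every, pause time.Duration) {
	trashDir := filepath.Join(diskPath, "tmp", ".trash")
	for range time.Tick(every) {
		entries, _ := os.ReadDir(trashDir)
		for _, e := range entries {
			os.RemoveAll(filepath.Join(trashDir, e.Name()))
			time.Sleep(pause) // wait time between each delete operation
		}
	}
}
```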
Andreas Auernhammer 1f659204a2
remove GetObject from ObjectLayer interface (#11635)
This commit removes the `GetObject` method
from the `ObjectLayer` interface.

The `GetObject` method is no longer used by
the HTTP handlers implementing the high-level
S3 semantics. Instead, they use the `GetObjectNInfo`
method, which returns both an object handle and
the object metadata.

Therefore, it is no longer necessary that a concrete
`ObjectLayer` implements `GetObject`.
2021-02-26 09:52:02 -08:00
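
A simplified sketch of the resulting interface shape, with stand-in types and a trimmed-down signature (MinIO's real `GetObjectNInfo` takes more parameters):

```
package objlayer

import (
	"context"
	"io"
)

// ObjectInfo and GetObjectReader stand in for MinIO's real types here.
type ObjectInfo struct {
	Bucket, Name string
	Size         int64
}

type GetObjectReader struct {
	io.ReadCloser
	ObjInfo ObjectInfo
}

// ObjectLayer no longer needs a separate GetObject method: handlers call
// GetObjectNInfo, which hands back the object stream and its metadata
// together.
type ObjectLayer interface {
	GetObjectNInfo(ctx context.Context, bucket, object string) (*GetObjectReader, error)
	// GetObject(...) — removed; high-level S3 handlers no longer use it.
}
```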
Harshavardhana a8e4f64ff3 Revert "fix: remove persistence layer for metacache store in memory (#11538)"
This reverts commit b23659927c.
2021-02-24 22:24:51 -08:00
Harshavardhana b23659927c
fix: remove persistence layer for metacache store in memory (#11538)
store the cache in memory instead of on the disks
to avoid large write amplification for list-heavy
workloads, and let it auto-expire.
2021-02-24 15:51:41 -08:00
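
A hedged sketch of this (since reverted) idea: an in-memory, auto-expiring cache in place of an on-disk persistence layer. All names here are illustrative:

```
package metacache

import (
	"sync"
	"time"
)

type cacheEntry struct {
	data    []byte
	expires time.Time
}

// metaCache keeps list-cache entries in memory and lets them auto-expire,
// avoiding the write amplification of persisting them to the drives.
type metaCache struct {
	mu  sync.Mutex
	m   map[string]cacheEntry
	ttl time.Duration
}

func newMetaCache(ttl time.Duration) *metaCache {
	return &metaCache{m: make(map[string]cacheEntry), ttl: ttl}
}

func (c *metaCache) put(key string, data []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = cacheEntry{data: data, expires: time.Now().Add(c.ttl)}
}

func (c *metaCache) get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok {
		return nil, false
	}
	if time.Now().After(e.expires) {
		delete(c.m, key) // expired: drop lazily on access
		return nil, false
	}
	return e.data, true
}
```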
Harshavardhana aa7244a9a4
fix: make sure to convert the error properly in HealBucket() (#11610)
server startup code expects the object layer to properly
convert errors into the proper type, so that when
servers are coming up and quorum is not available,
servers wait on each other.
2021-02-23 09:23:11 -08:00
Klaus Post 8a6b13c239
Avoid synchronizing usage writes (#11560)
If the periodic `case <-t.C:` save gets held up for a long time, it will end up
synchronizing all disk writes for saving the caches.

We add jitter to per-set writes so they don't sync up, and we don't hold a
lock for the write, since it isn't needed anyway.

If an outage prevents writes for a long while, we also add individual
waits for each disk in case there was a queue.

Furthermore, limit the number of buffers kept to 2GiB, since this could get
huge in large clusters. This will not act as a hard limit but should be enough
for normal operation.
2021-02-18 00:38:37 -08:00
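
A minimal sketch of the jitter technique, assuming a hypothetical per-set `saveLoop`:

```
package usage

import (
	"math/rand"
	"time"
)

// saveLoop persists one set's usage cache on an interval, offset by a
// random jitter so the writes for different sets do not synchronize.
func saveLoop(interval time.Duration, saveCache func() error) {
	// initial jitter spreads each set's first save across the interval
	time.Sleep(time.Duration(rand.Int63n(int64(interval))))
	t := time.NewTicker(interval)
	defer t.Stop()
	for range t.C {
		// no shared lock held here; each set writes independently
		_ = saveCache()
	}
}
```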
Krishnan Parthasarathi b87fae0049
Simplify PutObjReader for plain-text reader usage (#11470)
This change moves away from a unified constructor for plaintext and encrypted
usage. NewPutObjReader is simplified for plain-text reader use. For
encrypted reader use, WithEncryption should be called on an initialized PutObjReader.

Plaintext:
func NewPutObjReader(rawReader *hash.Reader) *PutObjReader

The hash.Reader is used to provide payload size and md5sum to the downstream
consumers. This is different from the previous version in that there is no need
to pass nil values for unused parameters.

Encrypted:
func WithEncryption(encReader *hash.Reader, key *crypto.ObjectKey) (*PutObjReader, error)

This method sets up the encrypted reader along with the key to seal the md5sum
produced by the plain-text reader (already set up when NewPutObjReader was
called).

Usage:
```
  pReader := NewPutObjReader(rawReader)
  // ... other object handler code goes here

  // Prepare the encrypted hashed reader
  pReader, err = pReader.WithEncryption(encReader, objEncKey)
```
2021-02-10 08:52:50 -08:00
Harshavardhana 88c1bb0720
fix: improper ticker usage in goroutines (#11468)
- lock maintenance loop was incorrectly sleeping
  as well as using the ticker badly, leading to
  extra expiration routines getting triggered
  that could flood the network.

- multipart upload cleanup should be based on a
  timer instead of a ticker, to ensure that long
  running jobs don't get triggered twice.

- make sure to get the right lockers for the object name
2021-02-05 19:23:48 -08:00
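
A sketch of the timer-over-ticker pattern the commit describes: the timer is only re-armed after the previous run finishes, so a long-running job can never be triggered twice. Names are illustrative:

```
package cleaner

import "time"

// cleanupLoop runs jobs sequentially; unlike a ticker, the timer is only
// re-armed after the previous run completes, so long jobs cannot overlap.
func cleanupLoop(interval time.Duration, job func()) {
	t := time.NewTimer(interval)
	defer t.Stop()
	for range t.C {
		job()             // may take longer than the interval
		t.Reset(interval) // safe: the receive above drained the channel
	}
}
```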
Anis Elleuch e96fdcd5ec
tagging: Add event notif for PUT object tagging (#11366)
An optimization to avoid a double call during PutObject tagging
2021-02-01 13:52:51 -08:00
Harshavardhana 1e53bf2789
fix: allow expansion with newer constraints for older setups (#11372)
previously we had a restriction where older setups
required expansion pools to follow the same "stripe"
count; we can relax that so newer pools can be
added to older setups under the newer constraint
of a common parity ratio.
2021-01-29 11:40:55 -08:00
Harshavardhana 6717295e18 fix: rename audit log docs and datastructure 2021-01-26 13:39:55 -08:00
Anis Elleuch 00cff1aac5
audit: per object send pool number, set number and servers per operation (#11233) 2021-01-26 13:21:51 -08:00
Harshavardhana 9cdd981ce7
fix: expire locks only on participating lockers (#11335)
additionally, add a new ForceUnlock API to
allow forcibly unlocking locks when possible.
2021-01-25 10:01:27 -08:00
Klaus Post dac19d7272
Clarify root disk error (#11314)
Make it clearer what the problem is and how to resolve it.
2021-01-20 13:11:42 -08:00
Harshavardhana 3ca6330661
fix: optimize parentDirIsObject by moving isObject to storage layer (#11291)
For objects with `N` prefix depth, this PR reduces `N` such network
operations by converting `CheckFile` into a single bulk operation.

The reduction in chattiness here allows disks to be utilized more
cleanly while maintaining the same functionality; one extra
volume-check stat() call is also removed.

Update tests to cover the multiple-sets scenario
2021-01-18 12:25:22 -08:00
Harshavardhana 4315f93421
fix: make sure parentDirIsObject is used at set level (#11280)
parentDirIsObject was not using a set-level understanding
to check for parent objects; without this, a parent
can actually reside on a separate set from the
object itself and conflict with it.
2021-01-17 01:11:48 -08:00
Harshavardhana f903cae6ff
Support variable server pools (#11256)
The current implementation requires server pools to have
the same erasure stripe sizes, to facilitate the same SLA
and expectations.

This PR allows server pools to be variadic, i.e. they
do not have to be of the same erasure stripe size; instead
they must meet an SLA on the parity ratio.

If the parity ratio cannot be guaranteed by the new
server pool, the deployment is rejected, i.e. server
pool expansion is not allowed.
2021-01-16 12:08:02 -08:00
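
A hedged sketch of the expansion check, under assumed names and fields, not MinIO's actual validation code:

```
package pools

import "fmt"

type poolSpec struct {
	setDriveCount int // erasure stripe size of this pool
	parityDrives  int // parity drives per stripe
}

// validateExpansion rejects a new pool whose parity ratio is weaker than
// the ratio the existing deployment guarantees; stripe sizes may differ.
func validateExpansion(existing, newPool poolSpec) error {
	oldRatio := float64(existing.parityDrives) / float64(existing.setDriveCount)
	newRatio := float64(newPool.parityDrives) / float64(newPool.setDriveCount)
	if newRatio < oldRatio {
		return fmt.Errorf("parity ratio %.2f below required %.2f: expansion rejected",
			newRatio, oldRatio)
	}
	return nil
}
```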
Harshavardhana e7ae49f9c9
fix: calculate prometheus disks_offline/disks_total correctly (#11215)
fixes #11196
2021-01-04 09:42:09 -08:00
Anis Elleuch 2ecaab55a6
admin: ServerInfo returns info without object layer initialized (#11142) 2020-12-21 09:35:19 -08:00
Harshavardhana f714840da7
add _MINIO_SERVER_DEBUG env for enabling debug messages (#11128) 2020-12-17 16:52:47 -08:00
Harshavardhana 7c9ef76f66
fix: timer deadlock on expired timers (#11124)
the issue was introduced in #11106 by the following
pattern

<-t.C // timer fired

if !t.Stop() {
   <-t.C // timer hangs
}

The code hangs at the last `t.C` receive: a timer that
has already fired cannot be Stop()ed anymore, t.Stop()
returns `false`, and the channel has already been drained,
so the second receive blocks forever.

Refactor the code to use timers appropriately for the
exact requirements in place.
2020-12-17 12:35:02 -08:00
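
The safe pattern, sketched below: a `Stop`-and-drain is only needed when the timer has not already been received from; after a receive, a plain `Reset` suffices.

```
package main

import "time"

func main() {
	t := time.NewTimer(time.Second)
	defer t.Stop()

	<-t.C // timer fired; its channel is now empty

	// WRONG: t.Stop() returns false for an already-fired timer, and a
	// second receive on the drained channel blocks forever:
	//   if !t.Stop() { <-t.C }

	t.Reset(time.Second) // RIGHT: after a receive, just Reset
	<-t.C
}
```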
Harshavardhana b390a2a0b9
fix: reuse timers in erasure set hotpaths (#11106)
reuse timers in

 - connectDisks() monitoring
 - healMRFRoutine() channel timeouts
2020-12-16 14:33:05 -08:00
Harshavardhana c606c76323
fix: prioritized latest buckets for crawler to finish the scans faster (#11115)
the crawler should only ListBuckets once, not once per serverPool;
buckets are the same across all pools and sets, and ListBuckets
always returns a unified view. Once ListBuckets returns,
sort by create time to scan the latest buckets earlier,
on the assumption that the latest buckets have less
content than older buckets, allowing them to be scanned
faster and providing a view closer to the latest state.
2020-12-15 17:34:54 -08:00
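
A minimal sketch of the ordering described above (type and field names are illustrative):

```
package crawler

import (
	"sort"
	"time"
)

type bucketInfo struct {
	Name    string
	Created time.Time
}

// newestFirst orders buckets so the crawler scans the most recently
// created ones first, on the assumption that they hold less content
// and finish faster.
func newestFirst(buckets []bucketInfo) {
	sort.Slice(buckets, func(i, j int) bool {
		return buckets[i].Created.After(buckets[j].Created)
	})
}
```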
Harshavardhana 8368ab76aa
fix: remove the requirement for healing buckets in ListBucketsHeal (#11098)
With the new refactor of bucket healing, healing a bucket happens
automatically, including its metadata; there is no need to
redundantly heal buckets in ListBucketsHeal as well, so remove
it.
2020-12-14 12:07:07 -08:00
Harshavardhana 2eb52ca5f4
fix: heal bucket metadata right before healing bucket (#11097)
an optimization mainly to avoid listing the entire
`.minio.sys/buckets/.minio.sys` directory; this
can get really huge and gets in the way of startup
routines. Contents inside `.minio.sys/buckets/.minio.sys`
are rather transient and do not need to be healed.
2020-12-13 11:57:08 -08:00
Harshavardhana 4550ac6fff
fix: refactor locks to apply them uniquely per node (#11052)
This refactor is done for a few reasons, listed below

- to avoid deadlocks in scenarios when the number
  of nodes is smaller than the actual erasure stripe
  count, where N participating local lockers
  can lead to deadlocks across systems.

- avoids expiry routines running 1000s of separate
  network operations and routes per disk, when
  each of them is still accessing one single
  local entity.

- it is ideal to have a single globalLockServer
  per instance.

- in a 32-node deployment, each server group is
  still concentrated towards the same set of
  lockers that participate during the write/read
  phase; unlike the previous minio/dsync
  implementation, this avoids sending 32 requests -
  instead we send at most as many requests as there
  are unique nodes participating in a write/read
  phase.

- reduces overall chattiness on smaller setups.
2020-12-10 07:28:37 -08:00
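
A hedged sketch of the per-node idea: collapse the per-disk locker list to one locker per host, so a write/read phase contacts each unique server at most once. Names are assumptions:

```
package lockers

type locker struct {
	host string // one lock endpoint per server, not per disk
}

// uniqueLockers collapses the per-disk locker list down to one locker
// per node, so a write/read phase sends at most one request per unique
// server instead of one per participating disk.
func uniqueLockers(perDisk []locker) []locker {
	seen := make(map[string]bool)
	var out []locker
	for _, l := range perDisk {
		if !seen[l.host] {
			seen[l.host] = true
			out = append(out, l)
		}
	}
	return out
}
```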
Harshavardhana ce93b2681b
fix: re-use er.getDisks() properly in certain calls (#11043) 2020-12-07 10:04:07 -08:00
Harshavardhana 9c53cc1b83
fix: heal multiple buckets in bulk (#11029)
makes server startup orders of magnitude
faster with a large number of buckets
2020-12-05 13:00:44 -08:00
Harshavardhana 4ec45753e6 rename server sets to server pools 2020-12-01 13:50:33 -08:00
Harshavardhana bdd094bc39
fix: avoid sending errors on missing objects on locked buckets (#10994)
make sure multi-object delete returns errors that are AWS S3 compatible
2020-11-28 21:15:45 -08:00
Poorna Krishnamoorthy 251c1ef6da Add support for replication of object tags, retention metadata (#10880) 2020-11-19 18:56:09 -08:00
Harshavardhana 1a1f00fa15
fix: use internode data for DisksInfo, VolsInfo in message pack (#10821)
Similar to #10775, for fewer memory allocations; since we use
getOnlineDisks() extensively for listing, we should optimize it
further.

Additionally, remove all unused walkers from the storage layer
2020-11-04 10:10:54 -08:00
Klaus Post 2294e53a0b
Don't retain context in locker (#10515)
Use the context for internal timeouts, but disconnect it from outgoing
calls so we always receive the results and can cancel it remotely.
2020-11-04 08:25:42 -08:00
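
A sketch of the technique under assumed names: the caller's context drives only the local wait, while the outgoing call runs with its own timeout, so the result is always observed and can be undone remotely if the caller has gone away:

```
package lockrpc

import (
	"context"
	"time"
)

// callLocker performs an outgoing lock RPC with a timeout independent of
// the caller's context, so a caller cancellation never orphans remote
// state; we still observe the result and undo it if the caller gave up.
func callLocker(callerCtx context.Context, do func(context.Context) error, undo func()) error {
	// the RPC gets its own timeout, detached from the caller's cancellation
	rpcCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)

	done := make(chan error, 1)
	go func() {
		done <- do(rpcCtx)
		cancel() // release the timeout once the call has finished
	}()

	select {
	case err := <-done:
		return err
	case <-callerCtx.Done():
		// caller gave up: still observe the result, and undo on success
		go func() {
			if err := <-done; err == nil {
				undo()
			}
		}()
		return callerCtx.Err()
	}
}
```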
Harshavardhana 4ea31da889
fix: move list quorum ENV to config (#10804) 2020-11-02 17:21:56 -08:00
Harshavardhana 5412d730c1
simplify monitoring: it doesn't need to be canceled (#10803)
connect-disks monitoring doesn't need to be canceled
upon drive replacement, since we only need to deal with
the newly replaced drive.
2020-10-31 14:10:12 -07:00
Harshavardhana b686bb9c83
fix: replaced drive properly by healing the entire drive (#10799)
Bonus fixes: we do not need to reload the format anymore,
as the replaced drive is healed locally; we only need
to ensure that the drive heal reloads the drive properly.

We preserve the UUID of the original order; this means
that a replacement in `format.json` no longer requires
the drive to be reloaded into memory.

fixes #10791
2020-10-31 01:34:48 -07:00
Harshavardhana 5b30bbda92
fix: add more protection distribution to match EcIndex (#10772)
allows for stricter validation in picking up the right
set of disks for reconstruction.
2020-10-28 00:09:15 -07:00
Harshavardhana 029758cb20
fix: retain the previous UUID for newly replaced drives (#10759)
only newly replaced drives get the new `format.json`;
this avoids disks reloading their in-memory reference
format and ensures that drives come online without
reloading the in-memory reference format.

keeping the reference format intact means UUIDs
never change once they are formatted.
2020-10-26 10:29:29 -07:00
Harshavardhana 6a8c62f9fd
make sure to preserve UUID from reference format (#10748)
the reference format should be the source of truth
for inconsistent drives which reconnect; add them
back to their original position

remove the automatic fix for existing offline
disk uuids
2020-10-24 13:23:08 -07:00
Harshavardhana 734f258878
fix: slow down auto healing more aggressively (#10730)
Bonus fixes

- logging improvements to ensure that we don't use
  `go logger.LogIf`, which makes runtime.Caller miss
  the function name; log where necessary.
- remove unused code at erasure sets
2020-10-22 13:36:24 -07:00
Harshavardhana ad726b49b4
rename zones to serverSets to avoid terminology conflict (#10679)
since we are bringing in availability zones, we should
avoid the term zones for the server expansion concept.
2020-10-15 14:28:50 -07:00
Harshavardhana f1cc16e788
fix: background heal rely on getOnlineDisks() (#10687) 2020-10-15 13:06:23 -07:00
Harshavardhana 71b97fd3ac
fix: connect disks pre-emptively during startup (#10669)
connect disks pre-emptively upon startup, to ensure
enough disks are connected at startup rather than
waiting for them.

we need to do this to avoid long wait times for the
server to be online when servers come up in a rolling
upgrade fashion
2020-10-13 18:28:42 -07:00
Harshavardhana 2760fc86af
Bump default idleConnsPerHost to control conns in time_wait (#10653)
This PR fixes a hang which occurs quite commonly at higher concurrency
by making the following changes

- allowing fewer connections in time_wait allows faster socket opens
- lower the idle connection timeout to ensure that we let the kernel
  reclaim the time_wait connections quickly
- increase somaxconn to 4096 instead of 2048 to allow larger tcp
  syn backlogs.

fixes #10413
2020-10-12 14:19:46 -07:00
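
A sketch of the transport side of this tuning in Go; the values are illustrative, and `somaxconn` is a kernel sysctl rather than anything settable from Go:

```
package transport

import (
	"net/http"
	"time"
)

// newInternodeTransport bumps idle conns per host so internode traffic
// reuses sockets instead of churning through time_wait, and shortens the
// idle timeout so the kernel can reclaim unused connections quickly.
func newInternodeTransport() *http.Transport {
	// the larger tcp syn backlog is an OS knob, e.g.:
	//   sysctl -w net.core.somaxconn=4096
	return &http.Transport{
		MaxIdleConnsPerHost: 1024,             // illustrative; Go's default is 2
		IdleConnTimeout:     15 * time.Second, // let time_wait sockets drain fast
	}
}
```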
Harshavardhana 6484453fc6
optionally allow strict quorum listing (#10649)
```
export MINIO_API_LIST_STRICT_QUORUM=on
```

would enable listing in quorum if necessary
2020-10-09 15:40:46 -07:00
Harshavardhana 736e58dd68
fix: handle concurrent lockers with multiple optimizations (#10640)
- select lockers which are non-local and online, to have
  affinity towards remote servers for lock contention

- optimize the lock retry interval to avoid sending too many
  messages during lock contention, which reduces average CPU
  usage as well

- when deleteObject fails, make sure setPutObjHeaders()
  honors lifecycle only if the bucket name is set.

- fix top locks to always list the oldest lockers first,
  avoiding getting bogged down by the map's unordered nature.
2020-10-08 12:32:32 -07:00
Harshavardhana 66174692a2
add '.healing.bin' for tracking currently healing disk (#10573)
add a hint on the disk to allow for tracking a fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.

There are more pending changes where we should move
all the disk formatting logic to the backend drives; this
PR doesn't deal with that refactor, instead it makes it
easier to track healing in the future.
2020-09-28 19:39:32 -07:00
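
A hedged sketch of the marker-file idea; the exact `.healing.bin` format and location in MinIO differ from this toy version:

```
package healing

import (
	"os"
	"path/filepath"
)

const healingMarker = ".healing.bin"

// markHealing drops a marker on a freshly replaced disk so an
// interrupted heal can be detected and resumed after a restart.
func markHealing(diskPath string, state []byte) error {
	return os.WriteFile(filepath.Join(diskPath, healingMarker), state, 0o644)
}

// isHealing reports whether the disk still carries an unfinished heal.
func isHealing(diskPath string) bool {
	_, err := os.Stat(filepath.Join(diskPath, healingMarker))
	return err == nil
}
```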
Harshavardhana eafa775952
fix: add lock ownership to expire locks (#10571)
- Add owner information for expiring, locking, and unlocking a resource
- TopLocks now returns locks in quorum by default, and provides
  a way to capture stale locks as well with `?stale=true`
- Simplify the quorum handling for locks to avoid deriving it from
  the storage class, because there were challenges making it
  consistent across all situations.
- And other tiny simplifications to reset locks.
2020-09-25 19:21:52 -07:00
Harshavardhana ca989eb0b3
avoid ListBuckets returning quorum errors when node is down (#10555)
Also, revamp the way ListBuckets works and make a few portions
of the healing logic parallel

- walk objects for healing disks in parallel
- collect the list of buckets in parallel across drives
- provide a consistent view for listBuckets()
2020-09-24 09:53:38 -07:00
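
A minimal sketch of the parallel, quorum-based bucket listing described above (the interface and names are illustrative):

```
package buckets

import "sync"

type drive interface {
	ListVols() ([]string, error)
}

// listBuckets queries every drive concurrently and returns the buckets
// seen on at least `quorum` drives, giving a consistent view even when
// some drives or nodes are down.
func listBuckets(drives []drive, quorum int) []string {
	var mu sync.Mutex
	counts := make(map[string]int)

	var wg sync.WaitGroup
	for _, d := range drives {
		wg.Add(1)
		go func(d drive) {
			defer wg.Done()
			vols, err := d.ListVols()
			if err != nil {
				return // ignore offline drives instead of failing the call
			}
			mu.Lock()
			for _, v := range vols {
				counts[v]++
			}
			mu.Unlock()
		}(d)
	}
	wg.Wait()

	var out []string
	for name, n := range counts {
		if n >= quorum {
			out = append(out, name)
		}
	}
	return out
}
```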