minio

Commit Graph

Author	SHA1	Message	Date
Klaus Post	f939d1c183	Independent Multipart Uploads (#15346 ) Do completely independent multipart uploads. In distributed mode, a lock was held to merge each multipart upload as it was added. This lock was highly contested and retries are expensive (timewise) in distributed mode. Instead, each part adds its metadata information uniquely. This eliminates the per object lock required for each to merge. The metadata is read back and merged by "CompleteMultipartUpload" without locks when constructing final object. Co-authored-by: Harshavardhana <harsha@minio.io>	2022-07-19 08:35:29 -07:00
Harshavardhana	7da9e3a6f8	support encrypted/compressed objects properly during decommission (#15320 ) fixes #15314	2022-07-16 19:35:24 -07:00
Anis Elleuch	876970baea	Exclude upload-ids with incomplete part upload in multipart listing (#15318 ) Uploading a part object can leave an inconsistent state inside .minio.sys/multipart where data are uploaded but xl.meta is not committed yet. Do not list upload-ids that have this state in the multipart listing.	2022-07-16 13:25:58 -07:00
Harshavardhana	1b339ea062	allow force delete on decom pool (#15302 ) Bonus: - skip suspended pool from being considered for multipart uploads - add more context for decomErrors()	2022-07-14 20:44:22 -07:00
Harshavardhana	236ef03dbd	fix: skip objects expired via lifecycle rules during decommission (#15300 )	2022-07-14 16:47:09 -07:00
Klaus Post	911a17b149	Add compressed file index (#15247 )	2022-07-11 17:30:56 -07:00
Praveen raj Mani	b49fc33cb3	purge objects immediately with `x-minio-force-delete` in DeleteObject and DeleteBucket API (#15148 )	2022-07-11 09:15:54 -07:00
Anis Elleuch	54a061bdda	Save minio version information centrally (#15181 )	2022-06-29 14:45:49 -07:00
Harshavardhana	9c605ad153	allow support for parity '0', '1' enabling support for 2,3 drive setups (#15171 ) allows for further granular setups - 2 drives (1 parity, 1 data) - 3 drives (1 parity, 2 data) Bonus: allows '0' parity as well.	2022-06-27 20:22:18 -07:00
Harshavardhana	6722f58668	save MinIO version with each version (8-bytes extra) (#15170 ) store MinIO version along with each version in 'xl.meta' for future purposes, can be used as ways to add specific code for bug fixes if any.	2022-06-27 03:59:41 -07:00
Harshavardhana	52221db7ef	fix: for unexpected errors in reading versioning config panic (#14994 ) We need to make sure if we cannot read bucket metadata for some reason, and bucket metadata is not missing and returning corrupted information we should panic such handlers to disallow I/O to protect the overall state on the system. In-case of such corruption we have a mechanism now to force recreate the metadata on the bucket, using `x-minio-force-create` header with `PUT /bucket` API call. Additionally fix the versioning config updated state to be set properly for the site replication healing to trigger correctly.	2022-05-31 02:57:57 -07:00
Harshavardhana	c7df1ffc6f	avoid concurrent reads and writes to opts.UserDefined (#14862 ) do not modify opts.UserDefined after object-handler has set all the necessary values, any mutation needed should be done on a copy of this value not directly. As there are other pieces of code that access opts.UserDefined concurrently this becomes challenging. fixes #14856	2022-05-05 04:14:41 -07:00
Anis Elleuch	44a3b58e52	Add audit log for decommissioning (#14858 )	2022-05-04 00:45:27 -07:00
Klaus Post	64d4da5a37	Add Put input readahead (#14084 ) When reading input for PutObject or PutObjectPart add a readahead buffer for big inputs. This will make network reads+hashing separate run async with erasure coding and writes. This will reduce overall latency in distributed setups where the input is from upstream and writes go to other servers. We will read at 2 buffers ahead, meaning one will always be ready/waiting and one is currently being read from. This improves PutObject and PutObjectParts for these cases.	2022-01-14 10:01:25 -08:00
Harshavardhana	f546636c52	fix: use renameAll instead of deleteObject() for purging temporary files (#14096 ) This PR simplifies few things - Multipart parts are renamed, upon failure are unrenamed() keep this multipart specific behavior it is needed and works fine. - AbortMultipart should blindly delete once lock is acquired instead of re-reading metadata and calculating quorum, abort is a delete() operation and client has no business looking for errors on this. - Skip Access() calls to folders that are operating on `.minio.sys/multipart` folder as well.	2022-01-13 11:07:41 -08:00
Harshavardhana	f527c708f2	run gofumpt cleanup across code-base (#14015 )	2022-01-02 09:15:06 -08:00
Harshavardhana	c791de0e1e	re-implement pickValidInfo dataDir, move to quorum calculation (#13681 ) dataDir loosely based on maxima is incorrect and does not work in all situations such as disks in the following order - xl.json migration to xl.meta there may be partial xl.json's leftover if some disks are not yet connected when the disk is yet to come up, since xl.json mtime and xl.meta is same the dataDir maxima doesn't work properly leading to quorum issues. - its also possible that XLV1 might be true among the disks available, make sure to keep FileInfo based on common quorum and skip unexpected disks with the older data format. Also, this PR tests upgrade from older to a newer release if the data is readable and matches the checksum. NOTE: this is just initial work we can build on top of this to do further tests.	2021-11-21 10:41:30 -08:00
Krishnan Parthasarathi	31d7cc2cd4	erasure: Set fi.IsLatest when adding a new version (#13277 )	2021-09-22 19:17:09 -07:00
Harshavardhana	0892f1e406	fix: multipart replication and encrypted etag for sse-s3 (#13171 ) Replication was not working properly for encrypted objects in single PUT object for preserving etag, We need to make sure to preserve etag such that replication works properly and not gets into infinite loops of copying due to ETag mismatches.	2021-09-08 22:25:23 -07:00
Harshavardhana	e9d970154d	use renameAll instead of deleteAll for metacache-manager (#13005 ) renameAll is cheaper, rely on background deletes instead.	2021-08-19 09:16:14 -07:00
Harshavardhana	40a2fa8e81	fix: add more optimizations to putMetacacheObject() (#12916 ) - avoid extra lookup for 'xl.meta' since we are definitely sure that it doesn't exist. - use this in newMultipartUpload() as well - also additionally do not write with O_DSYNC to avoid loading the drives, instead create 'xl.meta' for listing operations without O_DSYNC since these are ephemeral objects. - do the same with newMultipartUpload() since it gets synced when the PutObjectPart() is attempted, we do not need to tax newMultipartUpload() instead.	2021-08-10 11:12:22 -07:00
Harshavardhana	035882d292	fix: remove parentIsObject() check (#12851 ) we will allow situations such as ``` a/b/1.txt a/b ``` and ``` a/b a/b/1.txt ``` we are going to document that this usecase is not supported and we will never support it, if any application does this users have to delete the top level parent to make sure namespace is accessible at lower level. rest of the situations where the prefixes get created across sets are supported as is.	2021-08-03 13:26:57 -07:00
Anis Elleuch	b0b4696a64	heal: Add MRF metrics to background heal API response (#12398 ) This commit gathers MRF metrics from all nodes in a cluster and return it to the caller. This will show information about the number of objects in the MRF queues waiting to be healed.	2021-07-15 22:32:06 -07:00
Klaus Post	33cee9f38a	Improve multipart upload (#12514 ) Each multipart upload is holding a read lock for the entire upload duration of each part. This makes it impossible for other parts to complete until all currently uploading parts have released their locks. It will also make it impossible for new parts to start as long as the write lock is still being requested, essentially deadlocking uploads until all that may have been granted a read lock has been completed. Refactor to only hold the upload id lock while reading and writing the metadata, but hold a part id lock while the part is being uploaded.	2021-06-16 13:21:36 -07:00
Harshavardhana	1f262daf6f	rename all remaining packages to internal/ (#12418 ) This is to ensure that there are no projects that try to import `minio/minio/pkg` into their own repo. Any such common packages should go to `https://github.com/minio/pkg`	2021-06-01 14:59:40 -07:00
Harshavardhana	81d5688d56	move the dependency to minio/pkg for common libraries (#12397 )	2021-05-28 15:17:01 -07:00
Harshavardhana	89bb9f17d7	fix: when parityDrives hits > len(storageDisks)/2, keep maxParity (#12387 ) Additionally move out `x-minio-internal-erasure-upgraded` from HTTP headers list, as it's an internal header, rename elsewhere accordingly.	2021-05-27 13:38:04 -07:00
Klaus Post	acc452b7ce	Add more erasure codes on degraded systems. (#11852 ) In cases where a cluster is degraded, we do not uphold our consistency guarantee and we will write fewer erasure codes and rely on healing to recreate the missing shards. In some cases replacing known bad disks in practice take days. We want to change the behavior of a known degraded system to keep the erasure code promise of the storage class for each object. This will create the objects with the same confidence as a fully functional cluster. The tradeoff will be that objects created during a partial outage will take up slightly more space. This means that when the storage class is EC:4, there should always be written 4 parity shards, even if some disks are unavailable. When an object is created on a set, the disks are immediately checked. If any disks are unavailable additional parity shards will be made for each offline disk, up to 50% of the number of disks. We add an internal metadata field with the actual and intended erasure code level, this can optionally be picked up later by the scanner if we decide that data like this should be re-sharded.	2021-05-27 11:38:09 -07:00
Klaus Post	cde6469b88	Fix hanging erasure writes (#12253 ) However, this slice is also used for closing the writers, so close is never called on these. Furthermore when an error is returned from a write it is now reported to the reader. bonus: remove unused heal param from `newBitrotWriter`. * Remove copy, now that we don't mutate.	2021-05-17 08:32:28 -07:00
Harshavardhana	f1e479d274	remove more duplicate bloom filter trackers (#12302 ) At some places bloom filter tracker was getting updated for `.minio.sys/tmp` bucket, there is no reason to update bloom filters for those. And add a missing bloom filter update for MakeBucket() Bonus: purge unused function deleteEmptyDir()	2021-05-17 08:25:48 -07:00
Harshavardhana	64f6020854	fix: cleanup locking, cancel context upon lock timeout (#12183 ) upon errors to acquire lock context would still leak, since the cancel would never be called. since the lock is never acquired - proactively clear it before returning.	2021-04-29 20:55:21 -07:00
Anis Elleuch	9e797532dc	lock: Always cancel the returned Get(R)Lock context (#12162 ) * lock: Always cancel the returned Get(R)Lock context There is a leak with cancel created inside the locking mechanism. The cancel purpose was to cancel operations such as erasure get/put that are holding non-refreshable locks. This PR will ensure the created context.Cancel is passed to the unlock API so it will cleanup and avoid leaks. * locks: Avoid returning nil cancel in local lockers Since there is no Refresh mechanism in the local locking mechanism, we do not generate a new context or cancel. Currently, a nil cancel function is returned but this can cause a crash. Return a dummy function instead.	2021-04-27 16:12:50 -07:00
Poorna Krishnamoorthy	4be0f92067	Fix multipart restore to remove part match (#12161 ) Part ETags are not available after multipart finalizes, removing this check as not useful. Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io> Co-authored-by: Harshavardhana <harsha@minio.io>	2021-04-26 18:24:06 -07:00
Harshavardhana	4eb9b6eaf8	preserve metadata multipart restore (#12139 ) avoid re-read of xl.meta instead just use the success criteria from PutObjectPart() and check the ETag matches per Part, if they match then the parts have been successfully restored as is. Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-24 19:07:27 -07:00
Harshavardhana	069432566f	update license change for MinIO Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Harshavardhana	a7acfa6158	fix: pick valid FileInfo additionally based on dataDir (#12116 ) * fix: pick valid FileInfo additionally based on dataDir historically we have always relied on modTime to be consistent and same, we can now add additional reference to look for the same dataDir value. A dataDir is the same for an object at a given point in time for a given version, let's say a `null` version is overwritten in quorum we do not by mistake pick up the fileInfo's incorrectly. * make sure to not preserve fi.Data Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-21 19:06:08 -07:00
Harshavardhana	2ef824bbb2	collapse two distinct calls into single RenameData() call (#12093 ) This is an optimization by reducing one extra system call, and many network operations. This reduction should increase the performance for small file workloads.	2021-04-20 10:44:39 -07:00
Anis Elleuch	8d5456c15a	Fix error returned by HealObject in some cases (#11906 ) The background healing can return NoSuchUpload error, the reason is that healing code can return errFileNotFound with three parameters. Simplify the code by returning exact errUploadNotFound error in multipart code. Also ensure that a typed error is always returned whatever the number of parameters because it is better than showing internal error.	2021-03-26 11:17:23 -07:00
Harshavardhana	d7f32ad649	xl: avoid sending Delete() remote call for fully successful runs an optimization to avoid extra syscalls in PutObject(), adds up to our PutObject response times.	2021-03-24 17:32:12 -07:00
Harshavardhana	51a8619a79	[feat] Add configurable deadline for writers (#11822 ) This PR adds deadlines per Write() calls, such that slow drives are timed-out appropriately and the overall responsiveness for Writes() is always up to a predefined threshold providing applications sustained latency even if one of the drives is slow to respond.	2021-03-18 14:09:55 -07:00
Harshavardhana	6160188bf3	fix: erasure index based reading based on actual ParityBlocks (#11792 ) in some setups with ordering issues in drive configuration, we should rely on expected parityBlocks instead of `len(disks)/2`	2021-03-15 20:03:13 -07:00
Anis Elleuch	7be7109471	locking: Add Refresh for better locking cleanup (#11535 ) Co-authored-by: Anis Elleuch <anis@min.io> Co-authored-by: Harshavardhana <harsha@minio.io>	2021-03-03 18:36:43 -08:00
Klaus Post	c3217bd6eb	Use actual size for buffer selection (#11687 ) For compressed inputs, this will be -1, but the object may be small.	2021-03-03 16:28:10 -08:00
Harshavardhana	6386b45c08	[feat] use rename instead of recursive deletes (#11641 ) most of the delete calls today spend time in a blocking operation where multiple calls need to be recursively sent to delete the objects, instead we can use rename operation to atomically move the objects from the namespace to `tmp/.trash` we can schedule deletion of objects at this location once in 15, 30mins and we can also add wait times between each delete operation. this allows us to make deletes faster as well as less chatty on the drives, each server runs locally a goroutine which would clean this up regularly.	2021-02-26 09:52:27 -08:00
Klaus Post	85620dfe93	use bucket in path in distribution hash (#11634 ) Use bucket in erasure distribution hash. For the rare cases where objects with the same names are uploaded to many buckets.	2021-02-25 10:11:31 -08:00
Harshavardhana	b3c56b53fb	fix: metacache should only rename entries during cleanup (#11503 ) To avoid large delays in metacache cleanup, use rename instead of recursive delete calls, renames are cheaper move the content to minioMetaTmpBucket and then cleanup this folder once in 24hrs instead. If the new cache can replace an existing one, we should let it replace since that is currently being saved anyways, this avoids pile up of 1000's of metacache entries for same listing calls that are not necessary to be stored on disk.	2021-02-11 10:22:03 -08:00
Krishnan Parthasarathi	b87fae0049	Simplify PutObjReader for plain-text reader usage (#11470 ) This change moves away from a unified constructor for plaintext and encrypted usage. NewPutObjReader is simplified for the plain-text reader use. For encrypted reader use, WithEncryption should be called on an initialized PutObjReader. Plaintext: func NewPutObjReader(rawReader hash.Reader) PutObjReader The hash.Reader is used to provide payload size and md5sum to the downstream consumers. This is different from the previous version in that there is no need to pass nil values for unused parameters. Encrypted: func WithEncryption(encReader hash.Reader, key crypto.ObjectKey) (*PutObjReader, error) This method sets up encrypted reader along with the key to seal the md5sum produced by the plain-text reader (already setup when NewPutObjReader was called). Usage: ``` pReader := NewPutObjReader(rawReader) // ... other object handler code goes here // Prepare the encrypted hashed reader pReader, err = pReader.WithEncryption(encReader, objEncKey) ```	2021-02-10 08:52:50 -08:00
Harshavardhana	4315f93421	fix: make sure parentDirIsObject is used at set level (#11280 ) parentDirIsObject is not using set level understanding to check for parent objects, without this it can lead to objects that can actually reside on a separate set as objects and would conflict.	2021-01-17 01:11:48 -08:00
Harshavardhana	f903cae6ff	Support variable server pools (#11256 ) Current implementation requires server pools to have same erasure stripe sizes, to facilitate same SLA and expectations. This PR allows server pools to be variadic, i.e they do not have to be same erasure stripe sizes - instead they should have SLA for parity ratio. If the parity ratio cannot be guaranteed by the new server pool, the deployment is rejected i.e server pool expansion is not allowed.	2021-01-16 12:08:02 -08:00
Harshavardhana	f21d650ed4	fix: readData in bulk call using messagepack byte wrappers (#11228 ) This PR refactors the way we use buffers for O_DIRECT and to re-use those buffers for messagepack reader writer. After some extensive benchmarking found that not all objects have this benefit, and only objects smaller than 64KiB see this benefit overall. Benefits are seen from almost all objects from 1KiB - 32KiB Beyond this no objects see benefit with bulk call approach as the latency of bytes sent over the wire v/s streaming content directly from disk negate each other with no remarkable benefits. All other optimizations include reuse of msgp.Reader, msgp.Writer using sync.Pool's for all internode calls.	2021-01-07 19:27:31 -08:00
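
Note on acc452b7ce and 89bb9f17d7: the parity rule described in those two messages (one extra parity shard per offline disk, capped at half the disks in the set) can be illustrated with a small sketch. This is not MinIO's code; the function name `upgradedParity` and its parameters are hypothetical, and it assumes the cap is the integer value `totalDrives / 2`.

```go
package main

import "fmt"

// upgradedParity is a hypothetical helper (not a MinIO function). It returns
// how many parity shards to write for a new object when some drives in the
// erasure set are offline: the configured parity plus one per offline drive,
// capped at half of the drives in the set.
func upgradedParity(totalDrives, configuredParity, offlineDrives int) int {
	parity := configuredParity + offlineDrives
	// Assumed cap: never exceed 50% of the drives in the set.
	if maxParity := totalDrives / 2; parity > maxParity {
		parity = maxParity
	}
	return parity
}

func main() {
	// 16 drives, storage class EC:4, 2 drives offline.
	fmt.Println(upgradedParity(16, 4, 2))
	// 16 drives, storage class EC:4, 6 drives offline: the cap applies.
	fmt.Println(upgradedParity(16, 4, 6))
}
```

With all drives online the configured parity is returned unchanged, so a healthy set keeps its storage class.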


84 Commits