minio

Commit Graph

Author	SHA1	Message	Date
Anis Eleuch	24b4f9d748	Fix quorum calculation with zero parity objects (#19250) Currently, the code relies on object parity to decide whether it is a delete marker or a regular object. In the case of a delete marker, the returned quorum is half of the disks in the erasure set. However, this calculation must be corrected for objects with EC = 0, mainly because EC is not a one-time fixed configuration. Though all data are correct, the manifested symptom is a 503 with an EC=0 object. This bug was manifested after we introduced the fast Get Object feature that does not read all data from all disks in case of inlined objects.	2024-03-12 12:59:11 -07:00
Krishnan Parthasarathi	a7577da768	Improve expiration of tiered objects (#18926) - Use a shared worker pool for all ILM expiry tasks - Free version cleanup executes in a separate goroutine - Add a free version only if removing the remote object fails - Add ILM expiry metrics to the node namespace - Move tier journal tasks to expiryState - Remove unused on-disk journal for tiered objects pending deletion - Distribute expiry tasks across workers such that the expiry of versions of the same object is serialized - Ability to resize worker pool without server restart - Make scaling down of expiryState workers' concurrency safe; Thanks @klauspost - Add error logs when expiryState and transition state are not initialized (yet) * metrics: Add missed tier journal entry tasks * Initialize the ILM worker pool after the object layer	2024-03-01 21:11:03 -08:00
Harshavardhana	c599c11e70	fix: relax metadata checks for healing (#19165) We should do this to keep data healing as the primary focus. Fixing metadata as part of healing must be done, but making data available is the main goal. The main reason is that metadata inconsistencies can cause data availability issues, which must be avoided at all costs. An additional healing mechanism that involves "metadata-only" heal will be brought in later; for now we do not expect to have these checks. Continuation of #19154. Bonus: add a pro-active healthcheck to perform a connection	2024-02-29 22:49:01 -08:00
Harshavardhana	80ca120088	remove checkBucketExist check entirely to avoid fan-out calls (#18917) Put, List, and Multipart operations each rely heavily on calling GetBucketInfo() to verify whether the bucket exists. This has a large performance cost when many servers are involved. We did optimize this part by vectorizing the bucket calls; however, it's not enough: beyond 100 nodes the cost becomes fairly visible in terms of performance.	2024-01-30 12:43:25 -08:00
Harshavardhana	708cebe7f0	add necessary protection err, fileInfo slice reads and writes (#18854) protection was in place. However, it covered only some areas, so we re-arranged the code to ensure we could hold locks properly. Along with this, remove the DataShardFix code altogether, in deployments with many drive replacements, this can affect and lead to quorum loss.	2024-01-24 01:08:23 -08:00
Harshavardhana	dd2542e96c	add codespell action (#18818) Original work here, #18474, refixed and updated.	2024-01-17 23:03:17 -08:00
Harshavardhana	38637897ba	fix: listing SSE encrypted multipart objects (#18786) GetActualSize() was relying heavily on o.Parts() being non-empty to figure out whether the object is multipart. However, we have many indicators of whether an object is multipart. Blindly assuming that o.Parts == nil means the object is not multipart is an incorrect expectation; instead, multipart status must be obtained via - Stored metadata value indicating this is a multipart encrypted object. - Rely on <meta>-actual-size metadata to get the object's actual size. This value is preserved for additional reasons such as these. - ETag != 32 length	2024-01-15 00:57:49 -08:00
Harshavardhana	fba883839d	feat: bring new HDD related performance enhancements (#18239) Optionally allows customers to enable - Enable an external cache to catch GET/HEAD responses - Enable skipping disks that are slow to respond in GET/HEAD when we have already achieved a quorum	2023-11-22 13:46:17 -08:00
Poorna	96ec8fcba1	Preserve replica timestamps in multipart (#18318) Also a backward compatibility fix to use x-amz-replica-status if present as replication status.	2023-10-25 21:24:10 -07:00
Aditya Manthramurthy	1c99fb106c	Update to minio/pkg/v2 (#17967)	2023-09-04 12:57:37 -07:00
Krishnan Parthasarathi	71c32e9b48	Return successorModTime in quorum when available (#17925)	2023-09-04 08:24:17 -07:00
Harshavardhana	18b3655c99	with xlv2 format we never had to fill in checksumInfo() (#17963) - this PR avoids sending a large ChecksumInfo slice when it's not needed - also for a file with XLV2 format there is no reason to allocate Checksum slice while reading	2023-09-01 13:45:58 -07:00
Harshavardhana	1ea7826c0e	do not have to consider replicationTimestamp for healing and quorum (#17922) replicationTimestamp might differ if there were retries in replication and the retried attempt overwrote in quorum but enough shards with newer timestamp causing the existing timestamps on xl.meta to be invalid. We do not rely on this value for anything external. This is purely a hint for debugging purposes, and there is no real value in it: considering the object itself is intact, we do not have to spend time healing this situation. We may consider healing this situation in the future, but that needs to be decoupled to make sure that we do not overcalculate how much we have to heal.	2023-08-25 15:31:15 -07:00
Harshavardhana	9ebd10d3f4	Revert "Include SuccessorModTime for FileInfo quorum (#17732)" (#17860) This reverts commit `bf3901342c`. This is to fix a regression caused when there are inconsistent versions, but one version is in quorum. SuccessorModTime issue must be fixed differently.	2023-08-16 07:51:33 -07:00
Krishnan Parthasarathi	bf3901342c	Include SuccessorModTime for FileInfo quorum (#17732)	2023-07-26 17:04:16 -07:00
Harshavardhana	a566bcf613	treat 0-byte objects to honor same quorum as delete marker (#17633) On unversioned buckets it's possible that 0-byte objects might lose quorum on flaky systems; allow them to be treated the same as DELETE markers, since practically speaking they have no content.	2023-07-11 21:53:49 -07:00
Harshavardhana	1443b5927a	allow quorum fileInfo to pick same parityBlocks (#17454) Bonus: allow replication to proceed for 503 errors such as with error code SlowDownRead	2023-06-18 18:20:15 -07:00
Harshavardhana	64de61d15d	fallback on etags if they match when mtime is not same (#17424) On "unversioned" buckets there are situations when successive concurrent I/O can lead to an inconsistent state with mtime while the etag might be the same for the object on disk. In such a scenario it is possible for us to allow reading of the object: since the etag matches, and if the etag matches we are guaranteed that we have enough copies, the object will be readable and the same. This PR allows fallback in such scenarios.	2023-06-17 19:18:20 -07:00
jiuker	0474791cf8	fix: set time format right (#17402)	2023-06-14 07:49:13 -07:00
Harshavardhana	d5059840ef	fix: for delete marked objects choose appropriate parity (#17287)	2023-05-26 09:57:44 -07:00
Anis Eleuch	a30a55f3b1	Add object parity in listing V2M and listing versions M (#17238)	2023-05-19 09:42:45 -07:00
Praveen raj Mani	72802a5972	Use 'minio/pkg/sync/errgroup' and 'minio/pkg/workers' (#17069)	2023-04-25 22:57:40 -07:00
Harshavardhana	8fd07bcd51	simplify sort.Sort by using sort.Slice (#17066)	2023-04-24 13:28:18 -07:00
Harshavardhana	6825bd7e75	fix: inlined objects don't need to honor long locks (#17039)	2023-04-17 12:16:37 -07:00
Krishnan Parthasarathi	f92450d8b3	commonParity should pick readable FileInfo (#17032)	2023-04-14 16:23:28 -07:00
Harshavardhana	72daccd468	fix: scanner in healing cycle must use actual size (#16589)	2023-02-10 06:53:03 -08:00
Harshavardhana	c242e6c391	fix: calculate common parity properly (#16406)	2023-01-13 03:28:16 +05:30
Harshavardhana	2937711390	fix: DeleteObject() API with versionId under replication (#16325)	2022-12-28 22:48:33 -08:00
Krishnan Parthasarathi	40a2c6b882	Return remote tier as StorageClass for transitioned objects (#16035)	2022-11-09 15:57:34 -08:00
Anis Elleuch	783dd875f7	refactor objectQuorumFromMeta() to search for parity quorum (#15844)	2022-10-12 16:42:45 -07:00
Harshavardhana	228c6686f8	allow non-standards fallback for all http.TimeFormats (#15662) fixes #15645	2022-09-07 07:24:54 -07:00
Harshavardhana	7776d064cf	allow non-standards fallback for Expires header (#15655) fixes #15645	2022-09-05 19:18:18 -07:00
Klaus Post	a9f1ad7924	Add extended checksum support (#15433)	2022-08-29 16:57:16 -07:00
Harshavardhana	65166e4ce4	fix: readQuorum calculation when defaultParityCount is 0 (#15363) when parity is '0' the readQuorum must be equal to the number of data disks.	2022-07-21 07:25:54 -07:00
Harshavardhana	ce8397f7d9	use partInfo only for intermediate part.x.meta (#15353)	2022-07-19 18:56:24 -07:00
Klaus Post	911a17b149	Add compressed file index (#15247)	2022-07-11 17:30:56 -07:00
Harshavardhana	9c605ad153	allow support for parity '0', '1' enabling support for 2,3 drive setups (#15171) allows for further granular setups - 2 drives (1 parity, 1 data) - 3 drives (1 parity, 2 data) Bonus: allows '0' parity as well.	2022-06-27 20:22:18 -07:00
Harshavardhana	52221db7ef	fix: for unexpected errors in reading versioning config panic (#14994) We need to make sure that if we cannot read bucket metadata for some reason, and the bucket metadata is not missing but is returning corrupted information, we panic in such handlers to disallow I/O and protect the overall state of the system. In case of such corruption we now have a mechanism to force recreate the metadata on the bucket, using the `x-minio-force-create` header with the `PUT /bucket` API call. Additionally, fix the versioning config updated state to be set properly for the site replication healing to trigger correctly.	2022-05-31 02:57:57 -07:00
Harshavardhana	f1abb92f0c	feat: Single drive XL implementation (#14970) Main motivation is to move towards a common backend format for all different types of modes in MinIO, allowing for simpler code and predictable behavior across all features. This PR also brings features such as versioning, replication, transitioning to single drive setups.	2022-05-30 10:58:37 -07:00
Harshavardhana	9d07cde385	use crypto/sha256 only for FIPS 140-2 compliance (#14983) It would seem like the PR #11623 had chewed more than it wanted to, non-fips build shouldn't really be forced to use slower crypto/sha256 even for presumed "non-performance" codepaths. In MinIO there are really no "non-performance" codepaths. This assumption seems to have had an adverse effect in certain areas of CPU usage. This PR ensures that we stick to sha256-simd on all non-FIPS builds, our most common build to ensure we get the best out of the CPU at any given point in time.	2022-05-27 06:00:19 -07:00
Anis Elleuch	77dc99e71d	Do not use inline data size in xl.meta quorum calculation (#14831) * Do not use inline data size in xl.meta quorum calculation Data shards of one object can have different inline/not-inline decisions on multiple disks. This happens with outdated disks when the inline decision changes. For example, enabling bucket versioning configuration will change the small file threshold. When the parity of an object becomes low, GET object can return 503 because it is unable to calculate the xl.meta quorum, just because some xl.meta have inline data and others do not. So this commit disables taking the size of the inline data into consideration when calculating the xl.meta quorum. * Add tests for simultaneous inline/notinline object Co-authored-by: Anis Elleuch <anis@min.io>	2022-05-24 06:26:38 -07:00
Krishnan Parthasarathi	ad8e611098	feat: implement prefix-level versioning exclusion (#14828) Spark/Hadoop workloads which use Hadoop MR Committer v1/v2 algorithm upload objects to a temporary prefix in a bucket. These objects are 'renamed' to a different prefix on Job commit. Object storage admins are forced to configure separate ILM policies to expire these objects and their versions to reclaim space. Our solution: This can be avoided by simply marking objects under these prefixes to be excluded from versioning, as shown below. Consequently, these objects are excluded from replication, and don't require ILM policies to prune unnecessary versions. - MinIO Extension to Bucket Version Configuration ```xml <VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"> <Status>Enabled</Status> <ExcludeFolders>true</ExcludeFolders> <ExcludedPrefixes> <Prefix>app1-jobs//_temporary/</Prefix> </ExcludedPrefixes> <ExcludedPrefixes> <Prefix>app2-jobs//__magic/</Prefix> </ExcludedPrefixes> <!-- .. up to 10 prefixes in all --> </VersioningConfiguration> ``` Note: `ExcludeFolders` excludes all folders in a bucket from versioning. This is required to prevent the parent folders from accumulating delete markers, especially those which are shared across spark workloads spanning projects/teams. - To enable version exclusion on a list of prefixes ``` mc version enable --excluded-prefixes "app1-jobs//_temporary/,app2-jobs//_magic," --exclude-prefix-marker myminio/test ```	2022-05-06 19:05:28 -07:00
Krishnan Parthasarathi	5a0c0079a1	Don't add free-version on restore-object (#14340)	2022-02-17 15:05:19 -08:00
Harshavardhana	0256dae657	fix: quorum requirement for DeleteMarkers and parity upgraded objects (#14248) DeleteMarkers do not have a default quorum, i.e., it is possible that DeleteMarkers were created with n/2+1 quorum as well. To satisfy such situations we need to make sure delete markers only expect n/2 read quorum. Additionally, we should also look at additional metadata on the actual objects that might have been "erasure" upgraded with new parity when disks are down. In such a scenario do not default to the standard storage class parity; instead use the parityBlocks present on the FileInfo to ensure that we are dealing with the correct quorum for READs and DELETEs.	2022-02-04 02:47:36 -08:00
Harshavardhana	54ec0a1308	add configurable delta for skipping shards (#13967) This PR is an attempt to make this configurable as not all situations have the same level of tolerable delta, i.e., disks are replaced days apart or even hours. There is also a possibility that nodes have drifted in time when NTP is not configured on the system.	2021-12-22 11:43:01 -08:00
Harshavardhana	0e3037631f	skip inconsistent shards if possible (#13945) Data shards were wrong due to a healing bug reported in #13803, mainly with unaligned object sizes. This PR is an attempt to automatically avoid these shards, using available information about the `xl.meta` and the actual disk mtime.	2021-12-21 10:08:26 -08:00
Harshavardhana	28f95f1fbe	quorum calculation getLatestFileInfo should be itself (#13717) FileInfo quorum shouldn't be passed down, instead inferred after obtaining a maximally occurring FileInfo. This PR also changes other functions that rely on wrong quorum calculation. Update tests as well to handle the proper requirement. All these changes are needed when migrating from older deployments where we used to set N/2 quorum for reads to EC:4 parity in newer releases.	2021-11-22 09:36:29 -08:00
Harshavardhana	c791de0e1e	re-implement pickValidInfo dataDir, move to quorum calculation (#13681) dataDir loosely based on maxima is incorrect and does not work in all situations such as disks in the following order - xl.json migration to xl.meta there may be partial xl.json's leftover if some disks are not yet connected when the disk is yet to come up, since the xl.json mtime and xl.meta mtime are the same the dataDir maxima doesn't work properly, leading to quorum issues. - it's also possible that XLV1 might be true among the disks available, make sure to keep FileInfo based on common quorum and skip unexpected disks with the older data format. Also, this PR tests upgrade from older to a newer release if the data is readable and matches the checksum. NOTE: this is just initial work we can build on top of this to do further tests.	2021-11-21 10:41:30 -08:00
Klaus Post	faf013ec84	Improve performance on multiple versions (#13573) Existing: ```go type xlMetaV2 struct { Versions []xlMetaV2Version `json:"Versions" msg:"Versions"` } ``` Serialized as regular MessagePack. ```go //msgp:tuple xlMetaV2VersionHeader type xlMetaV2VersionHeader struct { VersionID [16]byte ModTime int64 Type VersionType Flags xlFlags } ``` Serialize as streaming MessagePack, format: ``` int(headerVersion) int(xlmetaVersion) int(nVersions) for each version { binary blob, xlMetaV2VersionHeader, serialized binary blob, xlMetaV2Version, serialized. } ``` xlMetaV2VersionHeader is <= 30 bytes serialized. Deserialized struct can easily be reused and does not contain pointers, so efficient as a slice (single allocation) This allows quickly parsing everything as slices of bytes (no copy). Versions are always saved sorted by modTime, newest first. No more need to sort on load. * Allows checking if a version exists. * Allows reading single version without unmarshal all. * Allows reading latest version of type without unmarshal all. * Allows reading latest version without unmarshal of all. * Allows checking if the latest is deleteMarker by reading first entry. * Allows adding/updating/deleting a version with only header deserialization. * Reduces allocations on conversion to FileInfo(s).	2021-11-18 12:15:22 -08:00
Harshavardhana	661b263e77	add gocritic/ruleguard checks back again, cleanup code. (#13665) - remove some duplicated code - reported a bug, separately fixed in #13664 - using strings.ReplaceAll() when needed - using filepath.ToSlash() use when needed - remove all non-Go style comments from the codebase Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>	2021-11-16 09:28:29 -08:00
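
Several of the messages above (#19250, #15363, #14248) describe the same read quorum rule: a delete marker expects half of the drives in the erasure set, while a regular object, including one written with parity 0, expects as many drives as it has data blocks. The following is a minimal sketch of that rule only, not MinIO's actual code. The function name `readQuorum` and its parameters are hypothetical, and the explicit `deleteMarker` flag is an assumption standing in for whatever the real code uses to avoid inferring the version type from parity.

```go
package main

import "fmt"

// readQuorum is a hypothetical helper illustrating the rule described in
// the commit messages; it is not a function from the MinIO codebase.
//
//	drives:       number of drives in the erasure set
//	parity:       parity blocks recorded for this object version
//	deleteMarker: true if the version is a delete marker
func readQuorum(drives, parity int, deleteMarker bool) int {
	if deleteMarker {
		// Delete markers carry no data, so half of the drives is enough.
		return drives / 2
	}
	// Regular object: quorum equals the number of data drives.
	// With parity 0 this is every drive in the set.
	return drives - parity
}

func main() {
	fmt.Println(readQuorum(4, 0, false)) // EC=0 object: 4
	fmt.Println(readQuorum(4, 2, false)) // EC=2 object: 2
	fmt.Println(readQuorum(4, 0, true))  // delete marker: 2
}
```

Passing the version type explicitly keeps an EC=0 object from being mistaken for a delete marker, which is the failure mode #19250 describes.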

91 Commits
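
The versioning configuration shown in the message for #14828 can be read with Go's standard `encoding/xml` package. This is a sketch under stated assumptions: the type names `versioningConfig` and `excludedPrefix`, the function `parseVersioning`, and the constant `maxExcludedPrefixes` are hypothetical and are not taken from the MinIO source. Only the element names and the limit of 10 prefixes come from the commit message, and the sample prefixes are copied from it as they appear.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// maxExcludedPrefixes mirrors the "up to 10 prefixes" note in the commit
// message; the constant name is hypothetical.
const maxExcludedPrefixes = 10

// sampleConfig reproduces the XML from the commit message.
const sampleConfig = `<VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Status>Enabled</Status>
  <ExcludeFolders>true</ExcludeFolders>
  <ExcludedPrefixes><Prefix>app1-jobs//_temporary/</Prefix></ExcludedPrefixes>
  <ExcludedPrefixes><Prefix>app2-jobs//__magic/</Prefix></ExcludedPrefixes>
</VersioningConfiguration>`

// excludedPrefix and versioningConfig are illustrative types, not MinIO's.
type excludedPrefix struct {
	Prefix string `xml:"Prefix"`
}

type versioningConfig struct {
	XMLName          xml.Name         `xml:"VersioningConfiguration"`
	Status           string           `xml:"Status"`
	ExcludeFolders   bool             `xml:"ExcludeFolders"`
	ExcludedPrefixes []excludedPrefix `xml:"ExcludedPrefixes"`
}

// parseVersioning decodes the document and enforces the prefix limit.
func parseVersioning(doc string) (versioningConfig, error) {
	var cfg versioningConfig
	if err := xml.Unmarshal([]byte(doc), &cfg); err != nil {
		return cfg, err
	}
	if len(cfg.ExcludedPrefixes) > maxExcludedPrefixes {
		return cfg, fmt.Errorf("too many excluded prefixes: %d", len(cfg.ExcludedPrefixes))
	}
	return cfg, nil
}

func main() {
	cfg, err := parseVersioning(sampleConfig)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(cfg.Status, cfg.ExcludeFolders, len(cfg.ExcludedPrefixes))
}
```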