Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner.
Also removes a tiny bit of CPU by each write operation.
xl.meta gets written and never rolled back, however
we definitely need to validate the state that is
persisted on the disk, if there are inconsistencies
- more than write quorum we should return an error
to the client
- if write quorum was achieved however there are
inconsistent xl.meta's we should simply trigger
an MRF on them
inlined data often is bigger than the allowed
O_DIRECT alignment, so potentially we can write
'xl.meta' without O_DSYNC instead we can rely on
O_DIRECT + fdatasync() instead.
This PR allows O_DIRECT on inlined data that
would gain the benefits of performing O_DIRECT,
eventually performing an fdatasync() at the end.
Performance boost can be observed here for small
objects < 128KiB. The performance boost is mainly
seen on HDD, and marginal on NVMe setups.
a/b/c/d/ where `a/b/c/` exists results in additional syscalls
such as an Lstat() call to verify if the `a/b/c/` exists
and its a directory.
We do not need to do this on MinIO since the parent prefixes
if exist, we can simply return success without spending
additional syscalls.
Also this implementation attempts to simply use Access() calls
to avoid os.Stat() calls since the latter does memory allocation
for things we do not need to use.
Access() is simpler since we have a predictable structure on
the backend and we know exactly how our path structures are.
Do completely independent multipart uploads.
In distributed mode, a lock was held to merge each multipart
upload as it was added. This lock was highly contested and
retries are expensive (timewise) in distributed mode.
Instead, each part adds its metadata information uniquely.
This eliminates the per object lock required for each to merge.
The metadata is read back and merged by "CompleteMultipartUpload"
without locks when constructing final object.
Co-authored-by: Harshavardhana <harsha@minio.io>
fix: change timedvalue to return previous cached value
caller can interpret the underlying error and decide
accordingly, places where we do not interpret the
errors upon timedValue.Get() - we should simply use
the previously cached value instead of returning "empty".
Bonus: remove some unused code
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.
In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.
Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
The test expects from DeleteFile to return errDiskNotFound when the disk
is not available. It calls os.RemoveAll() to remove one disk after XL storage
initialization. However, this latter contains some goroutines which can
race with os.RemoveAll() and then the test fails sporadically with
returning random errors.
The commit will tweak the initialization routine of the XL storage to
only run deletion of temporary and metacache data in the background,
so TestXLStorageDeleteFile won't fail anymore.
heal bucket metadata and IAM entries for
sites participating in site replication from
the site with the most updated entry.
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <aditya@minio.io>
This PR fixes two issues
- The first fix is a regression from #14555, the fix itself in #14555
is correct but the interpretation of that information by the
object layer code for "replication" was not correct. This PR
tries to fix this situation by making sure the "Delete" replication
works as expected when "VersionPurgeStatus" is already set.
Without this fix, there is a DELETE marker created incorrectly on
the source where the "DELETE" was triggered.
- The second fix is perhaps an older problem started since we inlined-data
on the disk for small objects, CopyObject() incorrectly inline's
a non-inlined data. This is due to the fact that we have code where
we read the `part.1` under certain conditions where the size of the
`part.1` is less than the specific "threshold".
This eventually causes problems when we are "deleting" the data that
is only inlined, which means dataDir is ignored leaving such
dataDir on the disk, that looks like an inconsistent content on
the namespace.
fixes#14767
```
tmp = buf[want:]
```
Would potentially crash when `buf` is truncated for some reason
and does not have the expected bytes, this is of course considered
not normal and is an odd situation. But we do not need to crash
here instead allow for errors to be returned and let callers handle
the errors.
This PR simply adds a warning message when it detects older kernel
versions and warn's them about potential performance issues on this
kernel.
The issue can be seen only with parallel I/O across all drives
on denser setups such as 90 drives or 45 drives per server configurations.