Each Put, List, Multipart operations heavily rely on making
GetBucketInfo() call to verify if bucket exists or not on
a regular basis. This has a large performance cost when there
are tons of servers involved.
We did optimize this part by vectorizing the bucket calls,
however its not enough, beyond 100 nodes and this becomes
fairly visible in terms of performance.
Do not error out when a provided marker is before or after the prefix, but instead just ignore it if before and return an empty list when after.
Fixes#18093
Currently we have IOPs of these patterns
```
[OS] os.Mkdir play.min.io:9000 /disk1 2.718µs
[OS] os.Mkdir play.min.io:9000 /disk1/data 2.406µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys 4.068µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp 2.843µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp/d89c8ceb-f8d1-4cc6-b483-280f87c4719f 20.152µs
```
It can be seen that we can save quite Nx levels such as
if your drive is mounted at `/disk1/minio` you can simply
skip sending an `Mkdir /disk1/` and `Mkdir /disk1/minio`.
Since they are expected to exist already, this PR adds a way
for us to ignore all paths upto the mount or a directory which
ever has been provided to MinIO setup.
Two fields in lifecycles made GOB encoding consistently fail with `gob: type lifecycle.Prefix has no exported fields`.
This meant that in distributed systems listings would never be able to continue and would restart on every call.
Fix issues and be sure to log these errors at least once per bucket. We may see some connectivity errors here, but we shouldn't hide them.
Simplify MRF queueing and add backlog handler
- Limit re-tries to 3 to avoid repeated re-queueing. Fall offs
to be re-tried when the scanner revisits this object or upon access.
- Change MRF to have each node process only its MRF entries.
- Collect MRF backlog by the node to allow for current backlog visibility
Queue failed/pending replication for healing during listing and GET/HEAD
API calls. This includes healing of existing objects that were never
replicated or those in the middle of a resync operation.
This PR also fixes a bug in ListObjectVersions where lifecycle filtering
should be done.
mergeEntryChannels has the potential to perpetually
wait on the results channel, context might be closed
and we did not honor the caller context canceling.
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.
In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.
Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
Main motivation is move towards a common backend format
for all different types of modes in MinIO, allowing for
a simpler code and predictable behavior across all features.
This PR also brings features such as versioning, replication,
transitioning to single drive setups.
For ListObjects and ListObjectsV2 perform lifecycle checks on
all objects before returning. This will filter out objects that are
pending lifecycle expiration.
Bonus: Cheaper server pool conflict resolution by not converting to FileInfo.
Stop async listing if we have not heard back from the client for 3 minutes.
This will stop spending resources on async listings when they are unlikely to get used.
If the client returns a new listing will be started on the second request.
Stop saving cache metadata to disk. It is cleared on restarts anyway. Removes all
load/save functionality
baseDir is empty if the top level prefix does not
end with `/` this causes large recursive listings
without any filtering, to fix this filtering make
sure to set the filter prefix appropriately.
also do not navigate folders at top level that do
not match the filter prefix, entries don't need
to match prefix since they are never prefixed
with the prefix anyways.
Fixes brought forward from https://github.com/minio/minio/pull/12605
Fixes resolution when an object is in prefix of another and one zone returns the directory and another the object.
Fixes resolution on single entries that arrive first, so resolution doesn't depend on order.
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
ListObjectVersions would skip past the object in the marker when version id is specified.
Make `listPath` return the object with the marker and truncate it if not needed.
Avoid having to parse unintended objects to find a version marker.
most of the delete calls today spend time in
a blocking operation where multiple calls need
to be recursively sent to delete the objects,
instead we can use rename operation to atomically
move the objects from the namespace to `tmp/.trash`
we can schedule deletion of objects at this
location once in 15, 30mins and we can also add
wait times between each delete operation.
this allows us to make delete's faster as well
less chattier on the drives, each server runs locally
a groutine which would clean this up regularly.
store the cache in-memory instead of disks to avoid large
write amplifications for list heavy workloads, store in
memory instead and let it auto expire.
currently crawler waits for an entire readdir call to
return until it processes usage, lifecycle, replication
and healing - instead we should pass the applicator all
the way down to avoid building any special stack for all
the contents in a single directory.
This allows for
- no need to remember the entire list of entries per directory
before applying the required functions
- no need to wait for entire readdir() call to finish before
applying the required functions
To avoid large delays in metacache cleanup, use rename
instead of recursive delete calls, renames are cheaper
move the content to minioMetaTmpBucket and then cleanup
this folder once in 24hrs instead.
If the new cache can replace an existing one, we should
let it replace since that is currently being saved anyways,
this avoids pile up of 1000's of metacache entires for
same listing calls that are not necessary to be stored
on disk.