Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c
Gist of improvements:
* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no
longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
Only use dynamic delays for the crawler. Even though the max wait was 1 second the number
of waits could severely impact crawler speed.
Instead of relying on a global metric, we use the stateless local delays to keep the crawler
running at a speed more adjusted to current conditions.
The only case we keep it is before bitrot checks when enabled.
In almost all scenarios MinIO now is
mostly ready for all sub-systems
independently, safe-mode is not useful
anymore and do not serve its original
intended purpose.
allow server to be fully functional
even with config partially configured,
this is to cater for availability of actual
I/O v/s manually fixing the server.
In k8s like environments it will never make
sense to take pod into safe-mode state,
because there is no real access to perform
any remote operation on them.
- select lockers which are non-local and online to have
affinity towards remote servers for lock contention
- optimize lock retry interval to avoid sending too many
messages during lock contention, reduces average CPU
usage as well
- if bucket is not set, when deleteObject fails make sure
setPutObjHeaders() honors lifecycle only if bucket name
is set.
- fix top locks to list out always the oldest lockers always,
avoid getting bogged down into map's unordered nature.
`mc admin info` on busy setups will not move HDD
heads unnecessarily for repeated calls, provides
a better responsiveness for the call overall.
Bonus change allow listTolerancePerSet be N-1
for good entries, to avoid skipping entries
for some reason one of the disk went offline.
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.
There are more pending changes where we should move
all the disk formatting logic to backend drives, this
PR doesn't deal with this refactor instead makes it
easier to track healing in the future.
From https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html#intro-lifecycle-rules-actions
```
When specifying the number of days in the NoncurrentVersionTransition
and NoncurrentVersionExpiration actions in a Lifecycle configuration,
note the following:
It is the number of days from when the version of the object becomes
noncurrent (that is, when the object is overwritten or deleted), that
Amazon S3 will perform the action on the specified object or objects.
Amazon S3 calculates the time by adding the number of days specified in
the rule to the time when the new successor version of the object is
created and rounding the resulting time to the next day midnight UTC.
For example, in your bucket, suppose that you have a current version of
an object that was created at 1/1/2014 10:30 AM UTC. If the new version
of the object that replaces the current version is created at 1/15/2014
10:30 AM UTC, and you specify 3 days in a transition rule, the
transition date of the object is calculated as 1/19/2014 00:00 UTC.
```
Add context to all (non-trivial) calls to the storage layer.
Contexts are propagated through the REST client.
- `context.TODO()` is left in place for the places where it needs to be added to the caller.
- `endWalkCh` could probably be removed from the walkers, but no changes so far.
The "dangerous" part is that now a caller disconnecting *will* propagate down, so a
"delete" operation will now be interrupted. In some cases we might want to disconnect
this functionality so the operation completes if it has started, leaving the system in a cleaner state.
- delete-marker should be created on a suspended bucket as `null`
- delete-marker should delete any pre-existing `null` versioned
object and create an entry `null`
When checking parts we already do a stat for each part.
Since we have the on disk size check if it is at least what we expect.
When checking metadata check if metadata is 0 bytes.
We can reduce this further in the future, but this is a good
value to keep around. With the advent of continuous healing,
we can be assured that namespace will eventually be
consistent so we are okay to avoid the necessity to
a list across all drives on all sets.
Bonus Pop()'s in parallel seem to have the potential to
wait too on large drive setups and cause more slowness
instead of gaining any performance remove it for now.
Also, implement load balanced reply for local disks,
ensuring that local disks have an affinity for
- cleanupStaleMultipartUploads()
When crawling never use a disk we know is healing.
Most of the change involves keeping track of the original endpoint on xlStorage
and this also fixes DiskInfo.Endpoint never being populated.
Heal master will print `data-crawl: Disk "http://localhost:9001/data/mindev/data2/xl1" is
Healing, skipping` once on a cycle (no more often than every 5m).
fresh drive setups when one of the drive is
a root drive, we should ignore such a root
drive and not proceed to format.
This PR handles this properly by marking
the disks which are root disk and they are
taken offline.
In a non recursive mode, issuing a list request where prefix
is an existing object with a slash and delimiter is a slash will
return entries in the object directory (data dir IDs)
```
$ aws s3api --profile minioadmin --endpoint-url http://localhost:9000 \
list-objects-v2 --bucket testbucket --prefix code_of_conduct.md/ --delimiter '/'
{
"CommonPrefixes": [
{
"Prefix":
"code_of_conduct.md/ec750fe0-ea7e-4b87-bbec-1e32407e5e47/"
}
]
}
```
This commit adds a fast exit track in Walk() in this specific case.
It is possible in situations when server was deployed
in asymmetric configuration in the past such as
```
minio server ~/fs{1...4}/disk{1...5}
```
Results in setDriveCount of 10 in older releases
but with fairly recent releases we have moved to
having server affinity which means that a set drive
count ascertained from above config will be now '4'
While the object layer make sure that we honor
`format.json` the storageClass configuration however
was by mistake was using the global value obtained
by heuristics. Which leads to prematurely using
lower parity without being requested by the an
administrator.
This PR fixes this behavior.
this is to detect situations of corruption disk
format etc errors quickly and keep the disk online
in such scenarios for requests to fail appropriately.
This PR adds support for healing older
content i.e from 2yrs, 1yr. Also handles
other situations where our config was
not encrypted yet.
This PR also ensures that our Listing
is consistent and quorum friendly,
such that we don't list partial objects
- admin info node offline check is now quicker
- admin info now doesn't duplicate the code
across doing the same checks for disks
- rely on StorageInfo to return appropriate errors
instead of calling locally.
- diskID checks now return proper errors when
disk not found v/s format.json missing.
- add more disk states for more clarity on the
underlying disk errors.
while we handle all situations for writes and reads
on older format, what we didn't cater for properly
yet was delete where we only ended up deleting
just `xl.meta` - instead we should allow all the
deletes to go through for older format without
versioning enabled buckets.
Currently, lifecycle expiry is deleting all object versions which is not
correct, unless noncurrent versions field is specified.
Also, only delete the delete marker if it is the only version of the
given object.
The S3 specification says that versions are ordered in the response of
list object versions.
mc snapshot needs this to know which version comes first especially when
two versions have the same exact last-modified field.
Looking into full disk errors on zoned setup. We don't take the
5% space requirement into account when selecting a zone.
The interesting part is that even considering this we don't
know the size of the object the user wants to upload when
they do multipart uploads.
It seems quite defensive to always upload multiparts to
the zone where there is the most space since all load will
be directed to a part of the cluster.
In these cases we make sure it can at least hold a 1GiB file
and we disadvantage fuller zones more by subtracting the
expected size before weighing.
This PR fixes all the below scenarios
and handles them correctly.
- existing data/bucket is replaced with
new content, no versioning enabled old
structure vanishes.
- existing data/bucket - enable versioning
before uploading any data, once versioning
enabled upload new content, old content
is preserved.
- suspend versioning on the bucket again, now
upload content again the old content is purged
since that is the default "null" version.
Additionally sync data after xl.json -> xl.meta
rename(), to avoid any surprises if there is a
crash during this rename operation.
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111