Fix various regressions from #18029
* If context is canceled the token is never returned. This will lead to scanner being unable to save and deadlocking.
* Fix backup not being able to get any data (hr empty)
* Reduce backup timeout.
Tiering statistics have been broken for some time now, a regression
was introduced in 6f2406b0b6
Bonus fixes an issue where the objects are not assumed to be
of the 'STANDARD' storage-class for the objects that have
not yet tiered, this should be conditional based on the object's
metadata not a default assumption.
This PR also does some cleanup in terms of implementation,
fixes#18070
Currently, the retry is not fully used when there is no backup copy of
the data usage; use 5 retry attempts when we don't have any valid data,
new or backup, unless we have seen an un-recognized error.
From the Go specification:
"3. If the map is nil, the number of iterations is 0." [1]
Therefore, an additional nil check for before the loop is unnecessary.
[1]: https://go.dev/ref/spec#For_range
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This allows scanner to avoid lengthy scans, skip
things appropriately and also not lose metrics in
any manner.
reduce longer deadlines for usage-cache loads/saves
to match the disk timeout which is 2minutes now per
IOP.
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.
This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.
Add prometheus metric to track credential errors since uptime
Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner.
Also removes a tiny bit of CPU by each write operation.
Healing decisions would align with skipped folder counters. This can lead to files
never being selected for heal checks on "clean" paths.
Use different hashing methods and take objectHealProbDiv into account when
calculating the cycle.
Found by @vadmeste
This PR removes an unnecessary state that gets
passed around for DiskIDs, which is not necessary
since each disk exactly knows which pool and which
set it belongs to on a running system.
Currently cached DiskId's won't work properly
because it always ends up skipping offline disks
and never runs healing when disks are offline, as
it expects all the cached diskIDs to be present
always. This also sort of made things in-flexible
in terms perhaps a new diskID for `format.json`.
(however this is not a big issue)
This is an unnecessary requirement that healing
via scanner needs all drives to be online, instead
healing should trigger even when partial nodes
and drives are available this ensures that we
keep the SLA in-tact on the objects when disks
are offline for a prolonged period of time.
currently getReplicationConfig() failure incorrectly
returns error on unexpected buckets upon upgrade, we
should always calculate usage as much as possible.
Remote caches were not returned correctly, so they would not get updated on save.
Furthermore make some tweaks for more reliable updates.
Invalidate bloom filter to ensure rescan.
Also adding an API to allow resyncing replication when
existing object replication is enabled and the remote target
is entirely lost. With the `mc replicate reset` command, the
objects that are eligible for replication as per the replication
config will be resynced to target if existing object replication
is enabled on the rule.
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
A cache structure will be kept with a tree of usages.
The cache is a tree structure where each keeps track
of its children.
An uncompacted branch contains a count of the files
only directly at the branch level, and contains link to
children branches or leaves.
The leaves are "compacted" based on a number of properties.
A compacted leaf contains the totals of all files beneath it.
A leaf is only scanned once every dataUsageUpdateDirCycles,
rarer if the bloom filter for the path is clean and no lifecycles
are applied. Skipped leaves have their totals transferred from
the previous cycle.
A clean leaf will be included once every healFolderIncludeProb
for partial heal scans. When selected there is a one in
healObjectSelectProb that any object will be chosen for heal scan.
Compaction happens when either:
- The folder (and subfolders) contains less than dataScannerCompactLeastObject objects.
- The folder itself contains more than dataScannerCompactAtFolders folders.
- The folder only contains objects and no subfolders.
- A bucket root will never be compacted.
Furthermore, if a has more than dataScannerCompactAtChildren recursive
children (uncompacted folders) the tree will be recursively scanned and the
branches with the least number of objects will be compacted until the limit
is reached.
This ensures that any branch will never contain an unreasonable amount
of other branches, and also that small branches with few objects don't
take up unreasonable amounts of space.
Whenever a branch is scanned, it is assumed that it will be un-compacted
before it hits any of the above limits. This will make the branch rebalance
itself when scanned if the distribution of objects has changed.
TLDR; With current values: No bucket will ever have more than 10000
child nodes recursively. No single folder will have more than 2500 child
nodes by itself. All subfolders are compacted if they have less than 500
objects in them recursively.
We accumulate the (non-deletemarker) version count for paths as well,
since we are changing the structure anyway.
This PR fixes
- close leaking bandwidth report channel leakage
- remove the closer requirement for bandwidth monitor
instead if Read() fails remember the error and return
error for all subsequent reads.
- use locking for usage-cache.bin updates, with inline
data we cannot afford to have concurrent writes to
usage-cache.bin corrupting xl.meta
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.
- add API to get replication metrics
- add MRF worker to handle spill-over replication operations
- multiple issues found with replication
- fixes an issue when client sends a bucket
name with `/` at the end from SetRemoteTarget
API call make sure to trim the bucket name to
avoid any extra `/`.
- hold write locks in GetObjectNInfo during replication
to ensure that object version stack is not overwritten
while reading the content.
- add additional protection during WriteMetadata() to
ensure that we always write a valid FileInfo{} and avoid
ever writing empty FileInfo{} to the lowest layers.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
This commit adds a new package `etag` for dealing
with S3 ETags.
Even though ETag is often viewed as MD5 checksum of
an object, handling S3 ETags correctly is a surprisingly
complex task. While it is true that the ETag corresponds
to the MD5 for the most basic S3 API operations, there are
many exceptions in case of multipart uploads or encryption.
In worse, some S3 clients expect very specific behavior when
it comes to ETags. For example, some clients expect that the
ETag is a double-quoted string and fail otherwise.
Non-AWS compliant ETag handling has been a source of many bugs
in the past.
Therefore, this commit adds a dedicated `etag` package that provides
functionality for parsing, generating and converting S3 ETags.
Further, this commit removes the ETag computation from the `hash`
package. Instead, the `hash` package (i.e. `hash.Reader`) should
focus only on computing and verifying the content-sha256.
One core feature of this commit is to provide a mechanism to
communicate a computed ETag from a low-level `io.Reader` to
a high-level `io.Reader`.
This problem occurs when an S3 server receives a request and
has to compute the ETag of the content. However, the server
may also wrap the initial body with several other `io.Reader`,
e.g. when encrypting or compressing the content:
```
reader := Encrypt(Compress(ETag(content)))
```
In such a case, the ETag should be accessible by the high-level
`io.Reader`.
The `etag` provides a mechanism to wrap `io.Reader` implementations
such that the `ETag` can be accessed by a type-check.
This technique is applied to the PUT, COPY and Upload handlers.
also additionally make sure errors during deserializer closes
the reader with right error type such that Write() end
actually see the final error, this avoids a waitGroup usage
and waiting.
If the periodic `case <-t.C:` save gets held up for a long time it will end up
synchronize all disk writes for saving the caches.
We add jitter to per set writes so they don't sync up and don't hold a
lock for the write, since it isn't needed anyway.
If an outage prevents writes for a long while we also add individual
waits for each disk in case there was a queue.
Furthermore limit the number of buffers kept to 2GiB, since this could get
huge in large clusters. This will not act as a hard limit but should be enough
for normal operation.