Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.
This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.
This helps reduce the chattiness for following calls..
- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.
- add API to get replication metrics
- add MRF worker to handle spill-over replication operations
- multiple issues found with replication
- fixes an issue when client sends a bucket
name with `/` at the end from SetRemoteTarget
API call make sure to trim the bucket name to
avoid any extra `/`.
- hold write locks in GetObjectNInfo during replication
to ensure that object version stack is not overwritten
while reading the content.
- add additional protection during WriteMetadata() to
ensure that we always write a valid FileInfo{} and avoid
ever writing empty FileInfo{} to the lowest layers.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Some deployments have low parity (EC:2), but we really do not need to
save our config data with the same parity configuration.
N/2 would be better to keep MinIO configurations intact when unexpected
a number of drives fail.
* Provide information on *actively* healing, buckets healed/queued, objects healed/failed.
* Add concurrent healing of multiple sets (typically on startup).
* Add bucket level resume, so restarts will only heal non-healed buckets.
* Print summary after healing a disk is done.
also re-use storage disks for all `mc admin server info`
calls as well, implement a new LocalStorageInfo() API
call at ObjectLayer to lookup local disks storageInfo
also fixes bugs where there were double calls to StorageInfo()
This commit removes the `GetObject` method
from the `ObjectLayer` interface.
The `GetObject` method is not longer used by
the HTTP handlers implementing the high-level
S3 semantics. Instead, they use the `GetObjectNInfo`
method which returns both, an object handle as well
as the object metadata.
Therefore, it is no longer necessary that a concrete
`ObjectLayer` implements `GetObject`.
In PR #11165 due to incorrect proxying for 2
way replication even when the object was not
yet replicated
Additionally, fix metadata comparisons when
deciding to do full replication vs metadata copy.
fixes#11340
```
mc admin config set alias/ storage_class standard=EC:3
```
should only succeed if parity ratio is valid for all
server pools, if not we should fail proactively.
This PR also needs to bring other changes now that
we need to cater for variadic drive counts per pool.
Bonus fixes also various bugs reproduced with
- GetObjectWithPartNumber()
- CopyObjectPartWithOffsets()
- CopyObjectWithMetadata()
- PutObjectPart,PutObject with truncated streams
parentDirIsObject is not using set level understanding
to check for parent objects, without this it can lead to
objects that can actually reside on a separate set as
objects and would conflict.
Synchronous replication can be enabled by setting the --sync
flag while adding a remote replication target.
This PR also adds proxying on GET/HEAD to another node in a
active-active replication setup in the event of a 404 on the current node.
Optimizations include
- do not write the metacache block if the size of the
block is '0' and it is the first block - where listing
is attempted for a transient prefix, this helps to
avoid creating lots of empty metacache entries for
`minioMetaBucket`
- avoid the entire initialization sequence of cacheCh
, metacacheBlockWriter if we are simply going to skip
them when discardResults is set to true.
- No need to hold write locks while writing metacache
blocks - each block is unique, per bucket, per prefix
and also is written by a single node.
With new refactor of bucket healing, healing bucket happens
automatically including its metadata, there is no need to
redundant heal buckets also in ListBucketsHeal remove
it.
optimization mainly to avoid listing the entire
`.minio.sys/buckets/.minio.sys` directory, this
can get really huge and comes in the way of startup
routines, contents inside `.minio.sys/buckets/.minio.sys`
are rather transient and not necessary to be healed.
This PR adds transition support for ILM
to transition data to another MinIO target
represented by a storage class ARN. Subsequent
GET or HEAD for that object will be streamed from
the transition tier. If PostRestoreObject API is
invoked, the transitioned object can be restored for
duration specified to the source cluster.
Delete marker replication is implemented for V2
configuration specified in AWS spec (though AWS
allows it only in the V1 configuration).
This PR also brings in a MinIO only extension of
replicating permanent deletes, i.e. deletes specifying
version id are replicated to target cluster.
Bonus fixes, we do not need reload format anymore
as the replaced drive is healed locally we only need
to ensure that drive heal reloads the drive properly.
We preserve the UUID of the original order, this means
that the replacement in `format.json` doesn't mean that
the drive needs to be reloaded into memory anymore.
fixes#10791
This is to ensure that Go contexts work properly, after some
interesting experiments I found that Go net/http doesn't
cancel the context when Body is non-zero and hasn't been
read till EOF.
The following gist explains this, this can lead to pile up
of go-routines on the server which will never be canceled
and will die at a really later point in time, which can
simply overwhelm the server.
https://gist.github.com/harshavardhana/c51dcfd055780eaeb71db54f9c589150
To avoid this refactor the locking such that we take locks after we
have started reading from the body and only take locks when needed.
Also, remove contextReader as it's not useful, doesn't work as expected
context is not canceled until the body reaches EOF so there is no point
in wrapping it with context and putting a `select {` on it which
can unnecessarily increase the CPU overhead.
We will still use the context to cancel the lockers etc.
Additional simplification in the locker code to avoid timers
as re-using them is a complicated ordeal avoid them in
the hot path, since locking is very common this may avoid
lots of allocations.
- delete-marker should be created on a suspended bucket as `null`
- delete-marker should delete any pre-existing `null` versioned
object and create an entry `null`
It is possible in situations when server was deployed
in asymmetric configuration in the past such as
```
minio server ~/fs{1...4}/disk{1...5}
```
Results in setDriveCount of 10 in older releases
but with fairly recent releases we have moved to
having server affinity which means that a set drive
count ascertained from above config will be now '4'
While the object layer make sure that we honor
`format.json` the storageClass configuration however
was by mistake was using the global value obtained
by heuristics. Which leads to prematurely using
lower parity without being requested by the an
administrator.
This PR fixes this behavior.
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
Advantages avoids 100's of stats which are needed for each
upload operation in FS/NAS gateway mode when uploading a large
multipart object, dramatically increases performance for
multipart uploads by avoiding recursive calls.
For other gateway's simplifies the approach since
azure, gcs, hdfs gateway's don't capture any specific
metadata during upload which needs handler validation
for encryption/compression.
Erasure coding was already optimized, additionally
just avoids small allocations of large data structure.
Fixes#7206
this is a major overhaul by migrating off all
bucket metadata related configs into a single
object '.metadata.bin' this allows us for faster
bootups across 1000's of buckets and as well
as keeps the code simple enough for future
work and additions.
Additionally also fixes#9396, #9394
This PR is to ensure that we call the relevant object
layer APIs for necessary S3 API level functionalities
allowing gateway implementations to return proper
errors as NotImplemented{}
This allows for all our tests in mint to behave
appropriately and can be handled appropriately as
well.
By monitoring PUT/DELETE and heal operations it is possible
to track changed paths and keep a bloom filter for this data.
This can help prioritize paths to scan. The bloom filter can identify
paths that have not changed, and the few collisions will only result
in a marginal extra workload. This can be implemented on either a
bucket+(1 prefix level) with reasonable performance.
The bloom filter is set to have a false positive rate at 1% at 1M
entries. A bloom table of this size is about ~2500 bytes when serialized.
To not force a full scan of all paths that have changed cycle bloom
filters would need to be kept, so we guarantee that dirty paths have
been scanned within cycle runs. Until cycle bloom filters have been
collected all paths are considered dirty.
As an optimization of the healing, HealObjects() avoid sending an
object to the background healing subsystem when the object is
present in all disks.
However, HealObjects() should have checked the scan type, if this
deep, always pass the object to the healing subsystem.
Change distributed locking to allow taking bulk locks
across objects, reduces usually 1000 calls to 1.
Also allows for situations where multiple clients sends
delete requests to objects with following names
```
{1,2,3,4,5}
```
```
{5,4,3,2,1}
```
will block and ensure that we do not fail the request
on each other.