deleting collection of versions belonging
to same object, we can avoid re-reading
the xl.meta from the disk instead purge
all the requested versions in-memory,
the tradeoff is to allocate a map to de-dup
the versions, allow disks to be read only
once per object.
additionally reduce the data transfer between
nodes by shortening msgp data values.
Removes RLock/RUnlock for updating metadata,
since we already take a write lock to update
metadata, this change removes reading of xl.meta
as well as an additional lock, the performance gain
should increase 3x theoretically for
- PutObjectRetention
- PutObjectLegalHold
This optimization is mainly for Veeam like
workloads that require a certain level of iops
from these API calls, we were losing iops.
offset+length should match the Size() of the individual parts
return 'errFileCorrupt' otherwise, to trigger healing of the individual
parts do not error out prematurely when healing such bitrot's upon
successful parts being written to the client.
another issue this PR fixes is to not return and error to
the client if we have just triggered a heal on a specific
part of the object, instead continue to read all the content
and let the heal happen asynchronously later.
This was a regression introduced in '14bb969782'
this has the potential to cause corruption when
there are concurrent overwrites attempting to update
the content on the namespace.
This PR adds a situation where PutObject(), CopyObject()
compete properly for the same locks with NewMultipartUpload()
however it ends up turning off competing locks for the actual
object with GetObject() and DeleteObject() - since they do not
compete due to concurrent I/O on a versioned bucket it can lead
to loss of versions.
This PR fixes this bug with multi-pool setup with replication
that causes corruption of inlined data due to lack of competing
locks in a multi-pool setup.
Instead CompleteMultipartUpload holds the necessary
locks when finishing the transaction, knowing the exact
location of an object to schedule the multipart upload
doesn't need to compete in this manner, a pool id location
for existing object.
* reduce extra getObjectInfo() calls during ILM transition
This PR also changes expiration logic to be non-blocking,
scanner is now free from additional costs incurred due
to slower object layer calls and hitting the drives.
* move verifying expiration inside locks
Use `readMetadata` when reading version
information without data requested.
Reduces IO on inlined data.
Bonus: Inline compressed data as well when
compression is enabled.
- avoid extra lookup for 'xl.meta' since we are
definitely sure that it doesn't exist.
- use this in newMultipartUpload() as well
- also additionally do not write with O_DSYNC
to avoid loading the drives, instead create
'xl.meta' for listing operations without
O_DSYNC since these are ephemeral objects.
- do the same with newMultipartUpload() since
it gets synced when the PutObjectPart() is
attempted, we do not need to tax newMultipartUpload()
instead.
removes unexpected features from regular putObject() such as
- increasing parity when disks are down, avoids
a lot of DiskInfo() calls.
- triggering MRF for metacache objects
if disks are offline
- avoiding renames from temporary location
to actual namespace, not needed since
metacache files are unique.
we will allow situations such as
```
a/b/1.txt
a/b
```
and
```
a/b
a/b/1.txt
```
we are going to document that this usecase is
not supported and we will never support it, if
any application does this users have to delete
the top level parent to make sure namespace is
accessible at lower level.
rest of the situations where the prefixes get
created across sets are supported as is.
- delete-markers are incorrectly reported
as corrupt with wrong data sent to client
'mc admin heal -r' on objects with delete
marker will report as 'grey' incorrectly.
- do not heal delete-markers during HeadObject()
this can lead to inconsistent order of heals
on the object, although this is not an issue
in terms of order of versions it is rather
simpler to keep the same order on all drives.
- defaultHealResult() should handle 'err == nil'
case such that valid cases should be handled
as 'drive' status OK.
This allows remote bucket admin to identify the origin of transitioned
objects by simply inspecting the object prefixes.
e.g let's take a remote tier TIER-1 pointing to a remote bucket (prefix)
testbucket/testprefix-1. The remote bucket admin can list all transitioned objects
from a MinIO deployment identified by '2e78e906-1c5d-4f94-8689-9df44cafde39' and
source bucket 'mybucket' like so,
```
$ ./mc ls -r minio-tier-target/testbucket/testprefix-1/2e78e906-1c5d-4f94-8689-9df44cafde39/mybucket/
[2021-07-12 17:15:50 PDT] 160B 48/fb/48fbc0e6-3a73-458b-9337-8e722c619ca4
[2021-07-12 16:58:46 PDT] 160B 7d/1c/7d1c96bd-031a-48d4-99ea-b1304e870830
```
This commit gathers MRF metrics from
all nodes in a cluster and return it to the caller. This will show information about the
number of objects in the MRF queues
waiting to be healed.
DiskInfo() calls can stagger and wait if run
serially timing out 10secs per drive, to avoid
this lets check DiskInfo in parallel to avoid
delays when nodes get disconnected.
- Adds versioning support for S3 based remote tiers that have versioning
enabled. This ensures that when reading or deleting we specify the specific
version ID of the object. In case of deletion, this is important to ensure that
the object version is actually deleted instead of simply being marked for
deletion.
- Stores the remote object's version id in the tier-journal. Tier-journal file
version is not bumped up as serializing the new struct version is
compatible with old journals without the remote object version id.
- `storageRESTVersion` is bumped up as FileInfo struct now includes a
`TransitionRemoteVersionID` member.
- Azure and GCS support for this feature will be added subsequently.
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
In cases where a cluster is degraded, we do not uphold our consistency
guarantee and we will write fewer erasure codes and rely on healing
to recreate the missing shards.
In some cases replacing known bad disks in practice take days.
We want to change the behavior of a known degraded system to keep
the erasure code promise of the storage class for each object.
This will create the objects with the same confidence as a fully
functional cluster. The tradeoff will be that objects created
during a partial outage will take up slightly more space.
This means that when the storage class is EC:4, there should
always be written 4 parity shards, even if some disks are unavailable.
When an object is created on a set, the disks are immediately
checked. If any disks are unavailable additional parity shards
will be made for each offline disk, up to 50% of the number of disks.
We add an internal metadata field with the actual and intended
erasure code level, this can optionally be picked up later by
the scanner if we decide that data like this should be re-sharded.
This PR fixes two bugs
- Remove fi.Data upon overwrite of objects from inlined-data to non-inlined-data
- Workaround for an existing bug on disk with latest releases to ignore fi.Data
and instead read from the disk for non-inlined-data
- Addtionally add a reserved metadata header to indicate data is inlined for
a given version.
However, this slice is also used for closing the writers, so close is never called on these.
Furthermore when an error is returned from a write it is now reported to the reader.
bonus: remove unused heal param from `newBitrotWriter`.
* Remove copy, now that we don't mutate.
At some places bloom filter tracker was getting
updated for `.minio.sys/tmp` bucket, there is no
reason to update bloom filters for those.
And add a missing bloom filter update for MakeBucket()
Bonus: purge unused function deleteEmptyDir()
- GetObject() should always use a common dataDir to
read from when it starts reading, this allows the
code in erasure decoding to have sane expectations.
- Healing should always heal on the common dataDir, this
allows the code in dangling object detection to purge
dangling content.
These both situations can happen under certain types of
retries during PUT when server is restarting etc, some
namespace entries might be left over.
wait groups are necessary with io.Pipes() to avoid
races when a blocking function may not be expected
and a Write() -> Close() before Read() races on each
other. We should avoid such situations..
Co-authored-by: Klaus Post <klauspost@gmail.com>
cleanup functions should never be cleaned before the reader is
instantiated, this type of design leads to situations where order
of lockers and places for them to use becomes confusing.
Allow WithCleanupFuncs() if the caller wishes to add cleanupFns
to be run upon close() or an error during initialization of the
reader.
Also make sure streams are closed before we unlock the resources,
this allows for ordered cleanup of resources.
upon errors to acquire lock context would still leak,
since the cancel would never be called. since the lock
is never acquired - proactively clear it before returning.
* lock: Always cancel the returned Get(R)Lock context
There is a leak with cancel created inside the locking mechanism. The
cancel purpose was to cancel operations such erasure get/put that are
holding non-refreshable locks.
This PR will ensure the created context.Cancel is passed to the unlock
API so it will cleanup and avoid leaks.
* locks: Avoid returning nil cancel in local lockers
Since there is no Refresh mechanism in the local locking mechanism, we
do not generate a new context or cancel. Currently, a nil cancel
function is returned but this can cause a crash. Return a dummy function
instead.
Part ETags are not available after multipart finalizes, removing this
check as not useful.
Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
avoid re-read of xl.meta instead just use
the success criteria from PutObjectPart()
and check the ETag matches per Part, if
they match then the parts have been
successfully restored as is.
Signed-off-by: Harshavardhana <harsha@minio.io>
With this change, MinIO's ILM supports transitioning objects to a remote tier.
This change includes support for Azure Blob Storage, AWS S3 compatible object
storage incl. MinIO and Google Cloud Storage as remote tier storage backends.
Some new additions include:
- Admin APIs remote tier configuration management
- Simple journal to track remote objects to be 'collected'
This is used by object API handlers which 'mutate' object versions by
overwriting/replacing content (Put/CopyObject) or removing the version
itself (e.g DeleteObjectVersion).
- Rework of previous ILM transition to fit the new model
In the new model, a storage class (a.k.a remote tier) is defined by the
'remote' object storage type (one of s3, azure, GCS), bucket name and a
prefix.
* Fixed bugs, review comments, and more unit-tests
- Leverage inline small object feature
- Migrate legacy objects to the latest object format before transitioning
- Fix restore to particular version if specified
- Extend SharedDataDirCount to handle transitioned and restored objects
- Restore-object should accept version-id for version-suspended bucket (#12091)
- Check if remote tier creds have sufficient permissions
- Bonus minor fixes to existing error messages
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
* fix: pick valid FileInfo additionally based on dataDir
historically we have always relied on modTime
to be consistent and same, we can now add additional
reference to look for the same dataDir value.
A dataDir is the same for an object at a given point in
time for a given version, let's say a `null` version
is overwritten in quorum we do not by mistake pick
up the fileInfo's incorrectly.
* make sure to not preserve fi.Data
Signed-off-by: Harshavardhana <harsha@minio.io>
This is an optimization by reducing one extra system call,
and many network operations. This reduction should increase
the performance for small file workloads.
Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.
This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.
This helps reduce the chattiness for following calls..
- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.
- add API to get replication metrics
- add MRF worker to handle spill-over replication operations
- multiple issues found with replication
- fixes an issue when client sends a bucket
name with `/` at the end from SetRemoteTarget
API call make sure to trim the bucket name to
avoid any extra `/`.
- hold write locks in GetObjectNInfo during replication
to ensure that object version stack is not overwritten
while reading the content.
- add additional protection during WriteMetadata() to
ensure that we always write a valid FileInfo{} and avoid
ever writing empty FileInfo{} to the lowest layers.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
current master breaks this important requirement
we need to preserve legacyXLv1 format, this is simply
ignored and overwritten causing a myriad of issues
by leaving stale files on the namespace etc.
for now lets still use the two-phase approach of
writing to `tmp` and then renaming the content to
the actual namespace.
versionID is the one that needs to be preserved and as
well as overwritten in case of replication, transition
etc - dataDir is an ephemeral entity that changes
during overwrites - make sure that versionID is used
to save the object content.
this would break things if you are already running
the latest master, please wipe your current content
and re-do your setup after this change.
This PR adds deadlines per Write() calls, such
that slow drives are timed-out appropriately and
the overall responsiveness for Writes() is always
up to a predefined threshold providing applications
sustained latency even if one of the drives is slow
to respond.
Some deployments have low parity (EC:2), but we really do not need to
save our config data with the same parity configuration.
N/2 would be better to keep MinIO configurations intact when unexpected
a number of drives fail.
top-level options shouldn't be passed down for
GetObjectInfo() while verifying the objects in
different pools, this is to make sure that
we always get the value from the pool where
the object exists.
- using miniogo.ObjectInfo.UserMetadata is not correct
- using UserTags from Map->String() can change order
- ContentType comparison needs to be removed.
- Compare both lowercase and uppercase key names.
- do not silently error out constructing PutObjectOptions
if tag parsing fails
- avoid notification for empty object info, failed operations
should rely on valid objInfo for notification in all
situations
- optimize copyObject implementation, also introduce a new
replication event
- clone ObjectInfo() before scheduling for replication
- add additional headers for comparison
- remove strings.EqualFold comparison avoid unexpected bugs
- fix pool based proxying with multiple pools
- compare only specific metadata
Co-authored-by: Poorna Krishnamoorthy <poornas@users.noreply.github.com>
Previously we added heal trigger when bit-rot checks
failed, now extend that to support heal when parts
are not found either. This healing gets only triggered
if we can successfully decode the object i.e read
quorum is still satisfied for the object.
parentDirIsObject is not using set level understanding
to check for parent objects, without this it can lead to
objects that can actually reside on a separate set as
objects and would conflict.
Current implementation requires server pools to have
same erasure stripe sizes, to facilitate same SLA
and expectations.
This PR allows server pools to be variadic, i.e they
do not have to be same erasure stripe sizes - instead
they should have SLA for parity ratio.
If the parity ratio cannot be guaranteed by the new
server pool, the deployment is rejected i.e server
pool expansion is not allowed.
Synchronous replication can be enabled by setting the --sync
flag while adding a remote replication target.
This PR also adds proxying on GET/HEAD to another node in a
active-active replication setup in the event of a 404 on the current node.
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.
After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.
Benefits are seen from almost all objects from
1KiB - 32KiB
Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.
All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
The only purpose of check-dir flag in
ReadVersion is to return 404 when
an object has xl.meta but without data.
This is causing an extract call to the disk
which can be penalizing in case of busy system
where disks receive many concurrent access.
Optimizations include
- do not write the metacache block if the size of the
block is '0' and it is the first block - where listing
is attempted for a transient prefix, this helps to
avoid creating lots of empty metacache entries for
`minioMetaBucket`
- avoid the entire initialization sequence of cacheCh
, metacacheBlockWriter if we are simply going to skip
them when discardResults is set to true.
- No need to hold write locks while writing metacache
blocks - each block is unique, per bucket, per prefix
and also is written by a single node.
Additional cases handled
- fix address situations where healing is not
triggered on failed writes and deletes.
- consider object exists during listing when
metadata can be successfully decoded.
X-Minio-Replication-Delete-Status header shows the
status of the replication of a permanent delete of a version.
All GETs are disallowed and return 405 on this object version.
In the case of replicating delete markers.
X-Minio-Replication-DeleteMarker-Status shows the status
of replication, and would similarly return 405.
Additionally, this PR adds reporting of delete marker event completion
and updates documentation