It makes sense that a node that has multiple disks starts when one
disk fails, returning an i/o error for example. This commit will make this
faulty tolerance available in this specific use case.
Also adding an API to allow resyncing replication when
existing object replication is enabled and the remote target
is entirely lost. With the `mc replicate reset` command, the
objects that are eligible for replication as per the replication
config will be resynced to target if existing object replication
is enabled on the rule.
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
This PR fixes two bugs
- Remove fi.Data upon overwrite of objects from inlined-data to non-inlined-data
- Workaround for an existing bug on disk with latest releases to ignore fi.Data
and instead read from the disk for non-inlined-data
- Addtionally add a reserved metadata header to indicate data is inlined for
a given version.
Previous PR #12351 added functions to read from the reader
stream to reduce memory usage, use the same technique in
few other places where we are not interested in reading the
data part.
multi-disk clusters initialize buffer pools
per disk, this is perhaps expensive and perhaps
not useful, for a running server instance. As this
may disallow re-use of buffers across sets,
this change ensures that buffers across sets
can be re-used at drive level, this can reduce
quite a lot of memory on large drive setups.
gracefully start the server, if there are other drives
available - print enough information for administrator
to notice the errors in console.
Bonus: for really large streams use larger buffer for
writes.
A cache structure will be kept with a tree of usages.
The cache is a tree structure where each keeps track
of its children.
An uncompacted branch contains a count of the files
only directly at the branch level, and contains link to
children branches or leaves.
The leaves are "compacted" based on a number of properties.
A compacted leaf contains the totals of all files beneath it.
A leaf is only scanned once every dataUsageUpdateDirCycles,
rarer if the bloom filter for the path is clean and no lifecycles
are applied. Skipped leaves have their totals transferred from
the previous cycle.
A clean leaf will be included once every healFolderIncludeProb
for partial heal scans. When selected there is a one in
healObjectSelectProb that any object will be chosen for heal scan.
Compaction happens when either:
- The folder (and subfolders) contains less than dataScannerCompactLeastObject objects.
- The folder itself contains more than dataScannerCompactAtFolders folders.
- The folder only contains objects and no subfolders.
- A bucket root will never be compacted.
Furthermore, if a has more than dataScannerCompactAtChildren recursive
children (uncompacted folders) the tree will be recursively scanned and the
branches with the least number of objects will be compacted until the limit
is reached.
This ensures that any branch will never contain an unreasonable amount
of other branches, and also that small branches with few objects don't
take up unreasonable amounts of space.
Whenever a branch is scanned, it is assumed that it will be un-compacted
before it hits any of the above limits. This will make the branch rebalance
itself when scanned if the distribution of objects has changed.
TLDR; With current values: No bucket will ever have more than 10000
child nodes recursively. No single folder will have more than 2500 child
nodes by itself. All subfolders are compacted if they have less than 500
objects in them recursively.
We accumulate the (non-deletemarker) version count for paths as well,
since we are changing the structure anyway.
MRF does not detect when a node is disconnected and reconnected quickly
this change will ensure that MRF is alerted by comparing the last disk
reconnection timestamp with the last MRF check time.
Signed-off-by: Anis Elleuch <anis@min.io>
Co-authored-by: Klaus Post <klauspost@gmail.com>
as there is no automatic way to detect if there
is a root disk mounted on / or /var for the container
environments due to how the root disk information
is masked inside overlay root inside container.
this PR brings an environment variable to set
root disk size threshold manually to detect the
root disks in such situations.
Part ETags are not available after multipart finalizes, removing this
check as not useful.
Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
With this change, MinIO's ILM supports transitioning objects to a remote tier.
This change includes support for Azure Blob Storage, AWS S3 compatible object
storage incl. MinIO and Google Cloud Storage as remote tier storage backends.
Some new additions include:
- Admin APIs remote tier configuration management
- Simple journal to track remote objects to be 'collected'
This is used by object API handlers which 'mutate' object versions by
overwriting/replacing content (Put/CopyObject) or removing the version
itself (e.g DeleteObjectVersion).
- Rework of previous ILM transition to fit the new model
In the new model, a storage class (a.k.a remote tier) is defined by the
'remote' object storage type (one of s3, azure, GCS), bucket name and a
prefix.
* Fixed bugs, review comments, and more unit-tests
- Leverage inline small object feature
- Migrate legacy objects to the latest object format before transitioning
- Fix restore to particular version if specified
- Extend SharedDataDirCount to handle transitioned and restored objects
- Restore-object should accept version-id for version-suspended bucket (#12091)
- Check if remote tier creds have sufficient permissions
- Bonus minor fixes to existing error messages
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
This is an optimization by reducing one extra system call,
and many network operations. This reduction should increase
the performance for small file workloads.
EOF may be sent along with data so queue it up and
return it when the buffer is empty.
Also, when reading data without direct io don't add a buffer
that only results in extra memcopy.
Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.
This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.
This helps reduce the chattiness for following calls..
- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.
- add API to get replication metrics
- add MRF worker to handle spill-over replication operations
- multiple issues found with replication
- fixes an issue when client sends a bucket
name with `/` at the end from SetRemoteTarget
API call make sure to trim the bucket name to
avoid any extra `/`.
- hold write locks in GetObjectNInfo during replication
to ensure that object version stack is not overwritten
while reading the content.
- add additional protection during WriteMetadata() to
ensure that we always write a valid FileInfo{} and avoid
ever writing empty FileInfo{} to the lowest layers.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
current master breaks this important requirement
we need to preserve legacyXLv1 format, this is simply
ignored and overwritten causing a myriad of issues
by leaving stale files on the namespace etc.
for now lets still use the two-phase approach of
writing to `tmp` and then renaming the content to
the actual namespace.
versionID is the one that needs to be preserved and as
well as overwritten in case of replication, transition
etc - dataDir is an ephemeral entity that changes
during overwrites - make sure that versionID is used
to save the object content.
this would break things if you are already running
the latest master, please wipe your current content
and re-do your setup after this change.
using Lstat() is causing tiny memory allocations,
that are usually wasted and never used, instead
we can simply uses Access() call that does 0
memory allocations.
When an object is removed, its parent directory is inspected to check if
it is empty to remove if that is the case.
However, we can use os.Remove() directly since it is only able to remove
a file or an empty directory.
RenameData renames xl.meta and data dir and removes the parent directory
if empty, however, there is a duplicate check for empty dir, since the
parent dir of xl.meta is always the same as the data-dir.
on freshReads if drive returns errInvalidArgument, we
should simply turn-off DirectIO and read normally, there
are situations in k8s like environments where the drives
behave sporadically in a single deployment and may not
have been implemented properly to handle O_DIRECT for
reads.
This PR adds deadlines per Write() calls, such
that slow drives are timed-out appropriately and
the overall responsiveness for Writes() is always
up to a predefined threshold providing applications
sustained latency even if one of the drives is slow
to respond.
- write in o_dsync instead of o_direct for smaller
objects to avoid unaligned double Write() situations
that may arise for smaller objects < 128KiB
- avoid fallocate() as its not useful since we do not
use Append() semantics anymore, fallocate is not useful
for streaming I/O we can save on a syscall
- createFile() doesn't need to validate `bucket` name
with a Lstat() call since createFile() is only used
to write at `minioTmpBucket`
- use io.Copy() when writing unAligned writes to allow
usage of ReadFrom() from *os.File providing zero
buffer writes().
```
mc admin info --json
```
provides these details, for now, we shall eventually
expose this at Prometheus level eventually.
Co-authored-by: Harshavardhana <harsha@minio.io>
* Provide information on *actively* healing, buckets healed/queued, objects healed/failed.
* Add concurrent healing of multiple sets (typically on startup).
* Add bucket level resume, so restarts will only heal non-healed buckets.
* Print summary after healing a disk is done.
also re-use storage disks for all `mc admin server info`
calls as well, implement a new LocalStorageInfo() API
call at ObjectLayer to lookup local disks storageInfo
also fixes bugs where there were double calls to StorageInfo()
There was an io.LimitReader was missing for the 'length'
parameter for ranged requests, that would cause client to
get truncated responses and errors.
fixes#11651
most of the delete calls today spend time in
a blocking operation where multiple calls need
to be recursively sent to delete the objects,
instead we can use rename operation to atomically
move the objects from the namespace to `tmp/.trash`
we can schedule deletion of objects at this
location once in 15, 30mins and we can also add
wait times between each delete operation.
this allows us to make delete's faster as well
less chattier on the drives, each server runs locally
a groutine which would clean this up regularly.
Instead of using O_SYNC, we are better off using O_DSYNC
instead since we are only ever interested in data to be
persisted to disk not the associated filesystem metadata.
For reads we ask customers to turn off noatime, but instead
we can proactively use O_NOATIME flag to avoid atime updates
upon reads.
root-disk implemented currently had issues where root
disk partitions getting modified might race and provide
incorrect results, to avoid this lets rely again back on
DeviceID and match it instead.
In-case of containers `/data` is one such extra entity that
needs to be verified for root disk, due to how 'overlay'
filesystem works and the 'overlay' presents a completely
different 'device' id - using `/data` as another entity
for fallback helps because our containers describe 'VOLUME'
parameter that allows containers to automatically have a
virtual `/data` that points to the container root path this
can either be at `/` or `/var/lib/` (on different partition)
reduce the page-cache pressure completely by moving
the entire read-phase of our operations to O_DIRECT,
primarily this is going to be very useful for chatty
metadata operations such as listing, scanner, ilm, healing
like operations to avoid filling up the page-cache upon
repeated runs.
Skip notifications on objects that might have had
an error during deletion, this also avoids unnecessary
replication attempt on such objects.
Refactor some places to make sure that we have notified
the client before we
- notify
- schedule for replication
- lifecycle etc.
during rolling upgrade, provide a more descriptive error
message and discourage rolling upgrade in such situations,
allowing users to take action.
additionally also rename `slashpath -> pathutil` to avoid
a slighly mis-pronounced usage of `path` package.
When a directory object is presented as a `prefix`
param our implementation tend to only list objects
present common to the `prefix` than the `prefix` itself,
to mimic AWS S3 like flat key behavior this PR ensures
that if `prefix` is directory object, it should be
automatically considered to be part of the eventual
listing result.
fixes#11370
- using miniogo.ObjectInfo.UserMetadata is not correct
- using UserTags from Map->String() can change order
- ContentType comparison needs to be removed.
- Compare both lowercase and uppercase key names.
- do not silently error out constructing PutObjectOptions
if tag parsing fails
- avoid notification for empty object info, failed operations
should rely on valid objInfo for notification in all
situations
- optimize copyObject implementation, also introduce a new
replication event
- clone ObjectInfo() before scheduling for replication
- add additional headers for comparison
- remove strings.EqualFold comparison avoid unexpected bugs
- fix pool based proxying with multiple pools
- compare only specific metadata
Co-authored-by: Poorna Krishnamoorthy <poornas@users.noreply.github.com>
Currently, it is not possible to create a delete-marker when xl.meta
does not exist (no version is created for that object yet). This makes a
problem for replication and mc mirroring with versioning enabled.
This also follows S3 specification.
In-case user enables O_DIRECT for reads and backend does
not support it we shall proceed to turn it off instead
and print a warning. This validation avoids any unexpected
downtimes that users may incur.
Delete marker can have `metaSys` set to nil, that
can lead to crashes after the delete marker has
been healed.
Additionally also fix isObjectDangling check
for transitioned objects, that do not have parts
should be treated similar to Delete marker.
to verify moving content and preserving legacy content,
we have way to detect the objects through readdir()
this path is not necessary for most common cases on
newer setups, avoid readdir() to save multiple system
calls.
also fix the CheckFile behavior for most common
use case i.e without legacy format.
For objects with `N` prefix depth, this PR reduces `N` such network
operations by converting `CheckFile` into a single bulk operation.
Reduction in chattiness here would allow disks to be utilized more
cleanly, while maintaining the same functionality along with one
extra volume check stat() call is removed.
Update tests to test multiple sets scenario
Current implementation requires server pools to have
same erasure stripe sizes, to facilitate same SLA
and expectations.
This PR allows server pools to be variadic, i.e they
do not have to be same erasure stripe sizes - instead
they should have SLA for parity ratio.
If the parity ratio cannot be guaranteed by the new
server pool, the deployment is rejected i.e server
pool expansion is not allowed.
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.
After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.
Benefits are seen from almost all objects from
1KiB - 32KiB
Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.
All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
Use separate sync.Pool for writes/reads
Avoid passing buffers for io.CopyBuffer()
if the writer or reader implement io.WriteTo or io.ReadFrom
respectively then its useless for sync.Pool to allocate
buffers on its own since that will be completely ignored
by the io.CopyBuffer Go implementation.
Improve this wherever we see this to be optimal.
This allows us to be more efficient on memory usage.
```
385 // copyBuffer is the actual implementation of Copy and CopyBuffer.
386 // if buf is nil, one is allocated.
387 func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
388 // If the reader has a WriteTo method, use it to do the copy.
389 // Avoids an allocation and a copy.
390 if wt, ok := src.(WriterTo); ok {
391 return wt.WriteTo(dst)
392 }
393 // Similarly, if the writer has a ReadFrom method, use it to do the copy.
394 if rt, ok := dst.(ReaderFrom); ok {
395 return rt.ReadFrom(src)
396 }
```
From readahead package
```
// WriteTo writes data to w until there's no more data to write or when an error occurs.
// The return value n is the number of bytes written.
// Any error encountered during the write is also returned.
func (a *reader) WriteTo(w io.Writer) (n int64, err error) {
if a.err != nil {
return 0, a.err
}
n = 0
for {
err = a.fill()
if err != nil {
return n, err
}
n2, err := w.Write(a.cur.buffer())
a.cur.inc(n2)
n += int64(n2)
if err != nil {
return n, err
}
```
Rewrite parentIsObject() function. Currently if a client uploads
a/b/c/d, we always check if c, b, a are actual objects or not.
The new code will check with the reverse order and quickly quit if
the segment doesn't exist.
So if a, b, c in 'a/b/c' does not exist in the first place, then returns
false quickly.
The only purpose of check-dir flag in
ReadVersion is to return 404 when
an object has xl.meta but without data.
This is causing an extract call to the disk
which can be penalizing in case of busy system
where disks receive many concurrent access.
Fixes two problems
- Double healing when bitrot is enabled, instead heal attempt
once in applyActions() before lifecycle is applied.
- If applyActions() is successful and getSize() returns proper
value, then object is accounted for and should be removed
from the oldCache namespace map to avoid double heal attempts.
Tests environments (go test or manual testing) should always consider
the passed disks are root disks and should not rely on disk.IsRootDisk()
function. The reason is that this latter can return a false negative
when called in a busy system. However, returning a false negative will
only occur in a testing environment and not in a production, so we can
accept this trade-off for now.
till now we used to match the inode number of the root
drive and the drive path minio would use, if they match
we knew that its a root disk.
this may not be true in all situations such as running
inside a container environment where the container might
be mounted from a different partition altogether, root
disk detection might fail.
supports `mc admin config set <alias> heal sleep=100ms` to
enable more aggressive healing under certain times.
also optimize some areas that were doing extra checks than
necessary when bitrotscan was enabled, avoid double sleeps
make healing more predictable.
fixes#10497
X-Minio-Replication-Delete-Status header shows the
status of the replication of a permanent delete of a version.
All GETs are disallowed and return 405 on this object version.
In the case of replicating delete markers.
X-Minio-Replication-DeleteMarker-Status shows the status
of replication, and would similarly return 405.
Additionally, this PR adds reporting of delete marker event completion
and updates documentation
This PR adds transition support for ILM
to transition data to another MinIO target
represented by a storage class ARN. Subsequent
GET or HEAD for that object will be streamed from
the transition tier. If PostRestoreObject API is
invoked, the transitioned object can be restored for
duration specified to the source cluster.
Delete marker replication is implemented for V2
configuration specified in AWS spec (though AWS
allows it only in the V1 configuration).
This PR also brings in a MinIO only extension of
replicating permanent deletes, i.e. deletes specifying
version id are replicated to target cluster.
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.
Additionally, remove all unused walkers from the storage layer
WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers
used by `io.CopyBuffer` even if they are pooled.
Since all writers appear to write byte buffers, just send those
instead and write directly. The files are opened through the `os`
package so they have no special properties anyway.
This removes the alloc and copy for each operation.
REST sends content length so a precise alloc can be made.