Each Put, List, Multipart operations heavily rely on making
GetBucketInfo() call to verify if bucket exists or not on
a regular basis. This has a large performance cost when there
are tons of servers involved.
We did optimize this part by vectorizing the bucket calls,
however its not enough, beyond 100 nodes and this becomes
fairly visible in terms of performance.
- Move RenameFile to websockets
- Move ReadAll that is primarily is used
for reading 'format.json' to to websockets
- Optimize DiskInfo calls, and provide a way
to make a NoOp DiskInfo call.
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups.
To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap.
Once this has been enabled, you need to also set `list_quorum`.
```
~ mc admin config set alias/ api list_quorum=auto`
```
A new debugging tool is available to check for any missing counters.
Optionally allows customers to enable
- Enable an external cache to catch GET/HEAD responses
- Enable skipping disks that are slow to respond in GET/HEAD
when we have already achieved a quorum
Bonus: allow replication to attempt Deletes/Puts when
the remote returns quorum errors of some kind, this is
to ensure that MinIO can rewrite the namespace with the
latest version that exists on the source.
This PR adds a WebSocket grid feature that allows servers to communicate via
a single two-way connection.
There are two request types:
* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
roundtrips with small payloads.
* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
which allows for different combinations of full two-way streams with an initial payload.
Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically on the server names.
Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.
If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte
slices or use a higher-level generics abstraction.
There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.
The request path can be changed to a new one for any protocol changes.
First, all servers create a "Manager." The manager must know its address
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it given
the remote address using.
```
func (m *Manager) Connection(host string) *Connection
```
All serverside handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight
requests and responses must also be given for streaming requests.
The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.
* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
performs a single request and returns the result. Any deadline provided on the request is
forwarded to the server, and canceling the context will make the function return at once.
* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
will initiate a remote call and send the initial payload.
```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
//The appropriate error will be returned.
type Stream struct {
// Responses from the remote server.
// Channel will be closed after an error or when the remote closes.
// All responses *must* be read by the caller until either an error is returned or the channel is closed.
// Canceling the context will cause the context cancellation error to be returned.
Responses <-chan Response
// Requests sent to the server.
// If the handler is defined with 0 incoming capacity this will be nil.
// Channel *must* be closed to signal the end of the stream.
// If the request context is canceled, the stream will no longer process requests.
Requests chan<- []byte
}
type Response struct {
Msg []byte
Err error
}
```
There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
.metacache objects are transient in nature, and are better left to
use page-cache effectively to avoid using more IOPs on the disks.
this allows for incoming calls to be not taxed heavily due to
multiple large batch listings.
xl.meta gets written and never rolled back, however
we definitely need to validate the state that is
persisted on the disk, if there are inconsistencies
- more than write quorum we should return an error
to the client
- if write quorum was achieved however there are
inconsistent xl.meta's we should simply trigger
an MRF on them
Do completely independent multipart uploads.
In distributed mode, a lock was held to merge each multipart
upload as it was added. This lock was highly contested and
retries are expensive (timewise) in distributed mode.
Instead, each part adds its metadata information uniquely.
This eliminates the per object lock required for each to merge.
The metadata is read back and merged by "CompleteMultipartUpload"
without locks when constructing final object.
Co-authored-by: Harshavardhana <harsha@minio.io>
data shards were wrong due to a healing bug
reported in #13803 mainly with unaligned object
sizes.
This PR is an attempt to automatically avoid
these shards, with available information about
the `xl.meta` and actually disk mtime.
This unit allows users to limit the maximum number of noncurrent
versions of an object.
To enable this rule you need the following *ilm.json*
```
cat >> ilm.json <<EOF
{
"Rules": [
{
"ID": "test-max-noncurrent",
"Status": "Enabled",
"Filter": {
"Prefix": "user-uploads/"
},
"NoncurrentVersionExpiration": {
"MaxNoncurrentVersions": 5
}
}
]
}
EOF
mc ilm import myminio/mybucket < ilm.json
```
deleting collection of versions belonging
to same object, we can avoid re-reading
the xl.meta from the disk instead purge
all the requested versions in-memory,
the tradeoff is to allocate a map to de-dup
the versions, allow disks to be read only
once per object.
additionally reduce the data transfer between
nodes by shortening msgp data values.
- avoid extra lookup for 'xl.meta' since we are
definitely sure that it doesn't exist.
- use this in newMultipartUpload() as well
- also additionally do not write with O_DSYNC
to avoid loading the drives, instead create
'xl.meta' for listing operations without
O_DSYNC since these are ephemeral objects.
- do the same with newMultipartUpload() since
it gets synced when the PutObjectPart() is
attempted, we do not need to tax newMultipartUpload()
instead.
we will allow situations such as
```
a/b/1.txt
a/b
```
and
```
a/b
a/b/1.txt
```
we are going to document that this usecase is
not supported and we will never support it, if
any application does this users have to delete
the top level parent to make sure namespace is
accessible at lower level.
rest of the situations where the prefixes get
created across sets are supported as is.
Download files from *any* bucket/path as an encrypted zip file.
The key is included in the response but can be separated so zip
and the key doesn't have to be sent on the same channel.
Requires https://github.com/minio/pkg/pull/6
- Adds versioning support for S3 based remote tiers that have versioning
enabled. This ensures that when reading or deleting we specify the specific
version ID of the object. In case of deletion, this is important to ensure that
the object version is actually deleted instead of simply being marked for
deletion.
- Stores the remote object's version id in the tier-journal. Tier-journal file
version is not bumped up as serializing the new struct version is
compatible with old journals without the remote object version id.
- `storageRESTVersion` is bumped up as FileInfo struct now includes a
`TransitionRemoteVersionID` member.
- Azure and GCS support for this feature will be added subsequently.
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
This PR fixes two bugs
- Remove fi.Data upon overwrite of objects from inlined-data to non-inlined-data
- Workaround for an existing bug on disk with latest releases to ignore fi.Data
and instead read from the disk for non-inlined-data
- Addtionally add a reserved metadata header to indicate data is inlined for
a given version.
This is an optimization by reducing one extra system call,
and many network operations. This reduction should increase
the performance for small file workloads.
Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.
This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.
This helps reduce the chattiness for following calls..
- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
Previously we added heal trigger when bit-rot checks
failed, now extend that to support heal when parts
are not found either. This healing gets only triggered
if we can successfully decode the object i.e read
quorum is still satisfied for the object.
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.
After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.
Benefits are seen from almost all objects from
1KiB - 32KiB
Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.
All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
The only purpose of check-dir flag in
ReadVersion is to return 404 when
an object has xl.meta but without data.
This is causing an extract call to the disk
which can be penalizing in case of busy system
where disks receive many concurrent access.
Fixes two problems
- Double healing when bitrot is enabled, instead heal attempt
once in applyActions() before lifecycle is applied.
- If applyActions() is successful and getSize() returns proper
value, then object is accounted for and should be removed
from the oldCache namespace map to avoid double heal attempts.
Do listings with prefix filter when bloom filter is dirty.
This will forward the prefix filter to the lister which will make it
only scan the folders/objects with the specified prefix.
If we have a clean bloom filter we try to build a more generally
useful cache so in that case, we will list all objects/folders.
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.
Additionally, remove all unused walkers from the storage layer
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c
Gist of improvements:
* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no
longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.