This commit fixes a bug in the put-part
implementation. The SSE headers should be
set as specified by AWS - See:
https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPart.html
Now, the MinIO server should set SSE-C headers,
like `x-amz-server-side-encryption-customer-algorithm`.
Fixes#11991
locks can get relinquished when Read() sees io.EOF
leading to prematurely closing of the readers
concurrent writes on the same object can have
undesired consequences here when these locks
are relinquished.
EOF may be sent along with data so queue it up and
return it when the buffer is empty.
Also, when reading data without direct io don't add a buffer
that only results in extra memcopy.
Multiple disks from the same set would be writing concurrently.
```
WARNING: DATA RACE
Write at 0x00c002100ce0 by goroutine 166:
github.com/minio/minio/cmd.(*erasureSets).connectDisks.func1()
d:/minio/minio/cmd/erasure-sets.go:254 +0x82f
Previous write at 0x00c002100ce0 by goroutine 129:
github.com/minio/minio/cmd.(*erasureSets).connectDisks.func1()
d:/minio/minio/cmd/erasure-sets.go:254 +0x82f
Goroutine 166 (running) created at:
github.com/minio/minio/cmd.(*erasureSets).connectDisks()
d:/minio/minio/cmd/erasure-sets.go:210 +0x324
github.com/minio/minio/cmd.(*erasureSets).monitorAndConnectEndpoints()
d:/minio/minio/cmd/erasure-sets.go:288 +0x244
Goroutine 129 (finished) created at:
github.com/minio/minio/cmd.(*erasureSets).connectDisks()
d:/minio/minio/cmd/erasure-sets.go:210 +0x324
github.com/minio/minio/cmd.(*erasureSets).monitorAndConnectEndpoints()
d:/minio/minio/cmd/erasure-sets.go:288 +0x244
```
This commit adds a self-test for all bitrot algorithms:
- SHA-256
- BLAKE2b
- HighwayHash
The self-test computes an incremental checksum of pseudo-random
messages. If a bitrot algorithm implementation stops working on
some CPU architecture or with a certain Go version this self-test
will prevent the server from starting and silently corrupting data.
For additional context see: minio/highwayhash#19
Metrics calculation was accumulating inital usage across all nodes
rather than using initial usage only once.
Also fixing:
- bug where all peer traffic was going to the same node.
- reset counters when replication status changes from
PENDING -> FAILED
This PR fixes
- close leaking bandwidth report channel leakage
- remove the closer requirement for bandwidth monitor
instead if Read() fails remember the error and return
error for all subsequent reads.
- use locking for usage-cache.bin updates, with inline
data we cannot afford to have concurrent writes to
usage-cache.bin corrupting xl.meta
implementation in #11949 only catered from single
node, but we need cluster metrics by capturing
from all peers. introduce bucket stats API that
will be used for capturing in-line bucket usage
as well eventually
Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.
This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.
This helps reduce the chattiness for following calls..
- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.
- add API to get replication metrics
- add MRF worker to handle spill-over replication operations
- multiple issues found with replication
- fixes an issue when client sends a bucket
name with `/` at the end from SetRemoteTarget
API call make sure to trim the bucket name to
avoid any extra `/`.
- hold write locks in GetObjectNInfo during replication
to ensure that object version stack is not overwritten
while reading the content.
- add additional protection during WriteMetadata() to
ensure that we always write a valid FileInfo{} and avoid
ever writing empty FileInfo{} to the lowest layers.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
current master breaks this important requirement
we need to preserve legacyXLv1 format, this is simply
ignored and overwritten causing a myriad of issues
by leaving stale files on the namespace etc.
for now lets still use the two-phase approach of
writing to `tmp` and then renaming the content to
the actual namespace.
versionID is the one that needs to be preserved and as
well as overwritten in case of replication, transition
etc - dataDir is an ephemeral entity that changes
during overwrites - make sure that versionID is used
to save the object content.
this would break things if you are already running
the latest master, please wipe your current content
and re-do your setup after this change.
upgrading from 2yr old releases is expected to work,
the issue was we were missing checksum info to be
passed down to newBitrotReader() for whole bitrot
calculation
Ensure that we don't use potentially broken algorithms for critical functions, whether it be a runtime problem or implementation problem for a specific platform.
It is inefficient to decide to heal an object before checking its
lifecycle for expiration or transition. This commit will just reverse
the order of action: evaluate lifecycle and heal only if asked and
lifecycle resulted a NoneAction.
replication didn't work as expected when deletion of
delete markers was requested in DeleteMultipleObjects
API, this is due to incorrect lookup elements being
used to look for delete markers.
This allows us to speed up or slow down sleeps
between multiple scanner cycles, helps in testing
as well as some deployments might want to run
scanner more frequently.
This change is also dynamic can be applied on
a running cluster, subsequent cycles pickup
the newly set value.
using Lstat() is causing tiny memory allocations,
that are usually wasted and never used, instead
we can simply uses Access() call that does 0
memory allocations.
This feature brings in support for auto extraction
of objects onto MinIO's namespace from an incoming
tar gzipped stream, the only expected metadata sent
by the client is to set `snowball-auto-extract`.
All the contents from the tar stream are saved as
folders and objects on the namespace.
fixes#8715
service accounts were not inheriting parent policies
anymore due to refactors in the PolicyDBGet() from
the latest release, fix this behavior properly.
The local node name is heavily used in tracing, create a new global
variable to store it. Multiple goroutines can access it since it won't be
changed later.
In #11888 we observe a lot of running, WalkDir calls.
There doesn't appear to be any listerners for these calls, so they should be aborted.
Ensure that WalkDir aborts when upstream cancels the request.
Fixes#11888
The background healing can return NoSuchUpload error, the reason is that
healing code can return errFileNotFound with three parameters. Simplify
the code by returning exact errUploadNotFound error in multipart code.
Also ensure that a typed error is always returned whatever the number of
parameters because it is better than showing internal error.
For large objects taking more than '3 minutes' response
times in a single PUT operation can timeout prematurely
as 'ResponseHeader' timeout hits for 3 minutes. Avoid
this by keeping the connection active during CreateFile
phase.
baseDirFromPrefix(prefix) for object names without
parent directory incorrectly uses empty path, leading
to long listing at various paths that are not useful
for healing - avoid this listing completely if "baseDir"
returns empty simple use the "prefix" as is.
this improves startup performance significantly
For large objects taking more than '3 minutes' response
times in a single PUT operation can timeout prematurely
as 'ResponseHeader' timeout hits for 3 minutes. Avoid
this by keeping the connection active during CreateFile
phase.
some SDKs might incorrectly send duplicate
entries for keys such as "conditions", Go
stdlib unmarshal for JSON does not support
duplicate keys - instead skips the first
duplicate and only preserves the last entry.
This can lead to issues where a policy JSON
while being valid might not properly apply
the required conditions, allowing situations
where POST policy JSON would end up allowing
uploads to unauthorized buckets and paths.
This PR fixes this properly.
This commit adds a `MarshalText` implementation
to the `crypto.Context` type.
The `MarshalText` implementation replaces the
`WriteTo` and `AppendTo` implementation.
It is slightly slower than the `AppendTo` implementation
```
goos: darwin
goarch: arm64
pkg: github.com/minio/minio/cmd/crypto
BenchmarkContext_AppendTo/0-elems-8 381475698 2.892 ns/op 0 B/op 0 allocs/op
BenchmarkContext_AppendTo/1-elems-8 17945088 67.54 ns/op 0 B/op 0 allocs/op
BenchmarkContext_AppendTo/3-elems-8 5431770 221.2 ns/op 72 B/op 2 allocs/op
BenchmarkContext_AppendTo/4-elems-8 3430684 346.7 ns/op 88 B/op 2 allocs/op
```
vs.
```
BenchmarkContext/0-elems-8 135819834 8.658 ns/op 2 B/op 1 allocs/op
BenchmarkContext/1-elems-8 13326243 89.20 ns/op 128 B/op 1 allocs/op
BenchmarkContext/3-elems-8 4935301 243.1 ns/op 200 B/op 3 allocs/op
BenchmarkContext/4-elems-8 2792142 428.2 ns/op 504 B/op 4 allocs/op
goos: darwin
```
However, the `AppendTo` benchmark used a pre-allocated buffer. While
this improves its performance it does not match the actual usage of
`crypto.Context` which is passed to a `KMS` and always encoded into
a newly allocated buffer.
Therefore, this change seems acceptable since it should not impact the
actual performance but reduces the overall code for Context marshaling.
When an object is removed, its parent directory is inspected to check if
it is empty to remove if that is the case.
However, we can use os.Remove() directly since it is only able to remove
a file or an empty directory.
RenameData renames xl.meta and data dir and removes the parent directory
if empty, however, there is a duplicate check for empty dir, since the
parent dir of xl.meta is always the same as the data-dir.
on freshReads if drive returns errInvalidArgument, we
should simply turn-off DirectIO and read normally, there
are situations in k8s like environments where the drives
behave sporadically in a single deployment and may not
have been implemented properly to handle O_DIRECT for
reads.
This PR adds deadlines per Write() calls, such
that slow drives are timed-out appropriately and
the overall responsiveness for Writes() is always
up to a predefined threshold providing applications
sustained latency even if one of the drives is slow
to respond.
MRF was starting to heal when it receives a disk connection event, which
is not good when a node having multiple disks reconnects to the cluster.
Besides, MRF needs Remove healing option to remove stale files.
- write in o_dsync instead of o_direct for smaller
objects to avoid unaligned double Write() situations
that may arise for smaller objects < 128KiB
- avoid fallocate() as its not useful since we do not
use Append() semantics anymore, fallocate is not useful
for streaming I/O we can save on a syscall
- createFile() doesn't need to validate `bucket` name
with a Lstat() call since createFile() is only used
to write at `minioTmpBucket`
- use io.Copy() when writing unAligned writes to allow
usage of ReadFrom() from *os.File providing zero
buffer writes().
```
mc admin info --json
```
provides these details, for now, we shall eventually
expose this at Prometheus level eventually.
Co-authored-by: Harshavardhana <harsha@minio.io>
This commit fixes a security issue in the signature v4 chunked
reader. Before, the reader returned unverified data to the caller
and would only verify the chunk signature once it has encountered
the end of the chunk payload.
Now, the chunk reader reads the entire chunk into an in-memory buffer,
verifies the signature and then returns data to the caller.
In general, this is a common security problem. We verifying data
streams, the verifier MUST NOT return data to the upper layers / its
callers as long as it has not verified the current data chunk / data
segment:
```
func (r *Reader) Read(buffer []byte) {
if err := r.readNext(r.internalBuffer); err != nil {
return err
}
if err := r.verify(r.internalBuffer); err != nil {
return err
}
copy(buffer, r.internalBuffer)
}
```
For operations that require the object to exist make it possible to
detect if the file isn't found in *any* pool.
This will allow these to return the error early without having to re-check.
Cases where we have applications making request
for `//` in object names make sure that all
are normalized to `/` and all such requests that
are prefixed '/' are removed. To ensure a
consistent view from all operations.
Some deployments have low parity (EC:2), but we really do not need to
save our config data with the same parity configuration.
N/2 would be better to keep MinIO configurations intact when unexpected
a number of drives fail.
This commit disables the Hashicorp Vault
support but provides a way to temp. enable
it via the `MINIO_KMS_VAULT_DEPRECATION=off`
Vault support has been deprecated long ago
and this commit just requires users to take
action if they maintain a Vault integration.
major performance improvements in range GETs to avoid large
read amplification when ranges are tiny and random
```
-------------------
Operation: GET
Operations: 142014 -> 339421
Duration: 4m50s -> 4m56s
* Average: +139.41% (+1177.3 MiB/s) throughput, +139.11% (+658.4) obj/s
* Fastest: +125.24% (+1207.4 MiB/s) throughput, +132.32% (+612.9) obj/s
* 50% Median: +139.06% (+1175.7 MiB/s) throughput, +133.46% (+660.9) obj/s
* Slowest: +203.40% (+1267.9 MiB/s) throughput, +198.59% (+753.5) obj/s
```
TTFB from 10MiB BlockSize
```
* First Access TTFB: Avg: 81ms, Median: 61ms, Best: 20ms, Worst: 2.056s
```
TTFB from 1MiB BlockSize
```
* First Access TTFB: Avg: 22ms, Median: 21ms, Best: 8ms, Worst: 91ms
```
Full object reads however do see a slight change which won't be
noticeable in real world, so not doing any comparisons
TTFB still had improvements with full object reads with 1MiB
```
* First Access TTFB: Avg: 68ms, Median: 35ms, Best: 11ms, Worst: 1.16s
```
v/s
TTFB with 10MiB
```
* First Access TTFB: Avg: 388ms, Median: 98ms, Best: 20ms, Worst: 4.156s
```
This change should affect all new uploads, previous uploads should
continue to work with business as usual. But dramatic improvements can
be seen with these changes.
A group can have multiple policies, a user subscribed to readwrite &
diagnostics can perform S3 operations & admin operations as well.
However, the current code only returns one policy for one group.
This commit disables SHA-3 for OpenID when building a
FIPS-140 2 compatible binary. While SHA-3 is a
crypto. hash function accepted by NIST there is no
FIPS-140 2 compliant implementation available when
using the boringcrypto Go branch.
Therefore, SHA-3 must not be used when building
a FIPS-140 2 binary.
* Provide information on *actively* healing, buckets healed/queued, objects healed/failed.
* Add concurrent healing of multiple sets (typically on startup).
* Add bucket level resume, so restarts will only heal non-healed buckets.
* Print summary after healing a disk is done.
currently when one of the peer is down, the
drives from that peer are reported as '0/0'
offline instead we should capture/filter the
drives from the peer and populate it appropriately
such that `mc admin info` displays correct info.
This commit adds the `FromContentMD5` function to
parse a client-provided content-md5 as ETag.
Further, it also adds multipart ETag computation
for future needs.
prometheus metrics was using total disks instead
of online disk count, when disks were down, this
PR fixes this and also adds a new metric for
total_disk_count
Creating notification events for replica creation
is not particularly useful to send as the notification
event generated at source already includes replication
completion events.
For applications using replica cluster as failover, avoiding
duplicate notifications for replica event will allow seamless
failover.
also re-use storage disks for all `mc admin server info`
calls as well, implement a new LocalStorageInfo() API
call at ObjectLayer to lookup local disks storageInfo
also fixes bugs where there were double calls to StorageInfo()
While starting up a request that needs all IAM data will start another load operation if the first on startup hasn't finished. This slows down both operations.
Block these requests until initial load has completed.
Blocking calls will be ListPolicies, ListUsers, ListServiceAccounts, ListGroups - and the calls that eventually trigger these. These will wait for the initial load to complete.
Fixes issue seen in #11305
Implicit permissions for any user is to be allowed to
change their own password, we need to restrict this
further even if there is an implicit allow for this
scenario - we have to honor Deny statements if they
are specified.
ListObjectVersions would skip past the object in the marker when version id is specified.
Make `listPath` return the object with the marker and truncate it if not needed.
Avoid having to parse unintended objects to find a version marker.
The previous code was iterating over replies from peers and assigning
pool numbers to them, thus missing to add it for the local server.
Fixed by iterating over the server properties of all the servers
including the local one.
There was an io.LimitReader was missing for the 'length'
parameter for ranged requests, that would cause client to
get truncated responses and errors.
fixes#11651
The base profiles contains no valuable data, don't record them.
Reduce block rate by 2 orders of magnitude, should still capture just as valuable data with less CPU strain.
most of the delete calls today spend time in
a blocking operation where multiple calls need
to be recursively sent to delete the objects,
instead we can use rename operation to atomically
move the objects from the namespace to `tmp/.trash`
we can schedule deletion of objects at this
location once in 15, 30mins and we can also add
wait times between each delete operation.
this allows us to make delete's faster as well
less chattier on the drives, each server runs locally
a groutine which would clean this up regularly.
This commit removes the `GetObject` method
from the `ObjectLayer` interface.
The `GetObject` method is not longer used by
the HTTP handlers implementing the high-level
S3 semantics. Instead, they use the `GetObjectNInfo`
method which returns both, an object handle as well
as the object metadata.
Therefore, it is no longer necessary that a concrete
`ObjectLayer` implements `GetObject`.
store the cache in-memory instead of disks to avoid large
write amplifications for list heavy workloads, store in
memory instead and let it auto expire.
This commit replaces the usage of
github.com/minio/sha256-simd with crypto/sha256
of the standard library in all non-performance
critical paths.
This is necessary for FIPS 140-2 compliance which
requires that all crypto. primitives are implemented
by a FIPS-validated module.
Go can use the Google FIPS module. The boringcrypto
branch of the Go standard library uses the BoringSSL
FIPS module to implement crypto. primitives like AES
or SHA256.
We only keep github.com/minio/sha256-simd when computing
the content-SHA256 of an object. Therefore, this commit
relies on a build tag `fips`.
When MinIO is compiled without the `fips` flag it will
use github.com/minio/sha256-simd. When MinIO is compiled
with the fips flag (go build --tags "fips") then MinIO
uses crypto/sha256 to compute the content-SHA256.
Instead of using O_SYNC, we are better off using O_DSYNC
instead since we are only ever interested in data to be
persisted to disk not the associated filesystem metadata.
For reads we ask customers to turn off noatime, but instead
we can proactively use O_NOATIME flag to avoid atime updates
upon reads.
This removes the Content-MD5 response header on Range requests in Azure
Gateway mode. The partial content MD5 doesn't match the full object MD5
in metadata.
This commit adds a new package `etag` for dealing
with S3 ETags.
Even though ETag is often viewed as MD5 checksum of
an object, handling S3 ETags correctly is a surprisingly
complex task. While it is true that the ETag corresponds
to the MD5 for the most basic S3 API operations, there are
many exceptions in case of multipart uploads or encryption.
In worse, some S3 clients expect very specific behavior when
it comes to ETags. For example, some clients expect that the
ETag is a double-quoted string and fail otherwise.
Non-AWS compliant ETag handling has been a source of many bugs
in the past.
Therefore, this commit adds a dedicated `etag` package that provides
functionality for parsing, generating and converting S3 ETags.
Further, this commit removes the ETag computation from the `hash`
package. Instead, the `hash` package (i.e. `hash.Reader`) should
focus only on computing and verifying the content-sha256.
One core feature of this commit is to provide a mechanism to
communicate a computed ETag from a low-level `io.Reader` to
a high-level `io.Reader`.
This problem occurs when an S3 server receives a request and
has to compute the ETag of the content. However, the server
may also wrap the initial body with several other `io.Reader`,
e.g. when encrypting or compressing the content:
```
reader := Encrypt(Compress(ETag(content)))
```
In such a case, the ETag should be accessible by the high-level
`io.Reader`.
The `etag` provides a mechanism to wrap `io.Reader` implementations
such that the `ETag` can be accessed by a type-check.
This technique is applied to the PUT, COPY and Upload handlers.
server startup code expects the object layer to properly
convert error into a proper type, so that in situations when
servers are coming up and quorum is not available servers
wait on each other.
using isServerResolvable for expiration can lead to chicken
and egg problems, a lock might expire knowingly when server
is booting up causing perpetual locks getting expired.
- In username search filter and username format variables we support %s for
replacing with the username.
- In group search filter we support %s for username and %d for the full DN of
the username.
root-disk implemented currently had issues where root
disk partitions getting modified might race and provide
incorrect results, to avoid this lets rely again back on
DeviceID and match it instead.
In-case of containers `/data` is one such extra entity that
needs to be verified for root disk, due to how 'overlay'
filesystem works and the 'overlay' presents a completely
different 'device' id - using `/data` as another entity
for fallback helps because our containers describe 'VOLUME'
parameter that allows containers to automatically have a
virtual `/data` that points to the container root path this
can either be at `/` or `/var/lib/` (on different partition)
also additionally make sure errors during deserializer closes
the reader with right error type such that Write() end
actually see the final error, this avoids a waitGroup usage
and waiting.
reduce the page-cache pressure completely by moving
the entire read-phase of our operations to O_DIRECT,
primarily this is going to be very useful for chatty
metadata operations such as listing, scanner, ilm, healing
like operations to avoid filling up the page-cache upon
repeated runs.
since we have changed our default envs to MINIO_ROOT_USER,
MINIO_ROOT_PASSWORD this was not supported by minio-go
credentials package, update minio-go to v7.0.10 for this
support. This also addresses few bugs related to users
had to specify AWS_ACCESS_KEY_ID as well to authenticate
with their S3 backend if they only used MINIO_ROOT_USER.
We can use this metric to check if there are too many S3 clients in the
queue and could explain why some of those S3 clients are timing out.
```
minio_s3_requests_waiting_total{server="127.0.0.1:9000"} 9981
```
If max_requests is 10000 then there is a strong possibility that clients
are timing out because of the queue deadline.
If the periodic `case <-t.C:` save gets held up for a long time it will end up
synchronize all disk writes for saving the caches.
We add jitter to per set writes so they don't sync up and don't hold a
lock for the write, since it isn't needed anyway.
If an outage prevents writes for a long while we also add individual
waits for each disk in case there was a queue.
Furthermore limit the number of buffers kept to 2GiB, since this could get
huge in large clusters. This will not act as a hard limit but should be enough
for normal operation.
in the case of active-active replication.
This PR also has the following changes:
- add docs on replication design
- fix corner case of completing versioned delete on a delete marker
when the target is down and `mc rm --vid` is performed repeatedly. Instead
the version should still be retained in the `PENDING|FAILED` state until
replication sync completes.
- remove `s3:Replication:OperationCompletedReplication` and
`s3:Replication:OperationFailedReplication` from ObjectCreated
events type
currently crawler waits for an entire readdir call to
return until it processes usage, lifecycle, replication
and healing - instead we should pass the applicator all
the way down to avoid building any special stack for all
the contents in a single directory.
This allows for
- no need to remember the entire list of entries per directory
before applying the required functions
- no need to wait for entire readdir() call to finish before
applying the required functions
This PR fixes
- allow 's3:versionid` as a valid conditional for
Get,Put,Tags,Object locking APIs
- allow additional headers missing for object APIs
- allow wildcard based action matching
additionally simply timedValue to have RWMutex
to avoid concurrent calls to DiskInfo() getting
serialized, this has an effect on all calls that
use GetDiskInfo() on the same disks.
Such as getOnlineDisks, getOnlineDisksWithoutHealing
The function used for getting host information
(host.SensorsTemperaturesWithContext) returns warnings in some cases.
Returning with error in such cases means we miss out on the other useful
information already fetched (os info).
If the OS info has been succesfully fetched, it should always be
included in the output irrespective of whether the other data (CPU
sensors, users) could be fetched or not.
To avoid large delays in metacache cleanup, use rename
instead of recursive delete calls, renames are cheaper
move the content to minioMetaTmpBucket and then cleanup
this folder once in 24hrs instead.
If the new cache can replace an existing one, we should
let it replace since that is currently being saved anyways,
this avoids pile up of 1000's of metacache entires for
same listing calls that are not necessary to be stored
on disk.
Skip notifications on objects that might have had
an error during deletion, this also avoids unnecessary
replication attempt on such objects.
Refactor some places to make sure that we have notified
the client before we
- notify
- schedule for replication
- lifecycle etc.
continuation of PR#11491 for multiple server pools and
bi-directional replication.
Moving proxying for GET/HEAD to handler level rather than
server pool layer as this was also causing incorrect proxying
of HEAD.
Also fixing metadata update on CopyObject - minio-go was not passing
source version ID in X-Amz-Copy-Source header
filter out relevant objects for each pool to
avoid calling, further delete operations on
subsequent pools where some of these objects
might not exist.
This is mainly useful to avoid situations
during bi-directional bucket replication.
top-level options shouldn't be passed down for
GetObjectInfo() while verifying the objects in
different pools, this is to make sure that
we always get the value from the pool where
the object exists.
This change moves away from a unified constructor for plaintext and encrypted
usage. NewPutObjReader is simplified for the plain-text reader use. For
encrypted reader use, WithEncryption should be called on an initialized PutObjReader.
Plaintext:
func NewPutObjReader(rawReader *hash.Reader) *PutObjReader
The hash.Reader is used to provide payload size and md5sum to the downstream
consumers. This is different from the previous version in that there is no need
to pass nil values for unused parameters.
Encrypted:
func WithEncryption(encReader *hash.Reader,
key *crypto.ObjectKey) (*PutObjReader, error)
This method sets up encrypted reader along with the key to seal the md5sum
produced by the plain-text reader (already setup when NewPutObjReader was
called).
Usage:
```
pReader := NewPutObjReader(rawReader)
// ... other object handler code goes here
// Prepare the encrypted hashed reader
pReader, err = pReader.WithEncryption(encReader, objEncKey)
```
Collection of SMART information doesn't work in certain scenarios e.g.
in a container based setup. In such cases, instead of returning an error
(without any data), we should only set the error on the smartinfo
struct, so that other important drive hw info like device, mountpoint,
etc is retained in the output.
during rolling upgrade, provide a more descriptive error
message and discourage rolling upgrade in such situations,
allowing users to take action.
additionally also rename `slashpath -> pathutil` to avoid
a slighly mis-pronounced usage of `path` package.
When lifecycle decides to Delete an object and not a version in a
versioned bucket, the code should create a delete marker and not
removing the scanned version.
This commit fixes the issue.
The connections info of the processes takes up a huge amount of space,
and is not important for adding any useful health checks. Removing it
will significantly reduce the size of the subnet health report.
- lock maintenance loop was incorrectly sleeping
as well as using ticker badly, leading to
extra expiration routines getting triggered
that could flood the network.
- multipart upload cleanup should be based on
timer instead of ticker, to ensure that long
running jobs don't get triggered twice.
- make sure to get right lockers for object name
Replaces #11449
Does concurrent healing but limits concurrency to 50 buckets.
Aborts on first error.
`errgroup.Group` is extended to facilitate this in a generic way.
We use multiple libraries in health info, but the returned error does
not indicate exactly what library call is failing, hence adding named
tags to returned errors whenever applicable.
After recent refactor where lifecycle started to rely on ObjectInfo to
make decisions, it turned out there are some issues calculating
Successor Modtime and NumVersions, hence the lifecycle is not working as
expected in a versioning bucket in some cases.
This commit fixes the behavior.
When a directory object is presented as a `prefix`
param our implementation tend to only list objects
present common to the `prefix` than the `prefix` itself,
to mimic AWS S3 like flat key behavior this PR ensures
that if `prefix` is directory object, it should be
automatically considered to be part of the eventual
listing result.
fixes#11370
few places were still using legacy call GetObject()
which was mainly designed for client response writer,
use GetObjectNInfo() for internal calls instead.
for some flaky networks this may be too fast of a value
choose a defensive value, and let this be addressed
properly in a new refactor of dsync with renewal logic.
Also enable faster fallback delay to cater for misconfigured
IPv6 servers
refer
- https://golang.org/pkg/net/#Dialer
- https://tools.ietf.org/html/rfc6555
- using miniogo.ObjectInfo.UserMetadata is not correct
- using UserTags from Map->String() can change order
- ContentType comparison needs to be removed.
- Compare both lowercase and uppercase key names.
- do not silently error out constructing PutObjectOptions
if tag parsing fails
- avoid notification for empty object info, failed operations
should rely on valid objInfo for notification in all
situations
- optimize copyObject implementation, also introduce a new
replication event
- clone ObjectInfo() before scheduling for replication
- add additional headers for comparison
- remove strings.EqualFold comparison avoid unexpected bugs
- fix pool based proxying with multiple pools
- compare only specific metadata
Co-authored-by: Poorna Krishnamoorthy <poornas@users.noreply.github.com>
This commit refactors the SSE implementation and add
S3-compatible SSE-KMS context handling.
SSE-KMS differs from SSE-S3 in two main aspects:
1. The client can request a particular key and
specify a KMS context as part of the request.
2. The ETag of an SSE-KMS encrypted object is not
the MD5 sum of the object content.
This commit only focuses on the 1st aspect.
A client can send an optional SSE context when using
SSE-KMS. This context is remembered by the S3 server
such that the client does not have to specify the
context again (during multipart PUT / GET / HEAD ...).
The crypto. context also includes the bucket/object
name to prevent renaming objects at the backend.
Now, AWS S3 behaves as following:
- If the user does not provide a SSE-KMS context
it does not store one - resp. does not include
the SSE-KMS context header in the response (e.g. HEAD).
- If the user specifies a SSE-KMS context without
the bucket/object name then AWS stores the exact
context the client provided but adds the bucket/object
name internally. The response contains the KMS context
without the bucket/object name.
- If the user specifies a SSE-KMS context with
the bucket/object name then AWS again stores the exact
context provided by the client. The response contains
the KMS context with the bucket/object name.
This commit implements this behavior w.r.t. SSE-KMS.
However, as of now, no such object can be created since
the server rejects SSE-KMS encryption requests.
This commit is one stepping stone for SSE-KMS support.
Co-authored-by: Harshavardhana <harsha@minio.io>
```
minio server /tmp/disk{1...4}
mc mb myminio/testbucket/
mkdir -p /tmp/disk{1..4}/testbucket/test-prefix/
```
This would end up being listed in the current
master, this PR fixes this situation.
If a directory is a leaf dir we should it
being listed, since it cannot be deleted anymore
with DeleteObject, DeleteObjects() API calls
because we natively support directories now.
Avoid listing it and let healing purge this folder
eventually in the background.
MINIO_API_REPLICATION_WORKERS env.var and
`mc admin config set api` allow number of replication
workers to be configurable. Defaults to half the number
of cpus available.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
This commit fixes a bug in the S3 gateway that causes
GET requests to fail when the object is encrypted by the
gateway itself.
The gateway was not able to GET the object since it always
specified a `If-Match` pre-condition checking that the object
ETag matches an expected ETag - even for encrypted ETags.
The problem is that an encrypted ETag will never match the ETag
computed by the backend causing the `If-Match` pre-condition
to fail.
This commit fixes this by not sending an `If-Match` header when
the ETag is encrypted. This is acceptable because:
1. A gateway-encrypted object consists of two objects at the backend
and there is no way to provide a concurrency-safe implementation
of two consecutive S3 GETs in the deployment model of the S3
gateway.
Ref: S3 gateways are self-contained and isolated - and there may
be multiple instances at the same time (no lock across
instances).
2. Even if the data object changes (concurrent PUT) while gateway
A has download the metadata object (but not issued the GET to
the data object => data race) then we don't return invalid data
to the client since the decryption (of the currently uploaded data)
will fail - given the metadata of the previous object.
Currently, it is not possible to create a delete-marker when xl.meta
does not exist (no version is created for that object yet). This makes a
problem for replication and mc mirroring with versioning enabled.
This also follows S3 specification.
This commit deprecates the native Hashicorp Vault
support and removes the legacy Vault documentation.
The native Hashicorp Vault documentation is marked as
outdated and deprecated for over a year now. We give
another 6 months before we start removing Hashicorp Vault
support and show a deprecation warning when a MinIO server
starts with a native Vault configuration.
currently we had a restriction where older setups would
need to follow previous style of "stripe" count being same
expansion, we can relax that instead newer pools can be
expanded for older setups with newer constraints of
common parity ratio.
In PR #11165 due to incorrect proxying for 2
way replication even when the object was not
yet replicated
Additionally, fix metadata comparisons when
deciding to do full replication vs metadata copy.
fixes#11340
Previously we added heal trigger when bit-rot checks
failed, now extend that to support heal when parts
are not found either. This healing gets only triggered
if we can successfully decode the object i.e read
quorum is still satisfied for the object.
under large deployments loading credentials might be
time consuming, while this is okay and we will not
respond quickly for `mc admin user list` like queries
but it is possible to support `mc admin user info`
just like how we handle authentication by fetching
the user directly from persistent store.
additionally support service accounts properly,
reloaded from etcd during watch() - this was missing
This PR is also half way remedy for #11305
This change allows the MinIO server to be configured with a special (read-only)
LDAP account to perform user DN lookups.
The following configuration parameters are added (along with corresponding
environment variables) to LDAP identity configuration (under `identity_ldap`):
- lookup_bind_dn / MINIO_IDENTITY_LDAP_LOOKUP_BIND_DN
- lookup_bind_password / MINIO_IDENTITY_LDAP_LOOKUP_BIND_PASSWORD
- user_dn_search_base_dn / MINIO_IDENTITY_LDAP_USER_DN_SEARCH_BASE_DN
- user_dn_search_filter / MINIO_IDENTITY_LDAP_USER_DN_SEARCH_FILTER
This lookup-bind account is a service account that is used to lookup the user's
DN from their username provided in the STS API. When configured, searching for
the user DN is enabled and configuration of the base DN and filter for search is
required. In this "lookup-bind" mode, the username format is not checked and must
not be specified. This feature is to support Active Directory setups where the
DN cannot be simply derived from the username.
When the lookup-bind is not configured, the old behavior is enabled: the minio
server performs LDAP lookups as the LDAP user making the STS API request and the
username format is checked and configuring it is required.
STS tokens can be obtained by using local APIs
once the remote JWT token is presented, current
code was not validating the incoming token in the
first place and was incorrectly making a network
operation using that token.
For the most part this always works without issues,
but under adversarial scenarios it exposes client
to hand-craft a request that can reach internal
services without authentication.
This kind of proxying should be avoided before
validating the incoming token.
The user can see __XLDIR__ prefix in mc admin heal when the command
heals an empty object with a trailing slash. This commit decodes the
name of the object before sending it back to the upper level.
In-case user enables O_DIRECT for reads and backend does
not support it we shall proceed to turn it off instead
and print a warning. This validation avoids any unexpected
downtimes that users may incur.
DNSCache dialer is a global value initialized in
init(), whereas `go` keeps `var =` before `init()`
, also we don't need to keep proxy routers as
global entities - register the forwarder as
necessary to avoid crashes.
```
mc admin config set alias/ storage_class standard=EC:3
```
should only succeed if parity ratio is valid for all
server pools, if not we should fail proactively.
This PR also needs to bring other changes now that
we need to cater for variadic drive counts per pool.
Bonus fixes also various bugs reproduced with
- GetObjectWithPartNumber()
- CopyObjectPartWithOffsets()
- CopyObjectWithMetadata()
- PutObjectPart,PutObject with truncated streams
Trace data can be rather large and compresses fine.
Compress profile data in zip files:
```
277.895.314 before.profiles.zip
152.800.318 after.profiles.zip
```
Delete marker can have `metaSys` set to nil, that
can lead to crashes after the delete marker has
been healed.
Additionally also fix isObjectDangling check
for transitioned objects, that do not have parts
should be treated similar to Delete marker.
During expansion we need to validate if
- new deployment is expanded with newer constraints
- existing deployment is expanded with older constraints
- multiple server pools rejected if they have different
deploymentID and distribution algo
to verify moving content and preserving legacy content,
we have way to detect the objects through readdir()
this path is not necessary for most common cases on
newer setups, avoid readdir() to save multiple system
calls.
also fix the CheckFile behavior for most common
use case i.e without legacy format.
If an erasure set had a drive replacement recently, we don't
need to attempt healing on another drive with in the same erasure
set - this would ensure we do not double heal the same content
and also prioritizes usage for such an erasure set to be calculated
sooner.
For objects with `N` prefix depth, this PR reduces `N` such network
operations by converting `CheckFile` into a single bulk operation.
Reduction in chattiness here would allow disks to be utilized more
cleanly, while maintaining the same functionality along with one
extra volume check stat() call is removed.
Update tests to test multiple sets scenario
Fixes support for using multiple base DNs for user search in the LDAP directory
allowing users from different subtrees in the LDAP hierarchy to request
credentials.
- The username in the produced credentials is now the full DN of the LDAP user
to disambiguate users in different base DNs.
parentDirIsObject is not using set level understanding
to check for parent objects, without this it can lead to
objects that can actually reside on a separate set as
objects and would conflict.
Current implementation requires server pools to have
same erasure stripe sizes, to facilitate same SLA
and expectations.
This PR allows server pools to be variadic, i.e they
do not have to be same erasure stripe sizes - instead
they should have SLA for parity ratio.
If the parity ratio cannot be guaranteed by the new
server pool, the deployment is rejected i.e server
pool expansion is not allowed.
This ensures that all the prometheus monitoring and usage
trackers to avoid alerts configured, although we cannot
support v1 to v2 here - we can v2 to v3.
A lot of memory is consumed when uploading small files in parallel, use
the default upload parameters and add MINIO_AZURE_UPLOAD_CONCURRENCY for
users to tweak.
Synchronous replication can be enabled by setting the --sync
flag while adding a remote replication target.
This PR also adds proxying on GET/HEAD to another node in a
active-active replication setup in the event of a 404 on the current node.
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.
After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.
Benefits are seen from almost all objects from
1KiB - 32KiB
Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.
All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.