In-case user enables O_DIRECT for reads and backend does
not support it we shall proceed to turn it off instead
and print a warning. This validation avoids any unexpected
downtimes that users may incur.
DNSCache dialer is a global value initialized in
init(), whereas `go` keeps `var =` before `init()`
, also we don't need to keep proxy routers as
global entities - register the forwarder as
necessary to avoid crashes.
```
mc admin config set alias/ storage_class standard=EC:3
```
should only succeed if parity ratio is valid for all
server pools, if not we should fail proactively.
This PR also needs to bring other changes now that
we need to cater for variadic drive counts per pool.
Bonus fixes also various bugs reproduced with
- GetObjectWithPartNumber()
- CopyObjectPartWithOffsets()
- CopyObjectWithMetadata()
- PutObjectPart,PutObject with truncated streams
Trace data can be rather large and compresses fine.
Compress profile data in zip files:
```
277.895.314 before.profiles.zip
152.800.318 after.profiles.zip
```
Delete marker can have `metaSys` set to nil, that
can lead to crashes after the delete marker has
been healed.
Additionally also fix isObjectDangling check
for transitioned objects, that do not have parts
should be treated similar to Delete marker.
During expansion we need to validate if
- new deployment is expanded with newer constraints
- existing deployment is expanded with older constraints
- multiple server pools rejected if they have different
deploymentID and distribution algo
to verify moving content and preserving legacy content,
we have way to detect the objects through readdir()
this path is not necessary for most common cases on
newer setups, avoid readdir() to save multiple system
calls.
also fix the CheckFile behavior for most common
use case i.e without legacy format.
If an erasure set had a drive replacement recently, we don't
need to attempt healing on another drive with in the same erasure
set - this would ensure we do not double heal the same content
and also prioritizes usage for such an erasure set to be calculated
sooner.
For objects with `N` prefix depth, this PR reduces `N` such network
operations by converting `CheckFile` into a single bulk operation.
Reduction in chattiness here would allow disks to be utilized more
cleanly, while maintaining the same functionality along with one
extra volume check stat() call is removed.
Update tests to test multiple sets scenario
Fixes support for using multiple base DNs for user search in the LDAP directory
allowing users from different subtrees in the LDAP hierarchy to request
credentials.
- The username in the produced credentials is now the full DN of the LDAP user
to disambiguate users in different base DNs.
parentDirIsObject is not using set level understanding
to check for parent objects, without this it can lead to
objects that can actually reside on a separate set as
objects and would conflict.
Current implementation requires server pools to have
same erasure stripe sizes, to facilitate same SLA
and expectations.
This PR allows server pools to be variadic, i.e they
do not have to be same erasure stripe sizes - instead
they should have SLA for parity ratio.
If the parity ratio cannot be guaranteed by the new
server pool, the deployment is rejected i.e server
pool expansion is not allowed.
This ensures that all the prometheus monitoring and usage
trackers to avoid alerts configured, although we cannot
support v1 to v2 here - we can v2 to v3.
A lot of memory is consumed when uploading small files in parallel, use
the default upload parameters and add MINIO_AZURE_UPLOAD_CONCURRENCY for
users to tweak.
Synchronous replication can be enabled by setting the --sync
flag while adding a remote replication target.
This PR also adds proxying on GET/HEAD to another node in a
active-active replication setup in the event of a 404 on the current node.
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.
After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.
Benefits are seen from almost all objects from
1KiB - 32KiB
Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.
All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
30 seconds white spaces is long for some setups which time out when no
read activity in short time, reduce the subnet health white space ticker
to 5 seconds, since it has no cost at all.
Use separate sync.Pool for writes/reads
Avoid passing buffers for io.CopyBuffer()
if the writer or reader implement io.WriteTo or io.ReadFrom
respectively then its useless for sync.Pool to allocate
buffers on its own since that will be completely ignored
by the io.CopyBuffer Go implementation.
Improve this wherever we see this to be optimal.
This allows us to be more efficient on memory usage.
```
385 // copyBuffer is the actual implementation of Copy and CopyBuffer.
386 // if buf is nil, one is allocated.
387 func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
388 // If the reader has a WriteTo method, use it to do the copy.
389 // Avoids an allocation and a copy.
390 if wt, ok := src.(WriterTo); ok {
391 return wt.WriteTo(dst)
392 }
393 // Similarly, if the writer has a ReadFrom method, use it to do the copy.
394 if rt, ok := dst.(ReaderFrom); ok {
395 return rt.ReadFrom(src)
396 }
```
From readahead package
```
// WriteTo writes data to w until there's no more data to write or when an error occurs.
// The return value n is the number of bytes written.
// Any error encountered during the write is also returned.
func (a *reader) WriteTo(w io.Writer) (n int64, err error) {
if a.err != nil {
return 0, a.err
}
n = 0
for {
err = a.fill()
if err != nil {
return n, err
}
n2, err := w.Write(a.cur.buffer())
a.cur.inc(n2)
n += int64(n2)
if err != nil {
return n, err
}
```
with changes present to automatically throttle crawler
at runtime, there is no need to have an environment
value to disable crawling. crawling is a fundamental
piece for healing, lifecycle and many other features
there is no good reason anyone would need to disable
this on a production system.
* Apply suggestions from code review
globalSubscribers.NumSubscribers() is heavily used in S3 requests and it
uses mutex, use atomic.Load instead since it is faster
Co-authored-by: Anis Elleuch <anis@min.io>
Rewrite parentIsObject() function. Currently if a client uploads
a/b/c/d, we always check if c, b, a are actual objects or not.
The new code will check with the reverse order and quickly quit if
the segment doesn't exist.
So if a, b, c in 'a/b/c' does not exist in the first place, then returns
false quickly.
The only purpose of check-dir flag in
ReadVersion is to return 404 when
an object has xl.meta but without data.
This is causing an extract call to the disk
which can be penalizing in case of busy system
where disks receive many concurrent access.
mc admin trace does not show the correct handler name in the output: it
is printing `maxClients' for all handlers. The reason is that the wrong
order of handler wrapping.
Fixes two problems
- Double healing when bitrot is enabled, instead heal attempt
once in applyActions() before lifecycle is applied.
- If applyActions() is successful and getSize() returns proper
value, then object is accounted for and should be removed
from the oldCache namespace map to avoid double heal attempts.
main reason is that HealObjects starts a recursive listing
for each object, this can be a really really long time on
large namespaces instead avoid recursive listing just
perform HealObject() instead at the prefix.
delete's already handle purging dangling content, we
don't need to achieve this by doing recursive listing,
this in-turn can delay crawling significantly.
Optimizations include
- do not write the metacache block if the size of the
block is '0' and it is the first block - where listing
is attempted for a transient prefix, this helps to
avoid creating lots of empty metacache entries for
`minioMetaBucket`
- avoid the entire initialization sequence of cacheCh
, metacacheBlockWriter if we are simply going to skip
them when discardResults is set to true.
- No need to hold write locks while writing metacache
blocks - each block is unique, per bucket, per prefix
and also is written by a single node.
current implementation was incorrect, it in-fact
assumed only read quorum number of disks. in-fact
that value is only meant for read quorum good entries
from all online disks.
This PR fixes this behavior properly.
This commit refactors the code in `cmd/crypto`
and separates SSE-S3, SSE-C and SSE-KMS.
This commit should not cause any behavior change
except for:
- `IsRequested(http.Header)`
which now returns the requested type {SSE-C, SSE-S3,
SSE-KMS} and does not consider SSE-C copy headers.
However, SSE-C copy headers alone are anyway not valid.
Additional cases handled
- fix address situations where healing is not
triggered on failed writes and deletes.
- consider object exists during listing when
metadata can be successfully decoded.
always find the right set of online peers for remote listing,
this may have an effect on listing if the server is down - we
should do this to avoid always performing transient operations
on bucket->peerClient that is permanently or down for a long
period.
additionally also configure http2 healthcheck
values to quickly detect unstable connections
and let them timeout.
also use single transport for proxying requests
with missing nextMarker with delimiter based listing,
top level prefixes beyond 4500 or max-keys value
wouldn't be sent back for client to ask for the next
batch.
reproduced at a customer deployment, create prefixes
as shown below
```
for year in $(seq 2017 2020)
do
for month in {01..12}
do for day in {01..31}
do
mc -q cp file myminio/testbucket/dir/day_id=$year-$month-$day/;
done
done
done
```
Then perform
```
aws s3api --profile minio --endpoint-url http://localhost:9000 list-objects \
--bucket testbucket --prefix dir/ --delimiter / --max-keys 1000
```
You shall see missing NextMarker, this would disallow listing beyond max-keys
requested and also disallow beyond 4500 (maxKeyObjectList) prefixes being listed
because client wouldn't know the NextMarker available.
This PR addresses this situation properly by making the implementation
more spec compatible. i.e NextMarker in-fact can be either an object, a prefix
with delimiter depending on the input operation.
This issue was introduced after the list caching changes and has been present
for a while.
issue was introduced in #11106 the following
pattern
<-t.C // timer fired
if !t.Stop() {
<-t.C // timer hangs
}
Seems to hang at the last `t.C` line, this
issue happens because a fired timer cannot be
Stopped() anymore and t.Stop() returns `false`
leading to confusing state of usage.
Refactor the code such that use timers appropriately
with exact requirements in place.
Both Azure & S3 gateways call for object information before returning
the stream of the object, however, the object content/length could be
modified meanwhile, which means it can return a corrupted object.
Use ETag to ensure that the object was not modified during the GET call
crawler should only ListBuckets once not for each serverPool,
buckets are same across all pools, across sets and ListBuckets
always returns an unified view, once list buckets returns
sort it by create time to scan the latest buckets earlier
with the assumption that latest buckets would have lesser
content than older buckets allowing them to be scanned faster
and also to be able to provide more closer to latest view.