This PR is an attempt to make this configurable
as not all situations have same level of tolerable
delta, i.e disks are replaced days apart or even
hours.
There is also a possibility that nodes have drifted
in time, when NTP is not configured on the system.
data shards were wrong due to a healing bug
reported in #13803 mainly with unaligned object
sizes.
This PR is an attempt to automatically avoid
these shards, with available information about
the `xl.meta` and actually disk mtime.
- Allows setting a role policy parameter when configuring OIDC provider
- When role policy is set, the server prints a role ARN usable in STS API requests
- The given role policy is applied to STS API requests when the roleARN parameter is provided.
- Service accounts for role policy are also possible and work as expected.
FileInfo quorum shouldn't be passed down, instead
inferred after obtaining a maximally occurring FileInfo.
This PR also changes other functions that rely on
wrong quorum calculation.
Update tests as well to handle the proper requirement. All
these changes are needed when migrating from older deployments
where we used to set N/2 quorum for reads to EC:4 parity in
newer releases.
dataDir loosely based on maxima is incorrect and does not
work in all situations such as disks in the following order
- xl.json migration to xl.meta there may be partial xl.json's
leftover if some disks are not yet connected when the disk
is yet to come up, since xl.json mtime and xl.meta is
same the dataDir maxima doesn't work properly leading to
quorum issues.
- its also possible that XLV1 might be true among the disks
available, make sure to keep FileInfo based on common quorum
and skip unexpected disks with the older data format.
Also, this PR tests upgrade from older to a newer release if the
data is readable and matches the checksum.
NOTE: this is just initial work we can build on top of this to do further tests.
markRootDisksAsDown() relies on disk info even if the
disk is unformatted. Therefore, we should always return
DiskInfo data even when DiskInfo storage API returns
errUnformattedDisk
backend-encrypted doesn't need to be explicitly healed anymore
since this file is deleted upon upgrade and migration to the
KMS based encrypted config/IAM credentials.
https://github.com/minio/console takes over the functionality for the
future object browser development
Signed-off-by: Harshavardhana <harsha@minio.io>
With this change, MinIO's ILM supports transitioning objects to a remote tier.
This change includes support for Azure Blob Storage, AWS S3 compatible object
storage incl. MinIO and Google Cloud Storage as remote tier storage backends.
Some new additions include:
- Admin APIs remote tier configuration management
- Simple journal to track remote objects to be 'collected'
This is used by object API handlers which 'mutate' object versions by
overwriting/replacing content (Put/CopyObject) or removing the version
itself (e.g DeleteObjectVersion).
- Rework of previous ILM transition to fit the new model
In the new model, a storage class (a.k.a remote tier) is defined by the
'remote' object storage type (one of s3, azure, GCS), bucket name and a
prefix.
* Fixed bugs, review comments, and more unit-tests
- Leverage inline small object feature
- Migrate legacy objects to the latest object format before transitioning
- Fix restore to particular version if specified
- Extend SharedDataDirCount to handle transitioned and restored objects
- Restore-object should accept version-id for version-suspended bucket (#12091)
- Check if remote tier creds have sufficient permissions
- Bonus minor fixes to existing error messages
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
- write in o_dsync instead of o_direct for smaller
objects to avoid unaligned double Write() situations
that may arise for smaller objects < 128KiB
- avoid fallocate() as its not useful since we do not
use Append() semantics anymore, fallocate is not useful
for streaming I/O we can save on a syscall
- createFile() doesn't need to validate `bucket` name
with a Lstat() call since createFile() is only used
to write at `minioTmpBucket`
- use io.Copy() when writing unAligned writes to allow
usage of ReadFrom() from *os.File providing zero
buffer writes().
major performance improvements in range GETs to avoid large
read amplification when ranges are tiny and random
```
-------------------
Operation: GET
Operations: 142014 -> 339421
Duration: 4m50s -> 4m56s
* Average: +139.41% (+1177.3 MiB/s) throughput, +139.11% (+658.4) obj/s
* Fastest: +125.24% (+1207.4 MiB/s) throughput, +132.32% (+612.9) obj/s
* 50% Median: +139.06% (+1175.7 MiB/s) throughput, +133.46% (+660.9) obj/s
* Slowest: +203.40% (+1267.9 MiB/s) throughput, +198.59% (+753.5) obj/s
```
TTFB from 10MiB BlockSize
```
* First Access TTFB: Avg: 81ms, Median: 61ms, Best: 20ms, Worst: 2.056s
```
TTFB from 1MiB BlockSize
```
* First Access TTFB: Avg: 22ms, Median: 21ms, Best: 8ms, Worst: 91ms
```
Full object reads however do see a slight change which won't be
noticeable in real world, so not doing any comparisons
TTFB still had improvements with full object reads with 1MiB
```
* First Access TTFB: Avg: 68ms, Median: 35ms, Best: 11ms, Worst: 1.16s
```
v/s
TTFB with 10MiB
```
* First Access TTFB: Avg: 388ms, Median: 98ms, Best: 20ms, Worst: 4.156s
```
This change should affect all new uploads, previous uploads should
continue to work with business as usual. But dramatic improvements can
be seen with these changes.
Use separate sync.Pool for writes/reads
Avoid passing buffers for io.CopyBuffer()
if the writer or reader implement io.WriteTo or io.ReadFrom
respectively then its useless for sync.Pool to allocate
buffers on its own since that will be completely ignored
by the io.CopyBuffer Go implementation.
Improve this wherever we see this to be optimal.
This allows us to be more efficient on memory usage.
```
385 // copyBuffer is the actual implementation of Copy and CopyBuffer.
386 // if buf is nil, one is allocated.
387 func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
388 // If the reader has a WriteTo method, use it to do the copy.
389 // Avoids an allocation and a copy.
390 if wt, ok := src.(WriterTo); ok {
391 return wt.WriteTo(dst)
392 }
393 // Similarly, if the writer has a ReadFrom method, use it to do the copy.
394 if rt, ok := dst.(ReaderFrom); ok {
395 return rt.ReadFrom(src)
396 }
```
From readahead package
```
// WriteTo writes data to w until there's no more data to write or when an error occurs.
// The return value n is the number of bytes written.
// Any error encountered during the write is also returned.
func (a *reader) WriteTo(w io.Writer) (n int64, err error) {
if a.err != nil {
return 0, a.err
}
n = 0
for {
err = a.fill()
if err != nil {
return n, err
}
n2, err := w.Write(a.cur.buffer())
a.cur.inc(n2)
n += int64(n2)
if err != nil {
return n, err
}
```
This refactor is done for few reasons below
- to avoid deadlocks in scenarios when number
of nodes are smaller < actual erasure stripe
count where in N participating local lockers
can lead to deadlocks across systems.
- avoids expiry routines to run 1000 of separate
network operations and routes per disk where
as each of them are still accessing one single
local entity.
- it is ideal to have since globalLockServer
per instance.
- In a 32node deployment however, each server
group is still concentrated towards the
same set of lockers that partipicate during
the write/read phase, unlike previous minio/dsync
implementation - this potentially avoids send
32 requests instead we will still send at max
requests of unique nodes participating in a
write/read phase.
- reduces overall chattiness on smaller setups.
hotfix target will fetch the release tag prior to the latest commit and create a binary
with the same release tag plus '.hotfix' suffix
e.g. RELEASE.2020-12-03T05-49-24Z.hotfix
Bonus fix during versioning merge one of the PR was missing
the offline/online disk count fix from #9801 port it correctly
over to the master branch from release.
Additionally, add versionID support for MRF
Fixes#9910Fixes#9931
This PR also tries to simplify the approach taken in
object-locking implementation by preferential treatment
given towards full validation.
This in-turn has fixed couple of bugs related to
how policy should have been honored when ByPassGovernance
is provided.
Simplifies code a bit, but also duplicates code intentionally
for clarity due to complex nature of object locking
implementation.
Change distributed locking to allow taking bulk locks
across objects, reduces usually 1000 calls to 1.
Also allows for situations where multiple clients sends
delete requests to objects with following names
```
{1,2,3,4,5}
```
```
{5,4,3,2,1}
```
will block and ensure that we do not fail the request
on each other.
we don't need to validateFormats again once we have obtained
reference format, because it is possible that at this stage
another server is doing a disk heal during startup, once
in a while due to delays we get false positives and our
server doesn't start.
Format in quorum as reference format can be assumed as valid
and we proceed further, until and unless HealFormat re-inits
the disks after a successful heal.
Also use separate port for healing tests to avoid any
conflicts with regular build testing.
Fixes#8884
This is to fix a situation where an object name incorrectly
is sent with '//' in its path heirarchy, we should reject
such object names because they may be hashed to a set where
the object might not originally belong because, this can
cause situations where once object is uploaded we cannot
delete it anymore.
Fixes#8873
Zone abstraction of object layer was returning `nil`
incorrectly under situations where disk healing is
not required. Returning `nil` is considered as healing
successful, which leads to unexpected ReloadFormat()
peer notification calls during startup.
This PR fixes this behavior properly for zones.
Use reference format to initialize lockers
during startup, also handle `nil` for NetLocker
in dsync and remove *errorLocker* implementation
Add further tuning parameters such as
- DialTimeout is now 15 seconds from 30 seconds
- KeepAliveTimeout is not 20 seconds, 5 seconds
more than default 15 seconds
- ResponseHeaderTimeout to 10 seconds
- ExpectContinueTimeout is reduced to 3 seconds
- DualStack is enabled by default remove setting
it to `true`
- Reduce IdleConnTimeout to 30 seconds from
1 minute to avoid idleConn build up
Fixes#8773
Fixes an issue reported by @klauspost and @vadmeste
This PR also allows users to expand their clusters
from single node XL deployment to distributed mode.
Currently, the backend minio server uses the same data directory
as the mint test itself, causing `s3 sync` to fail often.
Now `minio` backend will use a different data directory `/data`
instead of `/mint/data`
Simplify the cmd/http package overall by removing
custom plain text v/s tls connection detection, by
migrating to go1.12 and choose minimum version
to be go1.12
Also remove all the vendored deps, since they
are not useful anymore.
Also add a cross compile script to test always cross
compilation for some well known platforms and architectures
, we support out of box compilation of these platforms even
if we don't make an official release build.
This script is to avoid regressions in this area when we
add platform dependent code.
This PR is the first set of changes to move the config
to the backend, the changes use the existing `config.json`
allows it to be migrated such that we can save it in on
backend disks.
In future releases, we will slowly migrate out of the
current architecture.
Fixes#6182
Add compile time GOROOT path to the list of prefix
of file paths to be removed.
Add webhandler function names to the slice that
stores function names to terminate logging.
Since go1.8 GOPATH is not required to set prior, as
it defaults to "${HOME}/go" we only need to check if
go tool detected GOPATH correctly. If yes then we
proceed if not we fail.
This PR implements an object layer which
combines input erasure sets of XL layers
into a unified namespace.
This object layer extends the existing
erasure coded implementation, it is assumed
in this design that providing > 16 disks is
a static configuration as well i.e if you started
the setup with 32 disks with 4 sets 8 disks per
pack then you would need to provide 4 sets always.
Some design details and restrictions:
- Objects are distributed using consistent ordering
to a unique erasure coded layer.
- Each pack has its own dsync so locks are synchronized
properly at pack (erasure layer).
- Each pack still has a maximum of 16 disks
requirement, you can start with multiple
such sets statically.
- Static sets set of disks and cannot be
changed, there is no elastic expansion allowed.
- Static sets set of disks and cannot be
changed, there is no elastic removal allowed.
- ListObjects() across sets can be noticeably
slower since List happens on all servers,
and is merged at this sets layer.
Fixes#5465Fixes#5464Fixes#5461Fixes#5460Fixes#5459Fixes#5458Fixes#5460Fixes#5488Fixes#5489Fixes#5497Fixes#5496
It was decided that we will be deprecating ARM support
for minio builds. ARM users should simply compile from source.
Additionally 32bit version of Linux, Windows and FreeBSD (64bit)
are deprecated.
Wait for remote hosts to resolve instead of failing on first host
resolution error, when running in Kubernetes or Docker environment.
Note that
- Waiting is based on exponential back-off mechanism
- If run as a binary, server fails if remote host is not resolvable
This is needed because in orchestration platforms like Kubernetes, remote
hosts are started sequentially and all the hosts are not up initially,
though they are expected to come up in a short time frame
It is difficult to identify a cap on the waiting time due to
non-deterministic nature of infrastructure platforms, so the server waits
infinitely for the hosts to come up, while logging the error messages to
the console.
Fixes: https://github.com/minio/minio/issues/4669
Swarm routes traffic only to containers that report healthy status,
while Minio in distributed mode needs to talk to other peers before
it can respond to healthcheck probe. As the Minio containers are not
able to talk to each other, distributed Minio is not getting started
on Docker Swarm.
With this PR, Minio Healthcheck report healthy status for initial
120s enough for distributed Minio to start. After that normal
Healthcheck resumes. Also changed the healthcheck method name
in accordance with Google shell styleguide.
Fixes: https://github.com/minio/minio/issues/4761