Allow each crawler operation to sleep up to 10 seconds on very heavily loaded systems.
This will of course lower the minimum crawler speed, but should be more effective at backing off on busy systems.
Delete marker replication is implemented for V2
configuration specified in AWS spec (though AWS
allows it only in the V1 configuration).
This PR also brings in a MinIO-only extension of
replicating permanent deletes, i.e. deletes specifying
a version ID are replicated to the target cluster.
This will make the health check clients 'silent'.
Use `IsNetworkOrHostDown` to determine if the network is OK, so it mimics the functionality in the actual client.
this is needed to make sure we heal the
users, policies and bucket metadata right away, as
we do listing based on the list cache which only lists
'3' sufficiently good drives; to avoid possibly
losing access to these users upon upgrade, make
sure to heal them.
If a scanning server shuts down unexpectedly we may have "successful" caches that are incomplete on a set.
In this case mark the cache with an error so it will no longer be handed out.
Add `MINIO_API_EXTEND_LIST_CACHE_LIFE` that will extend
the life of generated caches by the configured duration.
This changes caches to remain valid until no updates have been
received for the specified time plus a fixed margin.
This also changes the caches from being invalidated when the *first*
set finishes until the *last* set has finished plus the specified time
has passed.
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.
Additionally, remove all unused walkers from the storage layer
A new field called AccessKey is added to the ReqInfo struct and populated.
Because ReqInfo is added to the context, this allows the AccessKey to be
accessed from 3rd-party code, such as a custom ObjectLayer.
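A minimal sketch of how 3rd-party code might read this back (assuming the `logger.GetReqInfo` helper exported by MinIO's logger package; the function shown is illustrative):
```
package custom

import (
	"context"
	"log"

	"github.com/minio/minio/cmd/logger"
)

// auditAccessKey sketches 3rd-party code (e.g. a custom ObjectLayer
// method) reading the access key back out of the request context.
// GetReqInfo returns nil when the context carries no ReqInfo, so it
// must be nil-checked before use.
func auditAccessKey(ctx context.Context) {
	if reqInfo := logger.GetReqInfo(ctx); reqInfo != nil {
		log.Printf("request made with access key %q", reqInfo.AccessKey)
	}
}
```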
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Kaloyan Raev <kaloyan@storj.io>
On extremely long running listings keep the transient list 15 minutes after last update instead of using start time.
Also don't do overlap checks on transient lists.
Add trashcan that keeps recently updated lists after bucket deletion.
All caches were deleted once a bucket was deleted, so caches still running would report errors. Now they are canceled.
Fix `.minio.sys` not being transient.
Bonus fixes: remove the retry package as it is harder to get
right, and remove the managed context so that we don't have
to rely on it anymore; instead use a simple jitter retry.
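Such a jitter retry can be a very small loop; a rough sketch (names and the base delay are illustrative, not the actual implementation):
```
package retry

import (
	"context"
	"math/rand"
	"time"
)

// withJitter retries fn until it succeeds or ctx is canceled, sleeping
// a random duration in [0, base) between attempts so that competing
// retriers do not synchronize against each other. base must be > 0.
func withJitter(ctx context.Context, base time.Duration, fn func() error) error {
	for {
		if err := fn(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(rand.Int63n(int64(base)))):
		}
	}
}
```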
WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers
used by `io.CopyBuffer` even if they are pooled.
Since all writers appear to write byte buffers, just send those
instead and write directly. The files are opened through the `os`
package so they have no special properties anyway.
This removes the alloc and copy for each operation.
REST sends content length so a precise alloc can be made.
this reduces allocations by an order of magnitude
Also, revert "erasure: delete dangling objects automatically (#10765)"
affects list caching should be investigated.
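The gist of the WriteAll change, sketched (not the exact code): accept the already-read byte slice and write it directly, instead of draining an io.Reader through a pooled 4MiB copy buffer:
```
package storage

import "os"

// writeAll sketches the optimized path: one open, one write, no
// io.CopyBuffer and no pooled buffer, since the REST layer already
// holds the body as a precisely-allocated byte slice.
func writeAll(path string, b []byte) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o666)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(b)
	return err
}
```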
Add a store-and-forward option for single-part
uploads when async mode is enabled with env
MINIO_CACHE_COMMIT=writeback
It defaults to `writethrough` if unspecified.
Bonus fixes: we do not need to reload the format anymore;
as the replaced drive is healed locally we only need
to ensure that drive heal reloads the drive properly.
We preserve the UUIDs in the original order; this means
that the replacement in `format.json` doesn't mean that
the drive needs to be reloaded into memory anymore.
fixes#10791
when the server is booting up there is a possibility
that users might see '503' because the object layer
is not yet initialized; in that case the request is proxied
to the first online neighboring peer.
* Fix caches having EOF marked as a failure.
* Simplify cache updates.
* Provide context for checkMetacacheState failures.
* Log 499 when the client disconnects.
`decryptObjectInfo` is a significant bottleneck when listing objects.
Reduce the allocations for a significant speedup.
https://github.com/minio/sio/pull/40
```
λ benchcmp before.txt after.txt
benchmark old ns/op new ns/op delta
Benchmark_decryptObjectInfo-32 24260928 808656 -96.67%
benchmark old MB/s new MB/s speedup
Benchmark_decryptObjectInfo-32 0.04 1.24 31.00x
benchmark old allocs new allocs delta
Benchmark_decryptObjectInfo-32 75112 48996 -34.77%
benchmark old bytes new bytes delta
Benchmark_decryptObjectInfo-32 287694772 4228076 -98.53%
```
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c
Gist of improvements:
* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler are used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no
longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
only newly replaced drives get the new `format.json`;
this avoids disks reloading their in-memory reference
format and ensures that drives come online without
such a reload.
keeping the reference format intact means UUIDs
never change once they are formatted.
lockers currently might leave stale locks behind,
waiting in unknown ways on downed lockers;
the locker check interval is high enough to safely
clean up stale locks.
reference format should be source of truth
for inconsistent drives which reconnect,
add them back to their original position
remove the automatic fix for existing offline
disk UUIDs
Bonus fixes
- logging improvements to ensure that we don't use
`go logger.LogIf`, which makes runtime.Caller miss
the function name; log where necessary.
- remove unused code in erasure sets
Test TestDialContextWithDNSCacheRand was failing sometimes because it depends
on a random selection of addresses when testing random DNS resolution from cache.
Lower addr selection exception to 10%
Allow requests to come in for users as soon as object
layer and config are initialized, this allows users
to be authenticated sooner and would succeed automatically
on servers which are yet to fully initialize.
Go stdlib resolver doesn't support caching DNS
resolutions; since we compile with CGO disabled
we are more prone to DNS flooding for all network
calls that resolve DNS via the DNS server.
Under various containerized environments such as
VMWare this becomes a problem because there are
no DNS caches available and we may end up overloading
the kube-dns resolver under concurrent I/O.
To circumvent this issue implement a DNSCache resolver
which resolves DNS and caches results for around 10secs,
with invalidation attempted every 3secs.
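A minimal sketch of such a caching resolver (simplified to a per-entry TTL; the actual implementation also runs the periodic background invalidation):
```
package dnscache

import (
	"context"
	"net"
	"sync"
	"time"
)

type entry struct {
	addrs   []string
	expires time.Time
}

// Resolver caches lookups for ttl (~10s in this PR) so that bursts of
// concurrent I/O do not flood the upstream resolver such as kube-dns.
type Resolver struct {
	mu    sync.Mutex
	cache map[string]entry
	ttl   time.Duration
}

func NewResolver(ttl time.Duration) *Resolver {
	return &Resolver{cache: make(map[string]entry), ttl: ttl}
}

// LookupHost serves cached addresses while they are fresh, otherwise
// falls through to the real resolver and refreshes the cache entry.
func (r *Resolver) LookupHost(ctx context.Context, host string) ([]string, error) {
	r.mu.Lock()
	if e, ok := r.cache[host]; ok && time.Now().Before(e.expires) {
		r.mu.Unlock()
		return e.addrs, nil
	}
	r.mu.Unlock()

	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return nil, err
	}
	r.mu.Lock()
	r.cache[host] = entry{addrs: addrs, expires: time.Now().Add(r.ttl)}
	r.mu.Unlock()
	return addrs, nil
}
```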
connect disks pre-emptively upon startup, to ensure
enough disks are connected at startup rather than
waiting for them.
we need to do this to avoid long wait times for the server
to be online when servers come up in rolling-upgrade
fashion
Only use dynamic delays for the crawler. Even though the max wait was 1 second the number
of waits could severely impact crawler speed.
Instead of relying on a global metric, we use the stateless local delays to keep the crawler
running at a speed more adjusted to current conditions.
The only case we keep it is before bitrot checks when enabled.
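The stateless local delay can be sketched roughly as: sleep in proportion to how long the previous operation took, capped at a maximum (names and the cap are illustrative):
```
package crawler

import (
	"context"
	"time"
)

// dynamicSleep sleeps factor*elapsed (capped at max) after an operation
// that took elapsed, so the crawler slows down in proportion to how
// loaded the disks currently are, without consulting a global metric.
func dynamicSleep(ctx context.Context, elapsed time.Duration, factor float64, max time.Duration) {
	d := time.Duration(float64(elapsed) * factor)
	if d > max {
		d = max
	}
	select {
	case <-ctx.Done():
	case <-time.After(d):
	}
}
```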
This PR fixes a hang which occurs quite commonly at higher concurrency
by making the following changes
- allowing fewer connections in time_wait allows faster socket opens
- lower idle connection timeout to ensure that we let the kernel
reclaim the time_wait connections quickly
- increase somaxconn to 4096 instead of 2048 to allow larger TCP
SYN backlogs.
fixes#10413
This change tracks bandwidth for a bucket and object
- [x] Add Admin API
- [x] Add Peer API
- [x] Add BW throttling
- [x] Admin APIs to set replication limit
- [x] Admin APIs for fetch bandwidth
In almost all scenarios MinIO is now
mostly ready for all sub-systems
independently; safe-mode is not useful
anymore and does not serve its original
intended purpose.
allow the server to be fully functional
even with config partially configured,
this is to prioritize availability of actual
I/O v/s manually fixing the server.
In k8s-like environments it will never make
sense to take a pod into safe-mode state,
because there is no real access to perform
any remote operation on them.
- select lockers which are non-local and online to have
affinity towards remote servers for lock contention
- optimize lock retry interval to avoid sending too many
messages during lock contention, reduces average CPU
usage as well
- if the bucket is not set when deleteObject fails, make sure
setPutObjHeaders() honors lifecycle only if the bucket name
is set.
- fix top locks to always list the oldest lockers,
avoiding getting bogged down in the map's unordered nature.
This is to allow remote targets to be generalized
for replication/ILM transition
Also adding a field in BucketTarget to identify
a remote target with a label.
This commit fixes a misuse of the `http.ResponseWriter.WriteHeader`.
A caller should **either** call `WriteHeader` exactly once **or**
write to the response writer, causing an implicit 200 OK.
Writing the response headers more than once causes a `http: superfluous
response.WriteHeader call` log message. This commit fixes this
by preventing a 2nd `WriteHeader` call being forwarded to the underlying
`ResponseWriter`.
Updates #10587
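A sketch of the guard (simplified; the real ResponseWriter wrapper in the codebase tracks more state):
```
package httputil

import "net/http"

// onceWriter forwards WriteHeader to the underlying ResponseWriter at
// most once, so a second call does not trigger Go's
// "http: superfluous response.WriteHeader call" log message.
type onceWriter struct {
	http.ResponseWriter
	wroteHeader bool
}

func (w *onceWriter) WriteHeader(code int) {
	if w.wroteHeader {
		return
	}
	w.wroteHeader = true
	w.ResponseWriter.WriteHeader(code)
}

func (w *onceWriter) Write(b []byte) (int, error) {
	// An implicit 200 OK on first Write also counts as having
	// written the header.
	w.wroteHeader = true
	return w.ResponseWriter.Write(b)
}
```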
* add NVMe drive info [model num, serial num, drive temp. etc.]
* Ignore fuse partitions
* Add the nvme logic only for linux
* Move smart/nvme structs to a separate file
Co-authored-by: wlan0 <sidharthamn@gmail.com>
throw a proper error when the port is not accessible
to the regular user; this is possibly a regression.
```
ERROR Unable to start the server: Insufficient permissions to use specified port
> Please ensure MinIO binary has 'cap_net_bind_service=+ep' permissions
HINT:
Use 'sudo setcap cap_net_bind_service=+ep /path/to/minio' to provide sufficient permissions
```
After #10594 let's invalidate the bloom filters to force the next cycles to go through all data.
There is a small chance that the linked PR could have caused missing bloom filter data.
This will invalidate the current bloom filters and make the crawler go through everything.
Route based on the source IP if found. This should distribute
the listing load for V1 and versioning on multiple nodes
evenly between different clients.
If the source IP is not found in the http request header, then fall back
to the bucket name instead.
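A rough sketch of the routing decision; which header carries the source IP is an assumption here (the real code may consult several):
```
package routing

import (
	"hash/crc32"
	"net/http"
)

// pickNode hashes the client source IP (here assumed to come from the
// X-Forwarded-For header) to choose a node; when no source IP is
// available it falls back to hashing the bucket name, so the same
// bucket still lands on the same node.
func pickNode(r *http.Request, bucket string, nodes int) int {
	key := r.Header.Get("X-Forwarded-For") // assumed source-IP header
	if key == "" {
		key = bucket
	}
	return int(crc32.ChecksumIEEE([]byte(key)) % uint32(nodes))
}
```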
Disallow versioning suspension on a bucket with
pre-existing replication configuration
If versioning is suspended on the target, replication
should fail.
`mc admin info` on busy setups will not move HDD
heads unnecessarily for repeated calls, providing
better responsiveness for the call overall.
Bonus change: allow listTolerancePerSet to be N-1
for good entries, to avoid skipping entries when
for some reason one of the disks went offline.
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.
There are more pending changes where we should move
all the disk formatting logic to backend drives; this
PR doesn't deal with that refactor, instead it makes it
easier to track healing in the future.
- Add owner information for expiry, locking, unlocking a resource
- TopLocks returns now locks in quorum by default, provides
a way to capture stale locks as well with `?stale=true`
- Simplify the quorum handling for locks, avoiding deriving it
from storage class, because there were challenges making it
consistent across all situations.
- And other tiny simplifications to reset locks.
context canceled errors bubbling up from the network
layer have the potential to be misconstrued as network
errors, prematurely taking a server offline and triggering
a health check routine; avoid this potential occurrence.
isEnded() was incorrectly calculating if the current healing sequence is
ended or not. h.currentStatus.Items could be empty if healing is very
slow and mc admin heal consumed all items.
As bulk/recursive deletes require multiple connections to be open at an instant,
the default open connections limit will be reached, which results in the following error
```FATAL: sorry, too many clients already```
By setting the open connections to a reasonable value - `2` - we ensure that the max open connections
will not be exhausted and stay within bounds.
The queries are simple inserts/updates/deletes, which are operational and sufficient with a
maximum open connection limit of 2.
Fixes#10553
Allow user configuration for MaxOpenConnections
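With database/sql the cap is a single call; a sketch of how the target might open its pool (driver and helper names are illustrative):
```
package target

import (
	"database/sql"

	_ "github.com/lib/pq" // assumed PostgreSQL driver
)

// openDB caps the connection pool; maxOpen is the user configuration,
// defaulting to 2, which keeps bulk deletes well under PostgreSQL's
// max_connections.
func openDB(dsn string, maxOpen int) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	if maxOpen <= 0 {
		maxOpen = 2
	}
	db.SetMaxOpenConns(maxOpen)
	return db, nil
}
```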
It is possible the heal drives are not reported from
the maintenance check because the background heal
state simply relied on the `format.json` for capturing
unformatted drives. It is possible that drives might
still be healing - make sure that applications which
rely on the cluster health check get this detail back.
Also, revamp the way ListBuckets works, making a few portions
of the healing logic parallel
- walk objects for healing disks in parallel
- collect the list of buckets in parallel across drives
- provide consistent view for listBuckets()
Healing was not working correctly in the distributed mode because
errFileVersionNotFound was not properly converted in storage rest
client.
Besides, fix healing of delete markers, which was not working as expected.
performance improves by around 100x or more
```
go test -v -run NONE -bench BenchmarkGetPartFile
goos: linux
goarch: amd64
pkg: github.com/minio/minio/cmd
BenchmarkGetPartFileWithTrie
BenchmarkGetPartFileWithTrie-4 1000000000 0.140 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/minio/minio/cmd 1.737s
```
fixes#10520
* Fix cases where minimum timeout > default timeout.
* Add defensive code for too small/negative timeouts.
* Never set timeout below the maximum value of a request.
* Protect against (unlikely) int64 wraps.
* Decrease timeout slower.
* Don't re-lock before copying.
This bug was introduced in 14f0047295
almost 3yrs ago, as a side effect of removing stale `fs.json`;
we in fact end up removing the existing good `fs.json` for an
existing object, leading to some form of data loss.
fixes#10496
from 20s for 10000 parts to less than 1sec
Without the patch
```
~ time aws --endpoint-url=http://localhost:9000 --profile minio s3api \
list-parts --bucket testbucket --key test \
--upload-id c1cd1f50-ea9a-4824-881c-63b5de95315a
real 0m20.394s
user 0m0.589s
sys 0m0.174s
```
With the patch
```
~ time aws --endpoint-url=http://localhost:9000 --profile minio s3api \
list-parts --bucket testbucket --key test \
--upload-id c1cd1f50-ea9a-4824-881c-63b5de95315a
real 0m0.891s
user 0m0.624s
sys 0m0.182s
```
fixes#10503
It was observed in a VMware vSphere environment during a
pod replacement that `mc admin info` might report incorrect
offline nodes for the replaced drive. This issue eventually
goes away but requires quite a lot of time for all servers
to be in sync.
This PR fixes this behavior properly.
If the ILM document requires removing noncurrent versions,
the server should be able to remove 'null' versions as well.
'null' versions are created when versioning is not enabled
or suspended.
The entire encryption layer depends on the fact that
a KMS must be configured for S3 encryption to work properly,
and we only support passing the headers as-is to the backend
for encryption if KMS is configured.
Make sure that this predictability is maintained; currently
the code was allowing encryption to go through, only to fail
later indicating that KMS was not configured. We should
simply reply "NotImplemented" if KMS is not configured; this
allows clients to simply proceed with their tests.
This is to ensure that Go contexts work properly, after some
interesting experiments I found that Go net/http doesn't
cancel the context when Body is non-zero and hasn't been
read till EOF.
The following gist explains this; it can lead to a pile-up
of go-routines on the server which will never be canceled
and will die at a much later point in time, which can
simply overwhelm the server.
https://gist.github.com/harshavardhana/c51dcfd055780eaeb71db54f9c589150
To avoid this refactor the locking such that we take locks after we
have started reading from the body and only take locks when needed.
Also, remove contextReader as it's not useful; it doesn't work as expected since the
context is not canceled until the body reaches EOF, so there is no point
in wrapping it with a context and putting a `select {` on it, which
can unnecessarily increase the CPU overhead.
We will still use the context to cancel the lockers etc.
Additional simplification in the locker code to avoid timers,
as re-using them is a complicated ordeal; avoid them in
the hot path, since locking is very common this may avoid
lots of allocations.
configurable remote transport timeouts for some special cases
where this value needs to be bumped to a higher value when
transferring large data between federated instances.
In `(*cacheObjects).GetObjectNInfo` copy the metadata before spawning a goroutine.
Clean up a few map[string]string copies as well, reducing allocs and simplifying the code.
Fixes#10426
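The fix boils down to snapshotting the map before the goroutine starts, since Go maps are unsafe for concurrent read/write; a minimal sketch:
```
package cache

// cloneMetadata returns a private copy of the metadata map so that a
// background goroutine can read it while the caller keeps mutating
// the original.
func cloneMetadata(src map[string]string) map[string]string {
	dst := make(map[string]string, len(src))
	for k, v := range src {
		dst[k] = v
	}
	return dst
}
```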
From https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html#intro-lifecycle-rules-actions
```
When specifying the number of days in the NoncurrentVersionTransition
and NoncurrentVersionExpiration actions in a Lifecycle configuration,
note the following:
It is the number of days from when the version of the object becomes
noncurrent (that is, when the object is overwritten or deleted), that
Amazon S3 will perform the action on the specified object or objects.
Amazon S3 calculates the time by adding the number of days specified in
the rule to the time when the new successor version of the object is
created and rounding the resulting time to the next day midnight UTC.
For example, in your bucket, suppose that you have a current version of
an object that was created at 1/1/2014 10:30 AM UTC. If the new version
of the object that replaces the current version is created at 1/15/2014
10:30 AM UTC, and you specify 3 days in a transition rule, the
transition date of the object is calculated as 1/19/2014 00:00 UTC.
```
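The quoted rule reduces to simple date arithmetic; a sketch matching the example above (successor created 1/15/2014 10:30 UTC plus 3 days rounds to 1/19/2014 00:00 UTC):
```
package lifecycle

import "time"

// expectedExpiry starts from the time the object became noncurrent
// (the successor's creation time), adds the configured days, and
// rounds up to the next midnight UTC, per the quoted rule.
func expectedExpiry(successorModTime time.Time, days int) time.Time {
	t := successorModTime.UTC().AddDate(0, 0, days)
	return time.Date(t.Year(), t.Month(), t.Day()+1, 0, 0, 0, 0, time.UTC)
}
```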
This PR adds a DNS target that ensures to update an entry
into Kubernetes operator when a bucket is created or deleted.
See minio/operator#264 for details.
Co-authored-by: Harshavardhana <harsha@minio.io>
MaxConnsPerHost can potentially hang a call without any
way to time out; we do not need this setting for our proxy
and gateway implementations, instead IdleConn settings are
good enough.
Also ensure to use NewRequestWithContext and make sure to
take the disks offline only for network errors.
Fixes#10304
inconsistent drive healing when one of the drives is offline
while a new drive was replaced; this change ensures
that we can add the offline drive back into the mix by
healing it again.
Add context to all (non-trivial) calls to the storage layer.
Contexts are propagated through the REST client.
- `context.TODO()` is left in place for the places where it needs to be added to the caller.
- `endWalkCh` could probably be removed from the walkers, but no changes so far.
The "dangerous" part is that now a caller disconnecting *will* propagate down, so a
"delete" operation will now be interrupted. In some cases we might want to disconnect
this functionality so the operation completes if it has started, leaving the system in a cleaner state.
This commit refactors the certificate management implementation
in the `certs` package such that multiple certificates can be
specified at the same time. Therefore, the following layout of
the `certs/` directory is expected:
```
certs/
│
├─ public.crt
├─ private.key
├─ CAs/ // CAs directory is ignored
│   │
│   ...
│
├─ example.com/
│   │
│   ├─ public.crt
│   └─ private.key
│
└─ foobar.org/
    │
    ├─ public.crt
    └─ private.key
...
```
However, directory names like `example.com` are just for human
readability/organization and don't have any meaning w.r.t whether
a particular certificate is served or not. This decision is made based
on the SNI sent by the client and the SAN of the certificate.
***
The `Manager` will pick a certificate based on the client trying
to establish a TLS connection. In particular, it looks at the client
hello (i.e. SNI) to determine which host the client tries to access.
If the manager can find a certificate that matches the SNI it
returns this certificate to the client.
However, the client may choose not to send an SNI or may try to access
the server directly via IP (`https://<ip>:<port>`). In this case, we
cannot use the SNI to determine which certificate to serve. However,
we also should not pick "the first" certificate that would be accepted
by the client (based on crypto parameters - like a signature algorithm)
because it may be an internal certificate that contains internal hostnames.
We would disclose internal infrastructure details doing so.
Therefore, the `Manager` returns the "default" certificate when the
client does not specify an SNI. The default certificate is the top-level
`public.crt` - i.e. `certs/public.crt`.
This approach has some consequences:
- It's the operator's responsibility to ensure that the top-level
`public.crt` does not disclose any information (i.e. hostnames)
that are not publicly visible. However, this was the case in the
past already.
- Any other `public.crt` - except for the top-level one - must not
contain any IP SAN. The reason for this restriction is that the
Manager cannot match a SNI to an IP b/c the SNI is the server host
name. The entire purpose of SNI is to indicate which host the client
tries to connect to when multiple hosts run on the same IP. So, a
client will not set the SNI to an IP.
If we allowed IP SANs in a lower-level `public.crt`, a user would
expect that it is possible to connect to MinIO directly via IP address
and that the MinIO server would pick "the right" certificate. However,
the MinIO server cannot determine which certificate to serve, and
therefore always picks the "default" one. This may lead to all sorts
of confusing errors like:
"It works if I use `https://instance.minio.local` but not when I use
`https://10.0.2.1`."
These consequences/limitations should be pointed out / explained in our
docs in an appropriate way. However, the support for multiple
certificates should not have any impact on how deployments with a single
certificate function today.
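The selection logic amounts to a `tls.Config.GetCertificate` callback; a simplified sketch (the real Manager also handles wildcard SANs and certificate reloading):
```
package certs

import "crypto/tls"

// getCertificate matches the SNI from the client hello against the
// loaded certificates; with no SNI (e.g. direct IP access) or no match
// it serves the default top-level certs/public.crt certificate rather
// than leaking an internal one.
func getCertificate(byHost map[string]*tls.Certificate, defaultCert *tls.Certificate) func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
		if hello.ServerName != "" {
			if cert, ok := byHost[hello.ServerName]; ok {
				return cert, nil
			}
		}
		return defaultCert, nil
	}
}
```
It would be wired up via `tls.Config{GetCertificate: getCertificate(...)}`.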
Co-authored-by: Harshavardhana <harsha@minio.io>
- do not fail the healthcheck if heal status
was not obtained from one of the nodes,
if many nodes fail then report this as a
catastrophic error.
- add "x-minio-write-quorum" value to match
the write tolerance supported by server.
- admin info now states if a drive is healing
where madmin.Disk.Healing is set to true
and madmin.Disk.State is "ok"
Currently, cache purges are triggered as soon as the low watermark is exceeded.
To reduce IO this should only be done when reaching the high watermark.
This simplifies checks and reduces all calls for a GC to go through
`dcache.diskSpaceAvailable(size)`. While a comment claims that
`dcache.triggerGC <- struct{}{}` was non-blocking I don't see how
that was possible. Instead, we give the trigger channel a buffer
of size 1 and use channel semantics to avoid blocking when a GC has
already been requested.
`bytesToClear` now takes the high watermark into account so it will
not request any bytes to be cleared until that is reached.
This commit reduces the retry delay when retrying a request
to a KES server by:
- reducing the max. jitter delay from 3s to 1.5s
- skipping the random delay when there are more KES endpoints
available.
If there are more KES endpoints we can directly retry the request
by sending it to the next endpoint - as pointed out by @krishnasrinivas
- delete-marker should be created on a suspended bucket as `null`
- delete-marker should delete any pre-existing `null` versioned
object and create an entry `null`
When checking parts we already do a stat for each part.
Since we have the on-disk size, check that it is at least what we expect.
When checking metadata, check that the metadata is not 0 bytes.
The `getNonLoopBackIP` may grab an IP from an interface that
doesn't allow binding (on Windows), so this test consistently fails.
We exclude that specific error.
* readDirN: Check if file is directory
`syscall.FindNextFile` crashes if the handle is a file.
`errFileNotFound` matches 'unix' functionality: d19b434ffc/cmd/os-readdir_unix.go (L106)
Fixes#10384
This commit addresses a maintenance / automation problem when MinIO-KES
is deployed on bare-metal. In orchestrated env. the orchestrator (K8S)
will make sure that `n` KES servers (IPs) are available via the same DNS
name. There it is sufficient to provide just one endpoint.
ListObjectsV1 requests are actually redirected to a specific node,
depending on the bucket name. The purpose of this behavior was
to optimize listing.
However, the current code sends a Bad Gateway error if the
target node is offline, which is bad behavior because it means
the list request will fail, although this is unnecessary since
we can still use the current node to list as well (the default behavior
without the proxying optimization).
Currently, you can see mint fail when there is one offline node; after
this PR, mint will always succeed.
We can reduce this further in the future, but this is a good
value to keep around. With the advent of continuous healing,
we can be assured that the namespace will eventually be
consistent, so we are okay to avoid the necessity of
a list across all drives on all sets.
Bonus: Pop()'s in parallel seem to have the potential to
wait too long on large drive setups and cause more slowness
instead of gaining any performance; remove it for now.
Also, implement load balanced reply for local disks,
ensuring that local disks have an affinity for
- cleanupStaleMultipartUploads()
If DiskInfo calls failed the information returned was used anyway
resulting in no endpoint being set.
This would make the drive be attributed to the local system since
`disk.Endpoint == disk.DrivePath` in that case.
Instead, if the call fails record the endpoint and the error only.
bonus: make sure to ignore objectNotFound and versionNotFound
errors properly at all layers, since HealObjects() returns an
objectNotFound error if the bucket or prefix is empty.
When crawling never use a disk we know is healing.
Most of the change involves keeping track of the original endpoint on xlStorage
and this also fixes DiskInfo.Endpoint never being populated.
Heal master will print `data-crawl: Disk "http://localhost:9001/data/mindev/data2/xl1" is
Healing, skipping` once on a cycle (no more often than every 5m).
We should only enforce quotas if no error has been returned.
firstErr is safe to access since all goroutines have exited at this point.
If `firstErr` hasn't been set by something else return the context error if cancelled.
Keep dataUpdateTracker while goroutine is starting.
This will ensure the object is updated once `start` returns.
Tested with
```
λ go test -cpu=1,2,4,8 -test.run TestDataUpdateTracker -count=1000
PASS
ok github.com/minio/minio/cmd 8.913s
```
Fixes#10295
on fresh drive setups when one of the drives is
a root drive, we should ignore such a root
drive and not proceed to format it.
This PR handles this properly by marking
the disks which are root disks and taking
them offline.
newDynamicTimeout should be allocated once; in case
of temporary locks in config and IAM we should
have allocated the timeout once before the `for loop`
This PR doesn't fix any issue as such, but provides
enough dynamism for the timeout as per expectation.
Based on our previous conversations I assume we should send the version
id when healing an object.
Maybe we should even list object versions and heal all?
In non-recursive mode, issuing a list request where the prefix
is an existing object with a trailing slash and the delimiter is a slash
will return entries in the object directory (data dir IDs)
```
$ aws s3api --profile minioadmin --endpoint-url http://localhost:9000 \
list-objects-v2 --bucket testbucket --prefix code_of_conduct.md/ --delimiter '/'
{
    "CommonPrefixes": [
        {
            "Prefix": "code_of_conduct.md/ec750fe0-ea7e-4b87-bbec-1e32407e5e47/"
        }
    ]
}
```
This commit adds a fast exit track in Walk() in this specific case.
use `/etc/hosts` instead of `/` to check for a common
device id; if the device is the same for `/etc/hosts`
and the --bind mount, we detect a root disk.
Bonus: enhance healthcheck logging by adding maintenance
tags to all messages.
adds a feature where we can fetch the MinIO
command-line remotely; this
is primarily meant to add some stateless
nature to MinIO deployments in k8s
environments. The MinIO operator would run a
webhook service endpoint
which can be used to fetch any environment
value in a generalized approach.
It is possible in situations when the server was deployed
in an asymmetric configuration in the past, such as
```
minio server ~/fs{1...4}/disk{1...5}
```
This results in a setDriveCount of 10 in older releases,
but with fairly recent releases we have moved to
having server affinity, which means that the set drive
count ascertained from the above config will now be '4'.
While the object layer makes sure that we honor
`format.json`, the storageClass configuration however
was by mistake using the global value obtained
by heuristics, which leads to prematurely using
lower parity without being requested by an
administrator.
This PR fixes this behavior.
Bonus: also fix a bug where we did not purge
service accounts generated by rotating credentials
appropriately; service accounts should become invalid
as soon as their corresponding parent user becomes invalid.
Since service accounts themselves always carry the parent claim,
we would never reach this problem, as the access gets
rejected at the IAM policy layer.
when source and destination are the same and versioning is enabled
on the destination bucket, we do not need to re-create the entire
object once again; this optimizes on space utilization.
Cases this PR does not support:
- any pre-existing legacy object will not
be preserved in this manner, meaning a new
dataDir will be created.
- key-rotation and storage class changes
of course will never re-use the dataDir
conflicting files can exist on FS at
`.minio.sys/buckets/testbucket/policy.json/`; this is an
expected valid scenario for FS mode, allow it to work,
i.e. ignore it and move forward
With reduced parity our write quorum should be same
as read quorum, but code was still assuming
```
readQuorum+1
```
in all situations, which is not necessary.
Generalize replication target management so
that remote targets for a bucket can be
managed with ARNs. `mc admin bucket remote`
command will be used to manage targets.
Context timeouts might race with each other when timeouts are lower,
i.e. when two lock attempts happen very quickly on the same resource
and the servers are still trying to establish quorum.
This situation can lead to locks being held which would never be unlocked,
and subsequent lock attempts would fail.
This would require a complete server restart. A potential case of this
issue happening is when the server is booting up and we are trying
to hold a 'transaction.lock' in quick bursts of timeout.
replace the dummy buffer with nullReader{} instead,
to avoid large memory allocations in
memory-constrained environments; allows running
obd tests in such environments.
Currently, listing directories on HDFS incurs a per-entry remote Stat() call
penalty, the cost of which can really blow up on directories with many
entries (+1,000) especially when considered in addition to peripheral
calls (such as validation) and the fact that minio is an intermediary to the
client (whereas other clients listed below can query HDFS directly).
Because listing directories this way is expensive, the Golang HDFS library
provides the [`Client.Open()`] function which creates a [`FileReader`] that is
able to batch multiple calls together through the [`Readdir()`] function.
This is substantially more efficient for very large directories.
In one case we were witnessing about +20 seconds to list a directory with 1,500
entries, admittedly large, but the Java hdfs ls utility as well as the HDFS
library sample ls utility were much faster.
Hadoop HDFS DFS (4.02s):
λ ~/code/minio → use-readdir
» time hdfs dfs -ls /directory/with/1500/entries/
…
hdfs dfs -ls 5.81s user 0.49s system 156% cpu 4.020 total
Golang HDFS library (0.47s):
λ ~/code/hdfs → master
» time ./hdfs ls -lh /directory/with/1500/entries/
…
./hdfs ls -lh 0.13s user 0.14s system 56% cpu 0.478 total
mc and minio **without** optimization (16.96s):
λ ~/code/minio → master
» time mc ls myhdfs/directory/with/1500/entries/
…
./mc ls 0.22s user 0.29s system 3% cpu 16.968 total
mc and minio **with** optimization (0.40s):
λ ~/code/minio → use-readdir
» time mc ls myhdfs/directory/with/1500/entries/
…
./mc ls 0.13s user 0.28s system 102% cpu 0.403 total
[`Client.Open()`]: https://godoc.org/github.com/colinmarc/hdfs#Client.Open
[`FileReader`]: https://godoc.org/github.com/colinmarc/hdfs#FileReader
[`Readdir()`]: https://godoc.org/github.com/colinmarc/hdfs#FileReader.Readdir
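The batched listing from the linked docs boils down to one Open plus Readdir; a sketch against the colinmarc/hdfs API:
```
package hdfsgw

import (
	"os"

	"github.com/colinmarc/hdfs"
)

// listDir fetches all directory entries in batched RPCs via a single
// FileReader, instead of issuing a remote Stat() per entry.
func listDir(client *hdfs.Client, dir string) ([]os.FileInfo, error) {
	f, err := client.Open(dir)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return f.Readdir(0) // n <= 0 returns all remaining entries
}
```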
If there are many listeners to bucket notifications or to the trace
subsystem, healing fails to work properly since it suspends itself when
the number of concurrent connections is above a certain threshold.
These connections are also continuous and not costly (*no disk access*),
so it is okay to just ignore them in waitForLowHTTPReq().
this is to quickly detect situations of corrupted disk
format etc. and keep the disk online
in such scenarios, so that requests fail appropriately.
Fixes two different types of problems
- continuation of the problem seen in FS #9992,
which was not fixed for erasure coded deployments;
reproduced this issue with spark and it is fixed now
- another issue was leaking walk go-routines which
would lead to high memory usage and crash the system;
this is simply because all the walks which were purged
at the top limit had leaking end walkers which would
consume memory endlessly.
closes#9966 closes#10088
not all claims need to be present for
the JWT claim; let the policies not
exist and only apply those which are present
when generating the credentials.
once credentials are generated then
those policies should exist, otherwise
the request will fail.
- copyObject in-place decryption failed
due to incorrect verification of headers
- do not decode ETag when object is encrypted
with SSE-C, so that pre-conditions don't fail
prematurely.
This PR adds support for healing older
content, i.e. from 2yrs or 1yr ago. Also handles
other situations where our config was
not encrypted yet.
This PR also ensures that our listing
is consistent and quorum friendly,
such that we don't list partial objects.
In federated NAS gateway setups, multiple hosts in srvRecords
were picked at random, which could mean that if one of the
hosts was down the request could indeed fail, and if the client
retried it would succeed. Instead allow the server to figure
out the current online host quickly such that we can
exclude the host which is down.
At most the attempt to look for a downed node takes
300 milliseconds; if the node takes longer to respond
than this value we simply ignore it and move to the next node;
total attempts are equal to the number of srvRecords, and if no
server is online we simply fall back to the last dialed host.
healing was not working properly when drives were
replaced, due to the error check in the root disk
calculation; this PR fixes this behavior.
This PR also adds an additional fix for missing
metadata entries from .minio.sys as part of
disk healing.
Added code to ignore and print more
context-sensitive errors for better debugging.
This PR is a continuation of the fix in 7b14e9b660
Enforce bucket quotas when crawling has finished.
This ensures that we will not do quota enforcement on old data.
Additionally, delete less if we are closer to quota than we thought.
for users who don't have access to the HDFS rootPath '/',
they can optionally specify `minio gateway hdfs hdfs://namenode:8200/path`
to which they have access, allowing all writes to be
performed at `/path`.
NOTE: once configured in this manner you need to make
sure the command line is correctly specified, otherwise
your data might not be visible
closes#10011
this PR allows legacy support for big-data
applications which run older Java versions
that do not support the secure ciphers
currently defaulted in MinIO. This option
optionally allows turning them off such
that client and server can negotiate the
best ciphers themselves.
This env is purposefully not documented,
meant as a last resort when client
application cannot be changed easily.
- admin info node offline check is now quicker
- admin info now doesn't duplicate the code
across doing the same checks for disks
- rely on StorageInfo to return appropriate errors
instead of calling locally.
- diskID checks now return proper errors when
disk not found v/s format.json missing.
- add more disk states for more clarity on the
underlying disk errors.
while we handle all situations for writes and reads
on the older format, what we didn't cater for properly
yet was delete, where we only ended up deleting
just `xl.meta`; instead we should allow all
deletes to go through for the older format on
buckets without versioning enabled.
CORS is notorious: it requires specific headers to be
handled appropriately in request and response;
using the cors package as part of handlerFunc() for
the OPTIONS method lacks the necessary control this
package needs to add headers.
Without instantiating a new rest client we can
have a recursive error which can lead to the
healthcheck always returning offline; this can
prematurely take the servers offline.
gorilla/mux broke their recent release 1.7.4, which we
upgraded to; we need the current workaround to ensure
that our regex matches appropriately.
An upstream PR has been sent; we should remove the
workaround once we have a new release.
- reduce locker timeout for early transaction lock
for more eagerness to timeout
- reduce leader lock timeout to range from 30sec to 1minute
- add additional log message during bootstrap phase
Main issue is that `t.pool[params]` should be `t.pool[oldest]`.
We add a bit more safety features for the code.
* Make writes to the endTimerCh non-blocking in all cases
so multiple releases cannot lock up.
* Double check expectations.
* Shift down deletes with copy instead of truncating slice.
* Actually delete the oldest if we are above total limit.
* Actually delete the oldest found and not the current.
* Unexport the mutex so nobody from the outside can meddle with it.
This commit adds a new admin API for creating master keys.
An admin client can send a POST request to:
```
/minio/admin/v3/kms/key/create?key-id=<keyID>
```
The name / ID of the new key is specified as request
query parameter `key-id=<ID>`.
Creating new master keys requires KES - it does not work with
the native Vault KMS (deprecated) nor with a static master key
(deprecated).
Further, this commit removes the `UpdateKey` method from the `KMS`
interface. This method is not needed and not used anymore.
with the merge of https://github.com/etcd-io/etcd/pull/11823,
etcd v3.5.0 will now have a properly imported versioned path;
this fixes our pending migration to the newer repo
Use a separate client for these calls that can take a long time.
Add request context to these so they are canceled when the client
disconnects, except for ListObjects which doesn't have any equivalent.
Healing an object which has multiple versions was not working because
the healing code forgot to consider the errFileVersionNotFound error as a
case that needs healing.
Currently, lifecycle expiry is deleting all object versions, which is not
correct unless the noncurrent versions field is specified.
Also, only delete the delete marker if it is the only version of the
given object.
- additionally upgrade to msgp@v1.1.2
- change StatModTime, StatSize fields to
simple Size/ModTime
- reduce 50000 entries per List batch to 10000,
as the client sometimes needs to wait too long
to see the first batch, which is not desired;
it is worth writing the data as soon
as we have it.
when object KMS is configured with auto-encryption,
there were issues when using docker registry -
this has been left unnoticed for a while.
This PR fixes an issue with compatibility.
Additionally also fix the continuation-token implementation
infinite loop issue which was missed as part of #9939
Also fix the heal token to be generated as a
client-facing value instead of what is remembered by the
server; this allows the server to be stateless
regarding the token's behavior.
When manual healing is triggered, one node in a cluster will
become the authority to heal. mc regularly sends new requests
to fetch the status of the ongoing healing process, but a load
balancer could land the healing request on a node that is not
doing the healing.
This PR will redirect a request to the node based on the node
index found described as part of the client token. A similar
technique is also used to proxy ListObjectsV2 requests
by encoding this information in continuation-token
The S3 specification says that versions are ordered in the response of
list object versions.
mc snapshot needs this to know which version comes first especially when
two versions have the same exact last-modified field.
Readiness has no reason to be cluster scope
because that is not how k8s networking works
for pods; all the pods in a deployment do not
share the network as a singleton. Instead they
run as local scopes to themselves; with
readiness failures the pod is potentially taken
out of the network to be resolvable - this
affects the distributed setup in a myriad of
different ways.
Instead readiness should behave like liveness
with local scope alone, and should be a dummy
implementation.
This PR dramatically improves all the startup times
and the overall k8s startup time.
Added another handler called `/minio/health/cluster`
to understand cluster-scope health.
Walk() functionality was missing on gateway
implementations leading to missing functionality
for the browser UI such as remove multiple objects,
download as zip file etc.
This PR brings a generic implementation across
all gateways; it is not required to repeat the
same code in every gateway.
The default behavior is to cache each range requested
to cache drive. Add an environment variable
`MINIO_RANGE_CACHE` - when set to off, it disables
range caching and instead downloads entire object
in the background.
Fixes#9870
Bonus fix: during the versioning merge one of the PRs was missing
the offline/online disk count fix from #9801; port it correctly
over to the master branch from release.
Additionally, add versionID support for MRF
Fixes#9910 Fixes#9931
This PR has the following changes
- Removing duplicate lookupConfigs() calls.
- Deprecate admin config APIs for NAS gateways. This will avoid repeated reloads of the config from the disk.
- WatchConfigNASDisk will be removed
- Migration guide for NAS gateways users to migrate to ENV settings.
NOTE: THIS PR HAS A BREAKING CHANGE
Fixes#9875
Co-authored-by: Harshavardhana <harsha@minio.io>
Looking into full disk errors on zoned setup. We don't take the
5% space requirement into account when selecting a zone.
The interesting part is that even considering this we don't
know the size of the object the user wants to upload when
they do multipart uploads.
It seems quite defensive to always upload multiparts to
the zone where there is the most space since all load will
be directed to a part of the cluster.
In these cases we make sure it can at least hold a 1GiB file
and we disadvantage fuller zones more by subtracting the
expected size before weighing.
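A sketch of that weighing (simplified to a deterministic pick; the real selection weighs zones by free space):
```
package zones

// selectZone subtracts the expected object size from each zone's
// available space and requires headroom for at least a 1GiB file, so
// fuller zones are disadvantaged when the true multipart size is
// unknown. Returns -1 when no zone qualifies.
func selectZone(available []uint64, expectedSize uint64) int {
	const minHeadroom = 1 << 30 // 1 GiB
	best, bestFree := -1, uint64(0)
	for i, free := range available {
		if free < expectedSize+minHeadroom {
			continue // cannot safely hold the object
		}
		if adj := free - expectedSize; adj > bestFree {
			best, bestFree = i, adj
		}
	}
	return best
}
```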
- when x-amz-storage-class is specified, CopyObject
should proceed regardless; it's not a precondition
- when sourceVersionID is specified, CopyObject should
proceed regardless; it's not a precondition
This PR fixes all the below scenarios
and handles them correctly.
- existing data/bucket is replaced with
new content; with no versioning enabled the old
structure vanishes.
- existing data/bucket - enable versioning
before uploading any data, once versioning
enabled upload new content, old content
is preserved.
- suspend versioning on the bucket again, now
upload content again the old content is purged
since that is the default "null" version.
Additionally sync data after xl.json -> xl.meta
rename(), to avoid any surprises if there is a
crash during this rename operation.
- Add changes to ensure remote disks are not
incorrectly taken online if their order has
changed or are incorrect disks.
- Bring changes to peer to detect disconnection
with a separate Health handler, to avoid a
rather expensive GetLocalDiskIDs() call
- Follow up on the same changes for Lockers
as well
Just like the GET/DELETE APIs it is possible to preserve
client-supplied versionIds; of course the versionIds
have to be UUIDs. If an existing versionId is found
it is overwritten if no object locking policies
are found.
- PUT /bucketname/objectname?versionId=<id>
- POST /bucketname/objectname?uploads=&versionId=<id>
- PUT /bucketname/objectname?versionId=<id> (with x-amz-copy-source)
Fixes potentially infinite allocations, especially in FS mode,
since lookups live up to 30 minutes. Limit walk pool sizes to 50
max parameter entries and 4 concurrent operations with the same
parameters.
Fixes#9835
PutObject on multiple zones with versioning would not
overwrite the correct location of the object if the
object has a delete marker, leading to duplicate objects
on two zones.
This PR fixes this by adding affinity towards the delete marker:
when GetObjectInfo() returns an error, use the zone index
which has the delete marker.
Bonus change to use channel to serialize triggers,
instead of using atomic variables. More efficient
mechanism for synchronization.
Co-authored-by: Nitish Tiwari <nitish@minio.io>
In the current bug we were re-using the context
from previously granted lockers; this would
lead to lock timeouts for existing valid
read or write locks, leading to premature
timeout of locks.
This bug affects only local lockers in FS
or standalone erasure coded mode. This issue
is rather historical as well and was present
in lsync for some time, but we were lucky
not to see it.
Similar changes are done in dsync as well
to keep the code more familiar
Fixes#9827
When updating all servers following the instructions of mc update,
only the endpoint server will be updated successfully.
All the other peer servers' updates failed due to the error below:
--------------------------------------------------------------------------
parsing time "2006-01-02T15:04:05Z07:00" as "<release version>": cannot parse "-01-02T15:04:05Z07:00" as "0-"
--------------------------------------------------------------------------
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
Historically, due to lack of support for middlewares
we ended up writing wrapped handlers for all
middlewares on top of gorilla/mux; this causes
multiple issues when we want to, let's say
- Overload r.Body with some custom implementation
to track the incoming Reads()
- Add other sorts of top-level checks to avoid
DDOSing the server with large incoming HTTP
bodies.
Since the 1.7.x releases gorilla/mux provides proper
support for middlewares, which are honored by the muxer
directly. This makes sure that Go can honor its
own internal ServeHTTP(w, r) implementation where
Go net/http can wrap in its own custom readers.
This PR as a side effect fixes rare issues of client
hangs which were reported in the wild but never really
understood or fixed in our codebase.
Fixes#9759 Fixes#7266 Fixes#6540 Fixes#5455 Fixes#5150
Refer https://github.com/boto/botocore/pull/1328 for
one variation of the same issue in #9759
PR #9801, while correct, changed the loop in isEndpointConnected()
to rely on endpoint.String(), which has the host
information as well; that is not the correct value as input to
detect if the disk is down or up. If the endpoint is local, use
its local path value instead.
This commit changes the data key generation such that
if a MinIO server/nodes tries to generate a new DEK
but the particular master key does not exist - then
MinIO asks KES to create a new master key and then
requests the DEK again.
From now on, a SSE-S3 master key must not be created
explicitly via: `kes key create <key-name>`.
Instead, it is sufficient to just set the env. var.
```
export MINIO_KMS_KES_KEY_NAME=<key-name>
```
However, the MinIO identity (mTLS client certificate)
must have the permission to access the `/v1/key/create/`
API. Therefore, KES policy for MinIO must look similar to:
```
[
/v1/key/create/<key-name-pattern>
/v1/key/generate/<key-name-pattern>
/v1/key/decrypt/<key-name-pattern>
]
```
However, in our guides we already suggest that.
See e.g.: https://github.com/minio/kes/wiki/MinIO-Object-Storage#kes-server-setup
***
The ability to create master keys on request may also be
necessary / useful in case of SSE-KMS.
Current code was relying on globalEndpoints as
a secondary source of truth to obtain
the missing endpoints list when the disk
is offline; this is problematic
- there is no way to know if the getDisks()
returned endpoints total is the same as the
list of globalEndpoints and belongs
to a particular set.
- there is no order guarantee; as getDisks()
is ordered as per format.json and globalEndpoints
may not be, we could potentially end up including
incorrect endpoints.
To fix this bring getEndpoints() just like getDisks()
to ensure that consistently ordered endpoints are
always available for us to ensure that returned values
are consistent with what each erasure set would observe.
Uploading files with names that could not be written to disk
would result in "reduce your request" errors returned.
Instead check explicitly for disallowed characters and reject
files with `Object name contains unsupported characters.`
At a customer setup with lots of concurrent calls
it can be observed that in newRetryTimer there
were lots of tiny allocations which are not
relinquished upon retries; in this codepath
we were only interested in re-using the timer
and using it wisely for each locker.
```
(pprof) top
Showing nodes accounting for 8.68TB, 97.02% of 8.95TB total
Dropped 1198 nodes (cum <= 0.04TB)
Showing top 10 nodes out of 79
flat flat% sum% cum cum%
5.95TB 66.50% 66.50% 5.95TB 66.50% time.NewTimer
1.16TB 13.02% 79.51% 1.16TB 13.02% github.com/ncw/directio.AlignedBlock
0.67TB 7.53% 87.04% 0.70TB 7.78% github.com/minio/minio/cmd.xlObjects.putObject
0.21TB 2.36% 89.40% 0.21TB 2.36% github.com/minio/minio/cmd.(*posix).Walk
0.19TB 2.08% 91.49% 0.27TB 2.99% os.statNolog
0.14TB 1.59% 93.08% 0.14TB 1.60% os.(*File).readdirnames
0.10TB 1.09% 94.17% 0.11TB 1.25% github.com/minio/minio/cmd.readDirN
0.10TB 1.07% 95.23% 0.10TB 1.07% syscall.ByteSliceFromString
0.09TB 1.03% 96.27% 0.09TB 1.03% strings.(*Builder).grow
0.07TB 0.75% 97.02% 0.07TB 0.75% path.(*lazybuf).append
```
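The fix amounts to allocating one timer per retry loop and Reset-ing it between attempts; a sketch of the reuse pattern (the loop shape is illustrative):
```
package retry

import "time"

// retryLoop reuses a single timer across attempts instead of calling
// time.NewTimer (or time.After) on every iteration, which is what the
// profile above shows accumulating.
func retryLoop(attempts int, delay time.Duration, fn func() bool) {
	t := time.NewTimer(delay)
	defer t.Stop()
	for i := 0; i < attempts; i++ {
		if fn() {
			return
		}
		// Stop and drain before Reset, as required when the timer
		// may already have fired.
		if !t.Stop() {
			select {
			case <-t.C:
			default:
			}
		}
		t.Reset(delay)
		<-t.C
	}
}
```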
For example a `{1...17}/{1...52}` symmetrical
distribution of drives cannot be obtained
- because 17 is a prime number
- it is not divisible by any pre-defined setCounts, i.e.
from 1 to 16
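The symmetry constraint is essentially divisibility; a sketch of the check (the range 2..16 mirrors the pre-defined setCounts):
```
package erasure

// greatestSetCount returns the largest supported set size that evenly
// divides the per-node drive count; for a prime count such as 17
// nothing in 2..16 divides it, so no symmetric distribution exists.
func greatestSetCount(drives int) int {
	for n := 16; n >= 2; n-- {
		if drives%n == 0 {
			return n
		}
	}
	return 0 // no symmetric layout possible
}
```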
Manual healing (like background healing) creates a heal task with a
possibility to override healing options, such as deep or normal mode.
Use a pointer type in heal opts so nil would mean use the default
healing options.
aws cli fails to set a bucket encryption configuration on the MinIO server.
The reason is that aws cli does not send the Content-MD5 header. It seems
that Content-MD5 is not required anymore.
This commit also returns a Not Implemented error early to help mint tests
to ignore testing this API in gateway modes.
CopyObject was not correctly figuring out the correct
destination object location and would end up creating
duplicate objects on two different zones, reproduced
by doing encryption based key rotation.
Advantages: avoids 100's of stats which are needed for each
upload operation in FS/NAS gateway mode when uploading a large
multipart object; dramatically increases performance for
multipart uploads by avoiding recursive calls.
For other gateways this simplifies the approach, since the
azure, gcs, and hdfs gateways don't capture any specific
metadata during upload which needs handler validation
for encryption/compression.
Erasure coding was already optimized; this additionally
just avoids small allocations of a large data structure.
Fixes#7206
GetDiskID() in storage rest client does not really issue a REST request
to the remote disk, but returns an in-memory value instead.
However, GetDiskID() should return an error when format.json is not
found or for other similar issues (unmounted disks, etc..)
GetDiskID() is only called when formatting disks and getting storage
information, hence this commit should not cause a performance degradation.
Additionally also fix STS logs to filter out the LDAP
password from being sent out in audit logs.
Bonus fix: handle the reload of users properly by
making sure to preserve the newer users during the
reload so they are not invalidated.
Fixes#9707 Fixes#9644 Fixes#9651
Bonus fixes in quota enforcement to use the
new datastructure and use timedValue to cache
a value/reload automatically; avoids one more
global variable.
If the requested server is part of the set this will always read
from the local disk, even if the disk contains a parity shard.
In default setup there is a 50% chance that at least
one shard that otherwise would have been fetched remotely
will be read locally instead.
It basically trades RPC call overhead for reed-solomon.
On distributed localhost this seems to be fairly break-even,
with a very small gain in throughput and latency.
However on networked servers this should be a bigger win.
1MB objects, before:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 76257:
* Avg: 25ms 50%: 24ms 90%: 32ms 99%: 42ms Fastest: 7ms Slowest: 67ms
* First Byte: Average: 23ms, Median: 22ms, Best: 5ms, Worst: 65ms
Throughput:
* Average: 1213.68 MiB/s, 1272.63 obj/s (59.948s, starting 14:45:44 CEST)
```
After:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 78845:
* Avg: 24ms 50%: 24ms 90%: 31ms 99%: 39ms Fastest: 8ms Slowest: 62ms
* First Byte: Average: 22ms, Median: 21ms, Best: 6ms, Worst: 57ms
Throughput:
* Average: 1255.11 MiB/s, 1316.08 obj/s (59.938s, starting 14:43:58 CEST)
```
Bonus fix: Only ask for heal once on an object.
This value is requested on every upload when there are multiple zones.
Since this will result in an RPC call to every remote disk this scales
quite badly in a distributed setup. Instead, load it once per 1-second interval.
2 servers, localhost only. In large distributed setups much bigger
gains can be expected.
```
Operations: 21743 -> 22454
* Average: +3.28% (+0.0 MiB/s) throughput, +3.28% (+11.9) obj/s
* Fastest: +3.37% (+0.0 MiB/s) throughput, +3.37% (+13.0) obj/s
* 50% Median: +3.03% (+0.0 MiB/s) throughput, +3.03% (+11.2) obj/s
* Slowest: +8.03% (+0.0 MiB/s) throughput, +8.03% (+22.8) obj/s
```
For easy management of this a generic helper has been added.
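That generic helper can be sketched as a mutex-guarded value with a TTL and an update callback (field names are illustrative):
```
package cache

import (
	"sync"
	"time"
)

// TimedValue caches a value and refreshes it via Update at most once
// per TTL, so a hot path (like the per-upload usage check) triggers
// one RPC per interval instead of one per request.
type TimedValue struct {
	Update func() (interface{}, error)
	TTL    time.Duration

	mu      sync.Mutex
	value   interface{}
	lastSet time.Time
}

// Get returns the cached value while it is fresh, refreshing it
// through Update once the TTL has elapsed.
func (t *TimedValue) Get() (interface{}, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.value != nil && time.Since(t.lastSet) < t.TTL {
		return t.value, nil
	}
	v, err := t.Update()
	if err != nil {
		return nil, err
	}
	t.value, t.lastSet = v, time.Now()
	return v, nil
}
```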
some clients such as veeam expect the x-amz-meta to
be sent in lower-cased form; while this does indeed
defeat the HTTP protocol contract, it is harder to
change these applications, until these applications
get fixed appropriately in the future.
x-amz-meta is usually sent in lowercased form
by AWS S3 and some applications like veeam
incorrectly end up relying on the case sensitivity
of the HTTP headers.
Bonus fixes
- Fix the iso8601 time format to keep it same as
AWS S3 response
- Increase maxObjectList to 50,000 and use
maxDeleteList as 10,000 whenever multi-object
deletes are needed.
No one really uses FS for large scale accounting
usage, nor do we crawl in NAS gateway mode. It is
worthwhile to simply disable this feature as it's
not useful for anyone.
Bonus: disable bucket quota ops as well in FS
and gateway mode
size calculation in the crawler was using the real size
of the object instead of its actual size, i.e. either
the decrypted or uncompressed size.
this is needed to make sure all other accounting
such as bucket quota and the mcs UI display the
correct values.
This PR adds a new configuration parameter which allows readiness
check to respond within 10secs, this can be reduced to a lower value
if necessary using
```
mc admin config set api ready_deadline=5s
```
or
```
export MINIO_API_READY_DEADLINE=5s
```
net/http exposes ErrorLog, but it is a log.Logger
instance, not an interface which can be overridden;
for this reason the logging is sometimes
interleaved with TLS messages like this on the
server
```
http: TLS handshake error from 139.178.70.188:63760: EOF
```
This is a bit problematic for us as we need a
consistent logging view to honor the --json or --quiet
flags.
With this PR we ensure that this format is adhered to.
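Since a log.Logger ultimately writes to an io.Writer, its output can be funneled through a custom writer; a minimal sketch of the idea (names and the JSON shape here are illustrative, not the actual MinIO code):
```
package main

import (
	"log"
	"net/http"
	"strings"
)

// consoleWriter adapts net/http's internal error output to the
// application's own logging, keeping --json/--quiet views consistent.
type consoleWriter struct{}

func (consoleWriter) Write(p []byte) (int, error) {
	msg := strings.TrimSpace(string(p))
	// Emit in the application's own format, e.g. structured JSON.
	log.Printf(`{"level":"error","source":"net/http","message":%q}`, msg)
	return len(p), nil
}

func main() {
	srv := &http.Server{
		Addr: ":9000",
		// ErrorLog is a concrete *log.Logger, so the only hook point
		// is the io.Writer it is constructed with.
		ErrorLog: log.New(consoleWriter{}, "", 0),
	}
	_ = srv.ListenAndServe()
}
```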
Groups information shall now be stored as part of the
credential data structure; this is a more idiomatic
way to support large LDAP groups.
This avoids the complication of setups where LDAP groups
can be in the range of 150+, which may lead to an HTTP
header size in excess of 8KiB; to reduce such occurrences
we shall save the group information on the server as
part of the credential data structure.
Bonus change: support multiple mapped policies across
all types of users.
This PR is a continuation from #9586; now the
entire parsing logic is fully merged into the
bucket metadata sub-system. Simplify the
quota API further by reducing the remove-quota
handler implementation.
Shuffling the arguments that we pass to the MinIO server is supported. However,
when that happens, Prometheus returns wrong information about disk usage
and online/offline status.
The commit fixes the issue by avoiding reliance on xl.endpoints since
it is not ordered.
This is a major overhaul, migrating all
bucket-metadata-related configs into a single
object '.metadata.bin'; this allows for faster
bootup across thousands of buckets, as well
as keeping the code simple enough for future
work and additions.
Additionally also fixes #9396, #9394
To avoid this issue with refCounter, refactor the code
such that
- locker() always increases refCount upon success
- unlocker() always decrements refCount upon success
(as a special case, removes the resource if the
refCount is zero)
With these two invariants we can see that we
are never granted two write lockers in any situation.
Thanks to @vcabbage for writing a nice reproducer.
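A toy sketch of the two invariants above (not the actual MinIO locker):
```
package main

import (
	"fmt"
	"sync"
)

type lockRef struct {
	refCount int
	writer   bool
}

type lockMap struct {
	mu        sync.Mutex
	resources map[string]*lockRef
}

// locker increments refCount only upon success.
func (m *lockMap) locker(resource string, write bool) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	ref, ok := m.resources[resource]
	if !ok {
		m.resources[resource] = &lockRef{refCount: 1, writer: write}
		return true
	}
	if write || ref.writer {
		return false // writers never share; refCount untouched on failure
	}
	ref.refCount++ // one more reader
	return true
}

// unlocker decrements refCount and, as the special case, removes the
// resource entirely once the count reaches zero.
func (m *lockMap) unlocker(resource string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ref, ok := m.resources[resource]; ok {
		ref.refCount--
		if ref.refCount == 0 {
			delete(m.resources, resource)
		}
	}
}

func main() {
	m := &lockMap{resources: make(map[string]*lockRef)}
	fmt.Println(m.locker("bucket/object", true)) // true
	fmt.Println(m.locker("bucket/object", true)) // false: never two writers
	m.unlocker("bucket/object")
	fmt.Println(m.locker("bucket/object", true)) // true again
}
```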
Enable linting using golangci-lint across the
codebase to run a bunch of linters together;
we shall enable new linters as we fix more
things in the codebase.
This PR fixes the first stage of this
cleanup.
There is a disparity of behavior between Linux & Windows about
the returned error when trying to rename a non-existent path.
err := os.Rename("/path/does/not/exist", "/tmp/copy")
Linux:
isSysErrNotDir(err) = false
os.IsNotExist(err) = true
Windows:
isSysErrNotDir(err) = true
os.IsNotExist(err) = true
ENOTDIR in Linux is returned when the destination path
of the rename call contains a file in one of the middle
segments of the path (e.g. /tmp/file/dst, where /tmp/file
is an actual file, not a directory).
However, as shown above, Windows has more scenarios in which
it returns ENOTDIR, for example when the source path contains
a non-existent directory in its path.
In that case, we want errFileNotFound returned and not
errFileAccessDenied, so this commit adds a further check to close
the disparity between Windows & Linux.
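A sketch of such a check, testing os.IsNotExist before the ENOTDIR check so Windows agrees with Linux; the sentinel errors and helper names mirror the surrounding code, but this is an illustration, not the actual patch:
```
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

var (
	errFileNotFound     = errors.New("file not found")
	errFileAccessDenied = errors.New("file access denied")
)

// isSysErrNotDir reports whether err wraps ENOTDIR.
func isSysErrNotDir(err error) bool {
	return errors.Is(err, syscall.ENOTDIR)
}

// renameAll sketches the added check: a missing source path maps to
// errFileNotFound on both platforms before ENOTDIR is considered.
func renameAll(src, dst string) error {
	err := os.Rename(src, dst)
	switch {
	case err == nil:
		return nil
	case os.IsNotExist(err):
		return errFileNotFound
	case isSysErrNotDir(err):
		// Reached on Linux only when a middle segment of the
		// destination path is a regular file, e.g. /tmp/file/dst.
		return errFileAccessDenied
	default:
		return err
	}
}

func main() {
	fmt.Println(renameAll("/path/does/not/exist", "/tmp/copy"))
}
```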
The `ioutil.NopCloser(reader)` was hiding nested hash readers.
We make it an `io.Closer` so it can be attached without wrapping
and allows for nesting, by merging the requests.
`keepHTTPResponseAlive` would cause errors to be
returned with status OK.
- Add '32' as a filler byte until a response is ready
- '0' to indicate the response is ready to be consumed
- '1' to indicate the response has an error which needs
to be returned to the caller
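A rough sketch of the framing (illustrative, not the exact wire code): filler bytes flow while the operation runs, then a single status byte decides how the rest of the stream is interpreted:
```
package main

import (
	"fmt"
	"io"
	"time"
)

// keepAlive emits filler byte 32 while work is in progress, then 0 for
// success, or 1 followed by the error text, so a stream that already
// sent "200 OK" can still carry a late failure.
func keepAlive(w io.Writer, done <-chan error) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			if err != nil {
				w.Write([]byte{1}) // error marker, error body follows
				io.WriteString(w, err.Error())
			} else {
				w.Write([]byte{0}) // response ready marker
			}
			return
		case <-ticker.C:
			w.Write([]byte{32}) // filler: keeps the connection alive
		}
	}
}

type hexWriter struct{}

func (hexWriter) Write(p []byte) (int, error) {
	fmt.Printf("% x ", p)
	return len(p), nil
}

func main() {
	done := make(chan error, 1)
	go func() {
		time.Sleep(35 * time.Millisecond) // simulate slow disk work
		done <- nil
	}()
	keepAlive(hexWriter{}, done)
	fmt.Println()
}
```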
Clear out 'file not found' errors from dir walker, since it may be
in a folder that has been deleted since it was scanned.
This PR is to ensure that we call the relevant object
layer APIs for necessary S3-API-level functionality,
allowing gateway implementations to return proper
errors as NotImplemented{}.
This allows all our tests in mint to behave
and be handled appropriately as well.
S3 is now natively supported by the B2 cloud storage provider;
there is no reason to use the specialized gateway for B2 anymore,
our current S3 gateway with caching works with B2.
Resolves #8584
Requests in federated setups for STS-type calls which are
performed at the '/' resource should be routed by the muxer;
the assumption is simply that requests without a bucket
in a federated setup cannot be proxied, so serve them at the
current server.
This commit makes the KES client use HTTP/2
when establishing a connection to the KES server.
This is necessary since the next KES server release
will require HTTP/2.
We should allow quorum errors to be sent upwards
such that the caller can retry while reading bucket
encryption/policy configs when the server is starting
up; this allows distributed setups to load the
configuration properly.
The current code didn't facilitate this and would have
never loaded the actual configs during rolling
server restarts.
In large setups this avoids unnecessary data transfer
across nodes and potential locks.
This PR also optimizes the heal result channel, whose
creation should be avoided for each queueHealTask as it's expensive
to create/close channels for a large number of objects.
This PR allows setting a "hard" or "fifo" quota
restriction at the bucket level. Buckets that
have reached the configured FIFO quota will
automatically be cleaned up in FIFO manner until
bucket usage drops to the configured quota.
If a bucket is configured with a "hard" quota
ceiling, all further writes are disallowed.
ResponseWriter & RecordAPIStats have similar roles; merge them.
This commit also fixes wrong auditing for STS, Web and others,
since they were using ResponseWriter instead of RecordAPIStats.
A user can incorrectly mount a fresh disk. MinIO will detect
that it is writing to the root filesystem and will mark the disk down. However,
it is hard for the user to understand what's going on.
This commit just prints a notice so it is easy to spot
such a use case.
- elasticsearch client should rely on the SDK helpers
instead of pure HTTP calls.
- webhook shouldn't need to check IsActive() for
all notifications; failure should be delayed.
- Remove DialHTTP as it's never used properly
Fixes #9460
Allow generating service accounts for temporary credentials
which have a designated parent; currently OpenID is not yet
supported.
Added checks to ensure that a service account cannot generate
further service accounts for itself; service accounts can
never be a parent to any credential.
Audit was not working properly when enabled from the environment,
caused by a typo in the code.
This commit fixes that, but also considers the following variables:
`MINIO_LOGGER_WEBHOOK_ENABLE_*` and
`MINIO_AUDIT_WEBHOOK_ENABLE_*`, so the user can use
the latter to temporarily disable a logger or audit configuration.
The data usage tracker and crawler seem to be logging
non-actionable information on the console, which is not
useful and is fixed on its own in almost all deployments;
let's keep this logging to a minimum.
It is possible in many scenarios that even
if the divisible value is optimal, we may
end up with uneven distribution due to the number
of nodes present in the configuration.
Added code to allow for affinity towards various
ellipses, to figure out the optimal value across
ellipses such that we can always reach a
symmetric value automatically.
Fixes #9416
By monitoring PUT/DELETE and heal operations it is possible
to track changed paths and keep a bloom filter for this data.
This can help prioritize paths to scan. The bloom filter can identify
paths that have not changed, and the few collisions will only result
in a marginal extra workload. This can be implemented at
bucket + (1 prefix level) granularity with reasonable performance.
The bloom filter is set to have a false positive rate of 1% at 1M
entries. A bloom table of this size is about ~2500 bytes when serialized.
To not force a full scan of all paths that have changed, `cycle` bloom
filters would need to be kept, so we guarantee that dirty paths have
been scanned within `cycle` runs. Until `cycle` bloom filters have been
collected, all paths are considered dirty.
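For illustration, a minimal standalone bloom filter over changed paths using double hashing; parameters and names are illustrative, not MinIO's actual implementation:
```
package main

import (
	"fmt"
	"hash/fnv"
)

// pathBloom: m bits, k probe positions per path via double hashing.
type pathBloom struct {
	bits []uint64
	m, k uint64
}

func newPathBloom(m, k uint64) *pathBloom {
	return &pathBloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *pathBloom) hashes(path string) (h1, h2 uint64) {
	h := fnv.New64a()
	h.Write([]byte(path))
	h1 = h.Sum64()
	h.Write([]byte{0xff}) // cheap derived second hash
	h2 = h.Sum64() | 1    // odd step for double hashing
	return
}

// Add records a changed path (called from PUT/DELETE/heal tracking).
func (b *pathBloom) Add(path string) {
	h1, h2 := b.hashes(path)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

// MayHaveChanged is what the crawler consults: false means the path
// definitely did not change; true means it must be scanned.
func (b *pathBloom) MayHaveChanged(path string) bool {
	h1, h2 := b.hashes(path)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	bf := newPathBloom(1<<23, 7) // sizing here is illustrative only
	bf.Add("bucket/prefix")
	fmt.Println(bf.MayHaveChanged("bucket/prefix")) // true: scan it
	fmt.Println(bf.MayHaveChanged("bucket/other"))  // almost surely false: skip
}
```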
This commit avoids lots of tiny allocations and repeated
channel creations which are performed when filtering
the incoming events, and unescaping a key just for matching.
Also remove deprecated code which is not needed
anymore, avoiding unexpected data structure transformations
from map to slice.
We have a policy available for sub-admin users to set/get/delete
config, but we incorrectly decrypt the content using the admin secret
key, when in fact it should be the credential authenticating the
request.
Global WORM mode is a complex piece whose
time has passed; with the advent of the S3-compatible
object locking and retention implementation, global
WORM is sort of deprecated. This has been mentioned
in our documentation for some time; now the time
has come for this to go.
Re-implement the cache purging routine to
avoid using ioutil.ReadDir, which can lead
to high allocations when there are cache
directories with lots of content, or
when the cache is installed in memory-constrained
environments.
Instead rely on a callback function where we
use no more than 8KiB of memory per
cycle.
Precursor for this change: refer to #9425; original
issue pointed out by Caleb Case <caleb@storj.io>
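A minimal sketch of the callback approach, assuming a hypothetical readDirFn helper built on Readdirnames batches:
```
package main

import (
	"fmt"
	"io"
	"os"
)

// readDirFn visits entries in small fixed batches, so memory stays
// bounded instead of ioutil.ReadDir materializing the whole directory.
func readDirFn(dirPath string, fn func(name string) error) error {
	d, err := os.Open(dirPath)
	if err != nil {
		return err
	}
	defer d.Close()
	for {
		names, err := d.Readdirnames(256) // small batch per cycle
		for _, name := range names {
			if ferr := fn(name); ferr != nil {
				return ferr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	_ = readDirFn("/tmp", func(name string) error {
		fmt.Println(name) // e.g. purge the entry here if expired
		return nil
	})
}
```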
The OSS Go SDK lacks licensing terms in its
repository, and there has been no activity
on the issue here: https://github.com/aliyun/aliyun-oss-go-sdk/issues/245
This PR is to ensure we remove any dependency code which
lacks an explicit license file in its repo.
- keep long-running OBD network tests alive
- fix error - wrong number of parents in process OBD info
- ensure that osinfo does not error out when inside containers
- remove the limit on the max number of connections per client transport
The generic client transport uses a default limit of 64 conns per transport.
This could end up limiting and throttling usage, artificially slowing
down the performance of MinIO even on hardware capable of doing better.
The new value defaults to 100K events,
but users can tune this value up to any value
they deem necessary.
* increase the limit to maxint64 while validating
This PR also fixes issues where
deletePolicy and deleteUser are idempotent, which can lead to
issues when a client prematurely times out; the error response
of such a retried call should be ignored when the call returns
http.StatusNotFound.
Fixes #9347
Allow `mc admin config set mygateway/ audit_webhook --env`
to fetch the documentation as needed; this is just to
ensure that our users can still access the relevant
ENV docs while running in gateway mode.
Instead of GlobalContext, use a local context for tests.
Most notably this allows things created to be shut down
when the tests using them are done. After PRs #9345 and #9331, CI is
often running out of memory/time.
Add two new configuration entries, api.requests-max and
api.requests-deadline, which have the same role as
MINIO_API_REQUESTS_MAX and MINIO_API_REQUESTS_DEADLINE.
This PR fixes a couple of behaviors with service accounts:
- no need to have a session token for service accounts
- service accounts can be generated by any user for themselves
implicitly, with a valid signature.
- policy input for the AddNewServiceAccount API is now fully typed,
allowing for validation before it is sent to the server.
- also bring in additional context for admin API errors, if any,
when replying back to the client.
- deprecate the GetServiceAccount API as we do not need to reply
back session tokens
- Introduced a function `FetchRegisteredTargets` which will return
a complete set of registered targets irrespective of their states,
if the `returnOnTargetError` flag is set to `False`
- Refactor the NewTarget functions to return non-nil targets
- Refactor GetARNList() to return a complete list of configured targets
- Removes the PerfInfo admin API as it's not OBDInfo
- Keep the drive path without the metaBucket in the OBD
global latency map.
- Remove all the unused code related to the PerfInfo API
- Do not redefine the global mib, gib constants; use
humanize.MiByte and humanize.GiByte instead. If
needed, use --no-compat to disable md5sum while
verifying any performance numbers.
Bring back --compat behavior as the default to avoid
additional documentation and confusing behavior;
as we are working towards improving md5sum to
be faster with AVX instructions, enabling this
should hardly be a problem in future versions
of MinIO.
Fixes #8012, fixes #7859, fixes #7642
Continuing from previous PR #9304: comment
is a special key that is not present in the
default KV list. Add it explicitly when
tokenizing fields, as it is possible that
some clients might try to set comments.
This PR adds context-based `k=v` splits based
on the sub-system which was obtained; if
keys are not provided an error will be thrown
during parsing, and if keys are provided with wrong
values an error will be thrown. Keys can now
have values which are of a much more complex
form, such as `k="v=v"` or `k=" v = v"`
and other variations.
Additionally, deprecate unnecessary postgres/mysql
configuration styles and support only
- connection_string for Postgres
- dsn_string for MySQL
All other parameters are removed.
This commit fixes a performance issue caused
by too many calls to the external KMS - i.e.
for single-part PUT requests.
In general, the issue is caused by a sub-optimal
code structure. In particular, when the server
encrypts an object it requests a new data encryption
key from the KMS. With this key it does some key
derivation and encrypts the object content and
ETag.
However, to be S3-compatible the MinIO server
has to return the plaintext ETag to the client
in the case of SSE-S3.
Therefore, the server code used to decrypt the
(previously encrypted) ETag again by requesting
the data encryption key (KMS decrypt API) from
the KMS.
This leads to 2 KMS API calls (1 generate key and
1 decrypt key) per PUT operation - while only
one KMS call is necessary.
This commit fixes this by fetching a data key only
once from the KMS and keeping the derived object
encryption key around (for the lifetime of the request).
This leads to a significant performance improvement
w.r.t. to PUT workloads:
```
Operation: PUT
Operations: 161 -> 239
Duration: 28s -> 29s
* Average: +47.56% (+25.8 MiB/s) throughput, +47.56% (+2.6) obj/s
* Fastest: +55.49% (+34.5 MiB/s) throughput, +55.49% (+3.5) obj/s
* 50% Median: +58.24% (+32.8 MiB/s) throughput, +58.24% (+3.3) obj/s
* Slowest: +1.83% (+0.6 MiB/s) throughput, +1.83% (+0.1) obj/s
```
Fixes #8667
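A sketch of the single-round-trip flow, with a stand-in for the KMS call; all names here are hypothetical, not the server's actual code:
```
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// generateDataKey stands in for the KMS generate-key API: it returns a
// plaintext data key plus a sealed copy to store with the object.
// A real KMS client does this in one remote call.
func generateDataKey() (plaintext, sealed []byte, err error) {
	plaintext = make([]byte, 32)
	if _, err = rand.Read(plaintext); err != nil {
		return nil, nil, err
	}
	sealed = append([]byte("sealed:"), plaintext...) // placeholder sealing
	return plaintext, sealed, nil
}

func main() {
	// One KMS round trip per PUT: fetch the data key once...
	dataKey, sealedKey, err := generateDataKey()
	if err != nil {
		panic(err)
	}
	// ...derive the object encryption key locally (cheap, no KMS call)...
	mac := hmac.New(sha256.New, dataKey)
	mac.Write([]byte("bucket/object"))
	objectKey := mac.Sum(nil)

	// ...and keep objectKey for the lifetime of the request, so the
	// plaintext ETag never needs a second KMS decrypt call.
	fmt.Printf("stored sealed key (%d bytes), derived key %x...\n",
		len(sealedKey), objectKey[:4])
}
```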
In addition to the above, if the user is mapped to a policy or
belongs to a group, the user-info API returns this information;
otherwise, the API will now return a non-existent user error.
Make the rest of the Walk() function more predictable;
it was observed that in nominal deployments, even
without much workload, the drives are generally
slow to respond to readdir operations. With the
sleepDuration factor of 10 this can cause
unexpected slowness in Listing calls; while
it is good for all other I/O, it may simply slow
down Listing immensely, which is not useful.
Fixes #9261
In FS mode under Windows, removing an object will not automatically
remove parent empty prefixes.
The reason is that path.Dir() was used; however, filepath.Dir() is
more appropriate since filepath is physical (meaning it operates
on OS filesystem paths).
This was not caught because failures on Windows CI are not caught.
fs-v1 in server mode only checks to see if the path exists, so it
returns ready before it is indeed ready.
This change adds a check to ensure that the global object API is
available too before reporting ready.
Fixes #9283
It is sometimes common and convenient to use
just local IPs for testing purposes; 127.0.0.x
are special IPs which, regardless of being available on
an interface, can be bound to on all operating
systems.
Allow this behavior to work for minio server.
Fixes #9274
Also, bring in an additional policy check to ensure that
force-deleting a bucket is only allowed with the right
policy for the user; just the DeleteBucketAction
policy action is not enough.
This PR also tries to simplify the approach taken in
the object-locking implementation by preferential treatment
given towards full validation.
This in turn has fixed a couple of bugs related to
how policy should have been honored when ByPassGovernance
is provided.
Simplifies the code a bit, but also duplicates code intentionally
for clarity, due to the complex nature of the object locking
implementation.
Too many deployments come up with an odd number
of hosts or drives; to facilitate even distribution
among those setups, allow packs based on odd and
prime numbers.
- B2 does actually return an MD5 hash for newly uploaded objects
so we can use it to provide better compatibility with S3 client
libraries that assume the ETag is the MD5 hash such as boto.
- depends on change in blazer library.
- new behaviour is only enabled if MinIO's --compat mode is active.
- behaviour for multipart uploads is unchanged (works fine as is).
- Implement a graph algorithm to test network bandwidth from every
node to every other node
- Saturate any network bandwidth adaptively, accounting for slow
and fast network capacity
- Implement parallel drive OBD tests
- Implement a paging mechanism for OBD test to provide periodic updates to client
- Implement Sys, Process, Host, Mem OBD Infos
- total number of S3 API calls per server
- maximum wait duration for any S3 API call
This implementation is primarily meant for situations
where HDDs are not capable enough to handle the incoming
workload and there is no way to throttle the client.
This feature allows MinIO server to throttle itself
such that we do not overwhelm the HDDs.
- acquire a single leader lock for all background operations:
healing, crawling and applying lifecycle policies.
- simplify lifecycle to avoid network calls, which was a
bug in the implementation - we should hold a leader and
do everything from there; we have access to the entire
namespace.
- make listing and walking not interfere, by slowing themselves
down like the crawler.
- effectively use the global context everywhere to ensure
proper shutdown in cache, lifecycle, healing
- don't read `format.json` for prometheus metrics in
the StorageInfo() call.
- Add conservative timeouts up to 3 minutes
for internode communication
- Add aggressive timeouts of 30 seconds
for gateway communication
Fixes #9105, fixes #8732, fixes #8881, fixes #8376, fixes #9028
This is to improve responsiveness for all
admin API operations, allowing callers
to cancel any on-going admin operations
if they happen to be waiting too long.
canonicalize the ENVs such that we can bring these ENVs
as part of the config values, as a subsequent change.
- fix location of per-bucket usage to `.minio.sys/buckets/<bucket_name>/usage-cache.bin`
- fix location of the overall usage in `json` at `.minio.sys/buckets/.usage.json`
(avoid conflicts with a bucket named `usage.json`)
- fix location of the overall usage in `msgp` at `.minio.sys/buckets/.usage.bin`
(avoid conflicts with a bucket named `usage.bin`)
As an optimization of healing, HealObjects() avoids sending an
object to the background healing subsystem when the object is
present in all disks.
However, HealObjects() should have checked the scan type: if this
is a deep scan, always pass the object to the healing subsystem.
Currently, tree walking, needed to list objects in a specific
set, quits listing as soon as it finds no entries on a disk, which
is wrong.
This affected background healing, because the latter uses
tree walk directly. If one object does not exist on the first
disk, for example, it will seem as if the object does not
exist at all and no healing work is needed.
This commit fixes the behavior.
The staleness of a lock should be determined by
the quorum number of entries returning stale;
this allows for situations when locks are held
while nodes are down - we don't accidentally
clear locks unintentionally when they are valid
and correct.
Also, lock maintenance should be run by all servers,
not one server; stale lock cleanup needs to run outside
the requirement of holding distributed locks.
Thanks @klauspost for reproducing this issue
Some AWS SDKs quietly rely on this value at times
to calculate the right number of parts during a parallel
GetObject request; this feature is used along with
content-range - we should support this as well.
- avoid setting the last heal activity when starting self-healing;
this can be confusing to users, suggesting that the self-healing
cycle was already performed.
- add info about the next background healing round
An OperationTimedout error occurs when locking
times out while trying to acquire a lock. This
error should be returned appropriately to
the client with HTTP status 408 (Request Timeout).
This translation was broken; fix it.
The bulk delete API was using cleanupObjectsBulk(), which calls the posix
listing and delete APIs to remove an object's internal files in the
backend (xl.json and parts) one by one.
Add DeletePrefixes to the storage API to remove the content
of a directory in a single call.
Also use a removal goroutine for each disk to accelerate removal.
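A sketch of the per-disk fan-out; the storage interface here is a stand-in, not the real storage API:
```
package main

import (
	"fmt"
	"sync"
)

// storage is a minimal stand-in exposing only the call sketched here.
type storage interface {
	DeletePrefixes(volume string, paths []string) error
}

func deleteAll(disks []storage, volume string, paths []string) {
	var wg sync.WaitGroup
	for _, disk := range disks {
		wg.Add(1)
		go func(d storage) { // one removal goroutine per disk
			defer wg.Done()
			if err := d.DeletePrefixes(volume, paths); err != nil {
				fmt.Println("delete failed:", err)
			}
		}(disk)
	}
	wg.Wait() // disks proceed in parallel rather than serially
}

type fakeDisk int

func (d fakeDisk) DeletePrefixes(volume string, paths []string) error {
	fmt.Printf("disk %d: removed %d prefixes under %s\n", d, len(paths), volume)
	return nil
}

func main() {
	deleteAll([]storage{fakeDisk(0), fakeDisk(1)}, "bucket", []string{"obj1/", "obj2/"})
}
```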
Currently the code assumed some orthogonal requirements,
which led to situations where, with a setup of,
let's say, for example 168 drives, the final
set_drive_count chosen was 14. Indeed, 168 drives are
divisible by 12, but this wasn't allowed due to an
unexpected requirement that 12 be a perfect modulo
of 14, which is not possible. This assumption was incorrect.
This PR fixes this old assumption properly, and also adds a
few tests, including some negative tests. Improvements
can be seen in error messages as well.
- Remove the requirement to honor storage class for deletes
- Improve the `posix.DeleteFileBulk` code to Stat the volumeDir
only once per call, rather than for every object path.
A recent modification in the code led to incorrect calculation
of offline disks.
This commit saves the endpoint list in xlObjects so we know
the name of each disk.
Lock ownership is limited to endpoints on the first zone,
as we do not hold locks on other zones in an expanded
setup. The current code unintentionally expired active locks
when it couldn't see ownership from the secondary zone,
which leads to unexpected bugs as locking fails to work
as expected.
This PR enforces md5sum verification for the following
APIs to be compatible with the AWS S3 spec:
- PutObjectRetention
- PutObjectLegalHold
Co-authored-by: Harshavardhana <harsha@minio.io>
Allow downloading a goroutine dump to help detect leaks
or overuse of goroutines.
Extensions are now type dependent.
Change the `profiling` prefix to `profile`, since that is what they are,
not the abstract concept.
This is a precursor change before versioning;
it removes/deprecates the requirement of remembering
partName and partETag, which are not useful after
a multipart transaction has finished.
This PR reduces the overall size of the backend
JSON for large file uploads.
For a non-existent user the server would return an 'STS not initialized' error:
```
aws --profile harsha --endpoint-url http://localhost:9000 \
sts assume-role \
--role-arn arn:xxx:xxx:xxx:xxxx \
--role-session-name anything
```
Instead, return an appropriate error as expected by the STS API.
Additionally, also format the `trace` output for the STS APIs.
Upgrades between releases are failing due to a strict
rule to avoid rolling upgrades; it is enough to
bump up APIs between versions to allow for quorum
failure and wait times. Authentication failures are
catastrophic in nature and lead to the server not
being able to upgrade properly.
Fixes #9021, fixes #8968
To allow better control of the cache eviction process,
introduce the MINIO_CACHE_WATERMARK_LOW and
MINIO_CACHE_WATERMARK_HIGH env. variables to specify
when to stop/start the cache eviction process.
Deprecate the MINIO_CACHE_EXPIRY environment variable. Cache
GC sweeps at 30-minute intervals whenever the high watermark is
reached, clearing least recently accessed entries in the cache
until sufficient space is cleared to reach the low watermark.
Garbage collection uses an adaptive file scoring approach based
on last access time, with greater weights assigned to larger
objects and those with more hits, to find the candidates for eviction.
Thanks to @klauspost for this file scoring algorithm
Co-authored-by: Klaus Post <klauspost@minio.io>
Change distributed locking to allow taking bulk locks
across objects, reducing what is usually 1,000 calls to 1.
This also allows for situations where multiple clients send
delete requests for objects with the following names
```
{1,2,3,4,5}
```
and
```
{5,4,3,2,1}
```
to block and ensure that we do not fail the requests
on each other.
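A toy bulk locker illustrating the idea (not dsync's actual implementation): sorting the names before acquisition makes the two orderings above equivalent, so neither client can deadlock the other:
```
package main

import (
	"fmt"
	"sort"
	"sync"
)

// nameLocker: one Lock call covers many names, acquired in sorted
// order so callers locking the same set in different orders can't
// deadlock each other.
type nameLocker struct {
	mu   sync.Mutex
	cond *sync.Cond
	held map[string]bool
}

func newNameLocker() *nameLocker {
	l := &nameLocker{held: make(map[string]bool)}
	l.cond = sync.NewCond(&l.mu)
	return l
}

func (l *nameLocker) Lock(names ...string) {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted) // canonical order prevents lock-order inversion
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, n := range sorted {
		for l.held[n] {
			l.cond.Wait() // wait until this name frees up
		}
		l.held[n] = true
	}
}

func (l *nameLocker) Unlock(names ...string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, n := range names {
		delete(l.held, n)
	}
	l.cond.Broadcast()
}

func main() {
	l := newNameLocker()
	l.Lock("1", "2", "3", "4", "5") // one bulk call instead of 5 (or 1000)
	done := make(chan struct{})
	go func() {
		l.Lock("5", "4", "3", "2", "1") // opposite order: still safe
		l.Unlock("1", "2", "3", "4", "5")
		close(done)
	}()
	l.Unlock("5", "4", "3", "2", "1")
	<-done
	fmt.Println("both bulk locks completed, no deadlock")
}
```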
Metrics used to have its own code to calculate offline disks.
StorageInfo() was avoided because it is an expensive operation,
sending calls to all nodes.
To make metrics & server info share the same code, a new
argument `local` is added to StorageInfo() so it will only
query local disks when needed.
Metrics now calls StorageInfo() as server info handler does
but with the local flag set to false.
Co-authored-by: Praveen raj Mani <praveen@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Add dummy calls which respond with success when ACLs
are set to 'private' and fail if the user tries
to change them from the default 'private'.
Some applications such as Nuxeo may have an
unnecessary requirement for this operation;
we support it anyway so that we don't have
to fully implement the functionality, just so
we can respond with success for default ACLs.
Avoid the GetObjectNInfo call from cache in CopyObjectHandler
- in the case of server-side copy with metadata replacement,
the reader returned from cache is never consumed, but the net
effect of GetObjectNInfo from the cache layer is the cache holding a
write lock to fill the cache. A subsequent stat operation on the cache in
CopyObject is not able to acquire a read lock, thus causing the hang.
Fixes #8991
We don't need to validateFormats again once we have obtained
the reference format, because it is possible that at this stage
another server is doing a disk heal during startup; once
in a while, due to delays, we get false positives and our
server doesn't start.
The format in quorum, used as the reference format, can be assumed to be valid,
and we proceed further until and unless HealFormat re-inits
the disks after a successful heal.
Also use a separate port for healing tests to avoid any
conflicts with regular build testing.
Fixes #8884
RegisterNotificationTargets() cleans up all connections
that it makes to notification targets when an error occurs
during its execution.
However, there is a typo in the code that makes the function always
try to access a nil pointer in the defer code, since the function
in question always returns nil in the case of any error.
This commit fixes the typo in the code.
http.Request.ContentLength can be negative, which affects
the gateway_s3_bytes_received value in Prometheus output.
The commit only increases the value of the total received bytes
in gateway mode when r.ContentLength is greater than zero.
The first step is to ensure that the Path component is not decoded
by gorilla/mux, to avoid routing issues while handling
certain characters during upload through PutObject().
Delay the decoding and use PathUnescape() to unescape
the `object` path component.
Thanks to @buengese and @ncw for neat test cases for us
to test with.
Fixes #8950, fixes #8647
We added support for caching and S3 related metrics in #8591. As
a continuation, it would be helpful to add support for Azure & GCS
gateway related metrics as well.
The logging subsystem was initialized under init() method in
both gateway-main.go and server-main.go which are part of
same package. This created two logging targets and hence
errors were logged twice. This PR moves the init() method
to common-main.go
Streams return a ReadCloser, and returning would
decrement the IO count instantly; fix it.
Change maxActiveIOCount to 3, meaning crawling
will pause if 3 operations are running.
Remove the random sleep: this is running in 4 goroutines,
so it is mostly doing nothing.
We use the getSize latency to estimate system load,
meaning when there is little load on the system and
we get the result fast, we sleep a little.
If it took a long time, we have high load and hold
ourselves back longer.
We are sleeping inside the mutex, so this affects all
goroutines doing IO.
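A sketch of the latency-proportional sleep (names are illustrative, and the mutex interaction is omitted for brevity):
```
package main

import (
	"fmt"
	"time"
)

// crawlAdaptive: the time a getSize call takes stands in for system
// load, so the crawler sleeps in proportion to it - fast result, short
// sleep; slow result, longer back-off.
func crawlAdaptive(items []string, getSize func(string) int64, factor int) {
	for _, item := range items {
		start := time.Now()
		size := getSize(item)
		elapsed := time.Since(start)
		fmt.Printf("%s: %d bytes in %v, sleeping %v\n",
			item, size, elapsed, elapsed*time.Duration(factor))
		time.Sleep(elapsed * time.Duration(factor))
	}
}

func main() {
	getSize := func(string) int64 {
		time.Sleep(time.Millisecond) // stand-in for disk stat latency
		return 1 << 20
	}
	crawlAdaptive([]string{"bucket/a", "bucket/b"}, getSize, 2)
}
```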
Due to a typo in the code, a cluster was not correctly creating
`background-ops` in all disks, and nodes printed the following error:
```
minio3_1 | API: SYSTEM()
minio3_1 | Time: 19:32:45 UTC 02/06/2020
minio3_1 | DeploymentID: d67c20fa-4a1e-41f5-b319-7e3e90f425d8
minio3_1 | Error: Bucket not found: .minio.sys/background-ops
minio3_1 | 2: cmd/data-usage.go:109:cmd.runDataUsageInfo()
minio3_1 | 1: cmd/data-usage.go:56:cmd.runDataUsageInfoUpdateRoutine()
```
This commit fixes the typo.
Adding a mutex slows down the crawler to avoid large
spikes in CPU; also add millisecond-interval jitter
in the calculation of disk usage to slow down the spikes
further.
This is to fix a situation where an object name is incorrectly
sent with '//' in its path hierarchy; we should reject
such object names because they may be hashed to a set where
the object does not originally belong. This can
cause situations where, once such an object is uploaded, we cannot
delete it anymore.
Fixes #8873
This commit fixes typos in the displayed server info
w.r.t. the KMS and removes the update status.
For more information about why the update status
is removed see: PR #8943
This commit removes the `Update` functionality
from the admin API. While this is technically
a breaking change I think this will not cause
any harm because:
- The KMS admin API is not complete, yet.
At the moment only the status can be fetched.
- The `mc` integration hasn't been merged yet.
So no `mc` client could have used this API
in the past.
The `Update`/`Rewrap` status is not useful anymore.
It provided a way to migrate from one master key version
to another. However, KES does not support the concept of
key versions. Instead, key migration should be implemented
as migration from one master key to another.
Basically, the `Update` functionality has been implemented just
for Vault.
- pkg/bucket/encryption provides support for handling bucket
encryption configuration
- changes under cmd/ provide support for AES256 algorithm only
Co-Authored-By: Poorna <poornas@users.noreply.github.com>
Co-authored-by: Harshavardhana <harsha@minio.io>
XL crawling wrongly returns a zero bucket count when
no objects have been uploaded to the server yet. The reason is
that the posix crawler returns an invalid result when all
disks have zero objects.
A simple fix is to not always pick the crawling result of the first
disk, but to choose the result of the disk which has the most
objects in it.
It looks like a 1024-byte buffer size is not enough in
all situations; use 8192 instead, which
can satisfy all the rare situations that
may arise in base64 decoding.
The multi-delete API failed with write quorum errors
under the following situations:
- the list of files requested for delete doesn't exist
anymore, which can lead to quorum errors and failure
- due to usage of a query param for paths, for really
long paths the MinIO server unexpectedly rejects these requests as
malformed.
This was reproduced with warp.
The server info handler makes an HTTP connection to other
nodes to check if they are up, but does not load the custom
CAs in ~/.minio/certs/CAs.
This commit fixes it.
Co-authored-by: Harshavardhana <harsha@minio.io>
JWT parsing is simplified by using a custom claims
data structure such as MapClaims{}; also write
a custom unmarshaller for faster unmarshalling.
- Avoid as much reflection as possible
- Provide the right types for functions as much
as possible
- Avoid strings.Join and strings.Split to reduce
allocations; rely on indexes directly.
For 'snapshot' type profiles, record a 'before' profile that can be used
as `go tool pprof -base=before ...` to compare before and after.
"Before" profiles are included in the zipped package.
[`runtime.MemProfileRate`](https://golang.org/pkg/runtime/#pkg-variables)
should not be updated while the application is running, so we set it at startup.
Co-authored-by: Harshavardhana <harsha@minio.io>
On every restart of the server, usage was being
calculated, which is not useful; instead, wait for
a sufficient time before starting the crawling routine.
This PR also avoids lots of double allocations
through strings, optimizes usage of string builders,
and avoids crawling through symbolic links.
Fixes #8844
Choosing maxAllowedIOError is arbitrary and
prone to errors, as drives might be perfectly
capable of taking I/O with only a few locations
returning I/O errors. This is a hindrance of sorts,
since backend filesystems like ZFS can automatically
fix and handle these scenarios.
The added problem with the current approach is that we
take the drive offline, making it virtually impossible
to bring it online without restarting the server, which
is not desirable on a busy cluster. Remove this state,
such that the backend returns errors appropriately
to the caller, and let the caller decide what to do with
the error.
Return http.ErrServerClosed with a proper body when the
server is shutting down, allowing more context instead
of just returning '503', which doesn't mean the same
thing.
Code of this form is always racy when the
map itself is being written to as well:
```
type Map struct {
	mu          sync.Mutex
	internalMap map[string]string
}
func (r *Map) retMap() map[string]string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.internalMap // the live map escapes the lock here
}
func (r *Map) addMap(k, v string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.internalMap[k] = v
}
```
Anyone reading from `retMap()` is not protected
by the locking, and we need to make sure
to avoid code in this manner. It is always safe to
copy the map and return it.
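The safe variant of the accessor above, applied to the same `Map` type: copy under the lock and hand out the copy, so readers never observe concurrent writes:
```
func (r *Map) retMap() map[string]string {
	r.mu.Lock()
	defer r.mu.Unlock()
	// Hand out a copy: concurrent addMap calls can no longer be
	// observed by the caller.
	out := make(map[string]string, len(r.internalMap))
	for k, v := range r.internalMap {
		out[k] = v
	}
	return out
}
```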
The zone abstraction of the object layer was returning `nil`
incorrectly in situations where disk healing is
not required. Returning `nil` is considered healing
successful, which leads to unexpected ReloadFormat()
peer notification calls during startup.
This PR fixes this behavior properly for zones.
Meta volume directories (tmp/, background-ops/, etc.)
under .minio.sys are created when disks are formatted,
but also when the cluster is started.
However, using MakeVolBulk() is not appropriate in the
case of a user migrating from a version which does not
have .minio.sys/background-ops/. The reason is that
MakeVolBulk() exits early when an error occurs:
errVolumeExists in this case, which is expected since
some directories such as tmp/ already exist.
This commit avoids MakeVolBulk and uses MakeVol
instead.
The PR also makes each node create meta volumes
on its local disks, and stop relying on the first disk,
since the first node could be offline.
Instead, perform a liveness check call to
verify if the server is online and print relevant
errors.
Also introduce a StorageErr string error type
instead of errors.New(); deprecate usage of
VerifyFileError and DeleteFileError for gob.
The change in data structure also requires a bump in the
storage REST version to v13.
Fixes #8811
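The pattern, sketched: a typed string error survives a gob round trip, unlike the unexported type behind errors.New:
```
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// StorageErr mirrors the typed-string error described above.
type StorageErr string

func (e StorageErr) Error() string { return string(e) }

func main() {
	gob.Register(StorageErr("")) // make the concrete type known to gob

	var buf bytes.Buffer
	in := []error{StorageErr("file not found"), StorageErr("disk not found")}
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		panic(err) // errors.New values would fail here
	}
	var out []error
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out[0], "/", out[1]) // both errors survive the round trip
}
```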
Enabling the memory profiling has a significant impact on performance.
Reduce the profiling rate by 2 orders of magnitude. It is still 128x smaller than default so it should be plenty.
object lock config is enabled for a bucket.
Creating a bucket with object lock configuration
enabled does not automatically cause WORM protection
to be applied; a PUT operation needs to specifically
request object locking, or the bucket has to have default
retention settings configured.
Fixes a regression introduced in #8657.
When formatting a set validate if a host failure will likely lead to data loss.
While we don't know what config will be set in the future
evaluate to our best knowledge, assuming default settings.
X-Cache sets a cache status of HIT if the object is
served from the disk cache, or MISS otherwise.
X-Cache-Lookup is set to HIT if the object was found
in the cache even if not served (e.g. if the cache
entry was invalidated by ETag verification).
A new key was added in identity_openid recently
that required the client to explicitly set the optional
value; without that it would be empty. Handle this
appropriately.
Fixes #8787
Use the reference format to initialize lockers
during startup; also handle `nil` for NetLocker
in dsync and remove the *errorLocker* implementation.
Add further tuning parameters such as:
- DialTimeout is now 15 seconds, from 30 seconds
- KeepAliveTimeout is now 20 seconds, 5 seconds
more than the default 15 seconds
- ResponseHeaderTimeout is 10 seconds
- ExpectContinueTimeout is reduced to 3 seconds
- DualStack is enabled by default; remove setting
it to `true`
- Reduce IdleConnTimeout to 30 seconds from
1 minute to avoid idle connection build-up
Fixes #8773
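A sketch of the resulting transport using the values listed above (the surrounding helper and other fields are assumptions):
```
package main

import (
	"net"
	"net/http"
	"time"
)

// newInternodeTransport reflects the tuning list above; DualStack is
// on by default in net.Dialer, so it is no longer set explicitly.
func newInternodeTransport() *http.Transport {
	return &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   15 * time.Second, // down from 30s
			KeepAlive: 20 * time.Second, // default 15s + 5s
		}).DialContext,
		ResponseHeaderTimeout: 10 * time.Second,
		ExpectContinueTimeout: 3 * time.Second,
		IdleConnTimeout:       30 * time.Second, // down from 1 minute
	}
}

func main() {
	client := &http.Client{Transport: newInternodeTransport()}
	_ = client
}
```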
- Stop spawning store replay routines when testing the notification targets
- Properly honor target.Close() to clean up the resources used
Fixes #8707
Co-authored-by: Harshavardhana <harsha@minio.io>
In certain organizations, policy claim names
can be not just 'policy' but also things like
'roles'; the value of this field might also
be *string* or *[]string* - support this as well.
In this PR we are still not supporting multiple
policies per STS account, which will require a
more comprehensive change.
Exponential backoff does not seem like a good fit for
this function, since we can expect a few roundtrips on
initial startup.
This retry loop gets slow pretty quickly, with the initial
wait being 1 second and each try doubling the
wait until 30 seconds is reached.
Instead, simply try twice per second.
This PR adds the jsoniter package to replace encoding/json
in places where faster JSON unmarshalling is necessary,
whenever the input JSON is large enough.
Some benchmarking comparison between jsoniter and encoding/json:
```
benchmark                          old MB/s   new MB/s   speedup
BenchmarkParseUnmarshal/N10-4      110.02     331.17     3.01x
BenchmarkParseUnmarshal/N100-4     125.74     524.09     4.17x
BenchmarkParseUnmarshal/N500-4     131.68     542.60     4.12x
BenchmarkParseUnmarshal/N1000-4    133.93     514.88     3.84x
BenchmarkParseUnmarshal/N5000-4    122.10     415.36     3.40x
BenchmarkParseUnmarshal/N10000-4   132.13     403.90     3.06x
```
Ensures that zones are appropriately
handled, along with supporting an overridden set
count. The new fix also ensures that we handle
the various setup types properly.
Update documentation to properly indicate the
behavior.
Fixes #8750
Co-authored-by: Nitish Tiwari <nitish@minio.io>
Currently, when connections to Vault fail, the client
perpetually retries; this leads to assumptions that
the server has issues, and masks the problem.
Re-purpose the *crypto.Error* type to send appropriate
errors back to the client.
In the existing functionality we simply return a generic
error such as "MalformedPolicy", which indicates just
a generic string "invalid resource"; this is not very
meaningful when there might be multiple types of errors
during policy parsing. This PR ensures that we send
these errors back to the client to indicate the actual
error, bringing in two concrete types:
- iampolicy.Error
- policy.Error
Refer to #8202
minio server /data{1..4} shows an error about the inability to bind a port, though
the real problem is that /data{1..4} cannot be created because of a lack of
permissions.
This commit fixes the behavior.
- We should declare a cluster ready even if only read quorum is achieved (at least n/2 disks are online).
- As such, all the zones should have enough read quorum, thus making the cluster ready for reads.
We had a messy cyclical dependency problem with `mc`
due to dependencies in pkg/console; moved pkg/console
to minio for more control, and also to avoid any further
cyclical dependencies of `mc` clobbering up the
dependencies on the server.
Fixes #8659
If two distinct clusters are started with different domains
along with a single common domain, this situation was leading
to conflicting buckets getting created on different clusters.
To avoid this, do not prematurely error out if the key has no
entries; let the caller decide which entry matches and
which entry is valid. This allows support for MINIO_DOMAIN
with one common domain, while each cluster may have its own
domains.
Fixes #8705
This is to ensure that when we have multiple tenants
deployed all sharing the same etcd for global buckets, we
avoid listing each other's buckets; such listing leads
to an information leak, which should be avoided unless
etcd is not namespaced for IAM assets, in which case
it can be assumed that it's a federated setup.
A federated setup with namespaced IAM assets on etcd
is not supported, since namespacing is only useful
when you wish to separate the tenants as isolated
instances of MinIO.
This PR allows a new type of behavior, primarily
driven by the use case of m3 (mkube) multi-tenant
deployments with global bucket support.
Currently all bucket events are sent to all watchers
with matching prefix and event names; this becomes
problematic and prone to performance issues. Fix this
situation by filtering based on buckets as well.
This PR fixes an issue where we might allow policy changes
for temporary credentials out of band; this situation allows
privilege escalation for those temporary credentials. We
should disallow any external actions on temporary creds
as a practice, and we should clearly differentiate which
credentials are static and which are temporary.
Refer to #8667
The IPs underneath are dynamic in replica sets
with multiple tenants, so deploying in that fashion
will not work until we wait for at least one participatory
server to be local.
This PR also ensures that multi-tenant zone expansion
works in replica-set k8s deployments.
Introduces a new ENV `KUBERNETES_REPLICA_SET` check to call
the appropriate code paths.
Continuation from #8629, which basically broke
zone deployments in k8s statefulset environments
due to incorrect assumptions which made it work
on replica sets.
Fix this properly such that this container works
for both replica-set and statefulset deployments.
This commit removes github.com/minio/kes as
a dependency and implements the necessary
client-side functionality without relying
on the KES project.
This resolves the licensing issue since
KES is licensed under AGPL while MinIO
is licensed under Apache.
With this PR, cache eviction will continue until
no LRU entries older than an hour can be
evicted, or a sufficient percentage of disk space
has been reclaimed.
This ensures that:
- .minio.sys is updated for accounting/data usage purposes
- .minio.sys is updated to indicate if the backend is encrypted
or not.
The approach is that safe mode is now only invoked when
we cannot read the config or under some catastrophic
situations, but not when config entries
are invalid or unreachable. This allows for maximum
availability for MinIO, not failing on our users unlike
most of our historical releases.
This commit adds support for the minio/kes KMS.
See: https://github.com/minio/kes
In particular you can configure it as KMS by:
- `export MINIO_KMS_KES_ENDPOINT=` // Server URL
- `export MINIO_KMS_KES_KEY_FILE=` // TLS client private key
- `export MINIO_KMS_KES_CERT_FILE=` // TLS client certificate
- `export MINIO_KMS_KES_CA_PATH=` // Root CAs issuing server cert
- `export MINIO_KMS_KES_KEY_NAME=` // The name of the (default)
master key
Admin data usage info API returns the following
(Only FS & XL, for now)
- Number of buckets
- Number of objects
- The total size of objects
- Objects histogram
- Bucket sizes
In replica sets, hosts resolve to localhost
IP automatically until the deployment fully
comes up. To avoid this issue we need to
wait for such resolution.
Final update to all messages across sub-systems
after final review, the only change here is that
NATS now has TLS and TLSSkipVerify to be consistent
for all other notification targets.
This PR adds support for the below metrics:
- Cache Hit Count
- Cache Miss Count
- Data served from Cache (in Bytes)
- Bytes received from AWS S3
- Bytes sent to AWS S3
- Number of requests sent to AWS S3
Fixes #8549
Fixes an issue reported by @klauspost and @vadmeste.
This PR also allows users to expand their clusters
from a single-node XL deployment to distributed mode.
Currently, we use the top-level prefix "config/"
for all our IAM assets; to provide
tenant-level separation, bring in 'path_prefix'
to namespace the access properly.
Fixes #8567
The StorageInfo() call is supposed to give each
server/disk's information independently; rely
on this appropriately so that `mc admin info server`
gets correct information all the time.
Regression from #8509, which changed the objectsToDelete entry
from a list to a map. This caused an index out of range
panic if an object was not selected for delete.
At the object level, this PR builds on #8120, which
added the PutBucketObjectLockConfiguration and
GetBucketObjectLockConfiguration APIs.
This PR implements the PutObjectRetention and
GetObjectRetention APIs, and enhances
the PUT and GET API operations to display
governance metadata if permissions allow.
- added ability to specify CA for self-signed certificates
- added option to authenticate using client certificates
- added unit tests for nats connections
We should only read ahead if we are reading big files. We enable it for files >= 16MB.
Benchmark on 64KB objects.
Before:
```
Operation: GET
Errors: 0
Average: 59.976s, 87.13 MB/s, 1394.07 ops ended/s.
Fastest: 1s, 90.99 MB/s, 1455.00 ops ended/s.
50% Median: 1s, 87.53 MB/s, 1401.00 ops ended/s.
Slowest: 1s, 81.39 MB/s, 1301.00 ops ended/s.
```
After:
```
Operation: GET
Errors: 0
Average: 59.992s, 207.99 MB/s, 3327.85 ops ended/s.
Fastest: 1s, 219.20 MB/s, 3507.00 ops ended/s.
50% Median: 1s, 210.54 MB/s, 3368.00 ops ended/s.
Slowest: 1s, 179.14 MB/s, 2865.00 ops ended/s.
```
The 64KB buffer is actually a small disadvantage for this case, but I believe it will be better in general than no buffer.
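A sketch of the gating logic, assuming an asynchronous read-ahead reader such as github.com/klauspost/readahead; the helper name and buffer counts below are illustrative:
```
package main

import (
	"io"
	"os"

	"github.com/klauspost/readahead"
)

// Read-ahead is worth its buffer overhead only for big files, so it
// kicks in at 16 MiB.
const readAheadMin = 16 << 20

func openForRead(path string) (io.Reader, func(), error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	st, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, nil, err
	}
	if st.Size() < readAheadMin {
		return f, func() { f.Close() }, nil // small file: read directly
	}
	ra, err := readahead.NewReaderSize(f, 4, 1<<20) // 4 async 1 MiB buffers
	if err != nil {
		f.Close()
		return nil, nil, err
	}
	return ra, func() { ra.Close(); f.Close() }, nil
}

func main() {
	r, done, err := openForRead("/tmp/data.bin")
	if err != nil {
		return
	}
	defer done()
	_, _ = io.Copy(io.Discard, r)
}
```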
- Migrate and save only settings which are enabled
- Rename logger_http to logger_webhook and
logger_http_audit to audit_webhook
- No more pretty printing comments, comment
is a key=value pair now.
- Avoid quotes on values which do not have space in them
- `state="on"` is implicit for all SetConfigKV unless
specified explicitly as `state="off"`
- Disabled IAM users should be disabled always
The code does not properly set a new deployment ID when not present
in format.json: it loops twice without releasing the write lock on format.json,
causing an infinite locking error on the same file.
This commit fixes and simplifies the code a little.
This PR changes locking from a global entity into
a more localized, set-level entity, allowing locks
to be held only on the resources which are writing
to a collection of disks rather than at a global level.
In this process, this PR also removes the top-level
limit of 32 nodes in favor of an unlimited number of nodes. This
is a precursor change before bringing in bucket expansion.
This PR also fixes issues related to:
- Add a proper newline to `mc admin config get` output
for more than one target
- Fixes the issue of temporary user credentials to have
consistent output
- Fixes a crash when setting a key with empty values
- Fixes a parsing issue with `mc admin config history`
- Fixes gateway ENV handling for etcd server and gateway
This PR fixes issues found in config migration
- StorageClass migration error when rrs is empty
- Plain-text migration of older config
- Do not run in safe mode with incorrect credentials
- Update logger_http documentation for _STATE env
Refer to more reported issues at #8434
This PR refactors object layer handling such
that upon failure in sub-system initialization
server reaches a stage of safe-mode operation
wherein only certain API operations are enabled
and available.
This allows for fixing many scenarios such as
- incorrect configuration in vault, etcd,
notification targets
- missing files, incomplete config migrations
unable to read encrypted content etc
- any other issues related to notification,
policies, lifecycle etc
This PR brings support for the `history` list to
list in the following agreed format:
```
~ mc admin config history list -n 2 myminio
RestoreId: df0ebb1e-69b0-4043-b9dd-ab54508f2897
Date: Mon, 04 Nov 2019 17:27:27 GMT
region name="us-east-1" state="on"
region name="us-east-1" state="on"
region name="us-east-1" state="on"
region name="us-east-1" state="on"
RestoreId: ecc6873a-0ed3-41f9-b03e-a2a1bab48b5f
Date: Mon, 04 Nov 2019 17:28:23 GMT
region name=us-east-1 state=off
```
This PR also moves the help templating and coloring
fully to the `mc` side rather than the `madmin` API.
This PR adds code to appropriately handle versioning issues
that come up quite constantly across our API changes. Currently,
we were also routing our requests incorrectly, which made it
harder to write consistent error handling code to appropriately
reject or honor requests.
This PR potentially fixes issues where:
- an old mc used against a new minio release which is incompatible
returns an appropriate error for client action.
- any older servers talking to each other report appropriate errors
- incompatible peer servers report errors and reject the calls
with an appropriate error
This is to avoid making calls to the backend and requiring
gateways to allow permissions for the ListBuckets() operation
just for liveness checks; we can avoid this and make
our liveness checks more performant.
- Supports migrating only when the credential ENVs are set,
so any FS mode deployments which do not have ENVs set will
continue to remain as is.
- Credential ENVs can be rotated using the MINIO_ACCESS_KEY_OLD
and MINIO_SECRET_KEY_OLD envs; in such scenarios it is allowed
to rotate the encrypted content to a new admin key.
This commit replaces the returned error message by
the hardware info handler from `Method-Not-Allowed`
to `Bad-Request` since the current HTTP error is not
correct according to the HTTP spec.
In particular:
```
The origin server MUST generate an Allow header field
in a 405 response containing a list of the target
resource's currently supported methods.
```
From: https://tools.ietf.org/html/rfc7231#section-6.5.5
- This PR allows config KVS to be validated properly
without being affected by ENV overrides, rejects
invalid values during set operation
- Expands unit tests and refactors the error handling
for notification targets, returns error instead of
ignoring targets for invalid KVS
- Does all the prep-work for implementing safe-mode
style operation for MinIO server, introduces a new
global variable to toggle safe mode based operations
NOTE: this PR itself doesn't provide safe mode operations
The new auto-healing model selects one node that is always responsible
for auto-healing the whole cluster, erasure set by erasure set.
If that node dies, another node will be elected as the leading
operator to perform healing.
This code also adds a goroutine which checks every 10 minutes
whether there are any new unformatted disks, and performs their healing
in that case; only the erasure set which has the new disk will
be healed.
While tracing requests on the server, a type assertion on logger.ResponseWriter
caused a nil pointer exception because of recordAPIStats{} being
used as the ResponseWriter. This PR avoids the type assertion and
initializes a new logger.ResponseWriter.
Fixes a regression introduced in #8003
- adding oauth support to MinIO browser (#8400) by @kanagaraj
- supports multi-line get/set/del for all config fields
- add support for comments, allow toggle
- add extensive validation of config before saving
- support MinIO browser to support proper claims, using STS tokens
- env support for all config parameters, legacy envs are also
supported with all documentation now pointing to latest ENVs
- preserve accessKey/secretKey from FS mode setups
- add history support implements three APIs
- ClearHistory
- RestoreHistory
- ListHistory
- add help command support for each config parameters
- all the bug fixes after migration to KV, and other bug
fixes encountered during testing.
The measures are consolidated to the following metrics
- `disk_storage_used` : Disk space used by the disk.
- `disk_storage_available`: Available disk space left on the disk.
- `disk_storage_total`: Total disk space on the disk.
- `disks_offline`: Total number of offline disks in current MinIO instance.
- `disks_total`: Total number of disks in current MinIO instance.
- `s3_requests_total`: Total number of s3 requests in current MinIO instance.
- `s3_errors_total`: Total number of errors in s3 requests in current MinIO instance.
- `s3_requests_current`: Total number of active s3 requests in current MinIO instance.
- `internode_rx_bytes_total`: Total number of internode bytes received by current MinIO server instance.
- `internode_tx_bytes_total`: Total number of bytes sent to the other nodes by current MinIO server instance.
- `s3_rx_bytes_total`: Total number of s3 bytes received by current MinIO server instance.
- `s3_tx_bytes_total`: Total number of s3 bytes sent by current MinIO server instance.
- `minio_version_info`: Current MinIO version with commit-id.
- `s3_ttfb_seconds_bucket`: Histogram that holds the latency information of the requests.
And this PR also modifies the current StorageInfo queries
- Decouples StorageInfo from ServerInfo .
- StorageInfo is enhanced to give endpoint information.
NOTE: ADMIN API VERSION IS BUMPED UP IN THIS PR
Fixes #7873
xl.isObject() returns 'nil' for not-found disks when
calculating the existence of xl.json for a given object,
which is what StatFile() also does (returning nil) if
xl.json exists.
This commit avoids this confusion by setting the errDiskNotFound
error when the storage disk is not found.
When `mc admin user add` is attempted in gateway mode without
etcd setup, NoSuchBucket error is returned instead of MethodNotAllowed.
Regression from commit - e48005ddc7
Background append creates a temporary file which appends
uploaded parts as long as they are available; but when a
client stops the upload, the temporary file is never removed.
This commit removes the temporary file when the server does
its regular removal of stale multipart uploads.
Console logging on the server by default lists all logs;
enhance the admin console API to accept `type` as a query parameter to
subscribe to minio-specific errors, `application` errors, or `all`
by default.
- This PR fixes a situation to avoid underflow, which is possible
because of disconnected operations in replay/sendEvents
- Hold the right locks if a Del() operation is performed in Get()
- Remove panic in the code and use loggerOnce
- Remove Timer and use Ticker instead for proper ticks
On Kubernetes/Docker setups, DNS sometimes resolves inappropriately:
there are situations where the same endpoints with
multiple disks come online indicating that one of them
is local while some of them are not. This situation
can never happen, and it's only a possibility in orchestrated
deployments with dynamic DNS. The following code ensures that if
one of the endpoints says it's local for a given host,
it is treated as local for all endpoints of the same host; this
assumption holds in all scenarios and is safe to make
for a given host.
This PR also adds validation such that we do not crash the
server if there are bugs in the endpoints list in dsync
initialization.
Thanks to Daniel Valdivia <hola@danielvaldivia.com> for
reproducing this; this fix is needed as part of the
https://github.com/minio/m3 project.
This change is related to larger config migration PR
change, this is a first stage change to move our
configs to `cmd/config/` - divided into its subsystems
Current master repeatedly calls ListBuckets() during
the initialization of multiple sub-systems.
Use a single ListBuckets() call for each sub-system, as
follows:
- LifeCycle
- Policy
- Notification
- Heal if part.1 is truncated from its original size
- Heal if part.1 fails while being verified in between
- Heal if part.1 fails while being at a certain offset
Other cleanups include making sure to flush the HTTP responses
properly from storage-rest-server, and avoiding 'defer' to
improve call latency; 'defer' incurs latency, so avoid it
in our hot paths such as the storage-rest handlers.
Fixes #8319
Minio V2 listing uses object names/prefixes as continuation tokens. This
is problematic when object names contain characters that are forbidden
in XML documents. This PR will use the base64-encoded form of continuation
and next-continuation tokens to address that corner case.
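A minimal sketch of the token round trip:
```
package main

import (
	"encoding/base64"
	"fmt"
)

// Tokens are base64 wrappers around object names, so names with
// XML-unsafe characters survive the response document.
func encodeToken(objectName string) string {
	return base64.StdEncoding.EncodeToString([]byte(objectName))
}

func decodeToken(token string) (string, error) {
	b, err := base64.StdEncoding.DecodeString(token)
	return string(b), err
}

func main() {
	token := encodeToken("logs/day=07/\x06odd-name.txt") // XML-hostile byte
	fmt.Println("NextContinuationToken:", token)
	name, _ := decodeToken(token)
	fmt.Printf("resumes after: %q\n", name)
}
```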
errors.errorString() cannot be marshalled by the gob
encoder, so using a slice of []error would fail
to be encoded. This led to no errors being
generated; instead, the gob.Decoder on the storage client
would see an io.EOF.
To avoid such bugs, introduce a typed error for
handling such translations and register this type
for gob encoding support.
Objects bigger than 100MiB could be stored corrupted
due to a partial blockListing attempted right after
each block upload. Simplify this code to ensure that
all the successfully uploaded blocks are committed right
away.
This PR also updates azure-sdk-go to the latest release.
- Choose a unique uuid such that under situations of duplicate
mounts we do not append to an existing json entry.
- Avoid AppendFile; instead use WriteAll() to write the entire
byte array atomically.
When URL encoding is requested in the v2 listing handler, the continuationToken
and nextContinuationToken need to be encoded. The reason is that
both represent an object name/prefix in the MinIO server, and either could
contain a character unsupported by the XML specification.
This PR additionally also adds support for missing
- Session policy support for AD/LDAP
- Add API request/response parameters detail
- Update example to take ldap username,
password input from the command line
- Fixes session policy handling for
ClientGrants and WebIdentity
This commit removes the SSE-S3 key rotation functionality
from CopyObject since there will be a dedicated Admin-API
for this purpose.
Also update the security documentation to link to mc and
the admin documentation.
This commit allows the MinIO server to parse the metadata if:
- either the `X-Minio-Internal-Server-Side-Encryption-S3-Key-Id`
and the `X-Minio-Internal-Server-Side-Encryption-S3-Kms-Sealed-Key`
entries are present.
- or *both* headers are not present.
This is in service to support a K/V data key store.
In commit a8296445ad we changed the code to handle
some corner cases on ARM and other platforms; this
PR just avoids returning prematurely for unknown filetypes
and lets the name be populated appropriately.
This fixes a bug for older XFS implementations such as
in Ubuntu 14.04
This commit fixes a subtle bug that (probably)
caused an issue affecting encrypted multipart objects.
When a cluster has no quorum this bug causes `ListObjectParts`
to return nil as error instead of a quorum error.
Thanks to @harshavardhana for detecting this.
This PR changes cache on PUT behavior to background fill the cache
after PutObject completes. This will avoid concurrency issues as in #8219.
Added cleanup of partially filled cache to prevent cache corruption
Fixes #8208
VerifyFile in the distributed setup does not work with
the non streaming highway hash. The reason is that the
internode mux router did not expect `storageRESTBitrotHash`
parameter.
endpoint.IsLocal will not have .Host entries, so
using them to skip double entries will never work.
Change the code such that we look for endpoint.Host
outside of the endpoint.IsLocal logic to skip duplicate
hosts appropriately.
Move these functions to their appropriate file.
- ListObjectsHeal should list only objects
which need healing, not the entire namespace.
- DeleteObjects() to be used to delete 1000s of
objects in bulk instead of serially.
Add LDAP based users-groups system
This change adds support to integrate an LDAP server for user
authentication. This works via a custom STS API for LDAP. Each user
accessing the MinIO who can be authenticated via LDAP receives
temporary credentials to access the MinIO server.
LDAP is enabled only over TLS.
User groups are also supported via LDAP. The administrator may
configure an LDAP search query to find the group attribute of a user -
this may correspond to any attribute in the LDAP tree (that the user
has access to view). One or more groups may be returned by such a
query.
A group is mapped to an IAM policy in the usual way, and the server
enforces a policy corresponding to all the groups and the user's own
mapped policy.
When LDAP is configured, the internal MinIO users system is disabled.
From an implementation point of view, the fastjson
parser pool doesn't behave as expected
when dealing with many `xl.json` files from multiple disks.
The fastjson parser pool usage ends up returning incorrect
xl.json entries for checksums, with references pointing
to older entries. This led to a subtle bug where checksum
info is duplicated from a previous xl.json read of a different
file from a different disk.
With this PR, the liveness check responds with 200 OK and a
"server-not-initialized" header while the objectLayer gets
initialized; the header is removed once the objectLayer is
initialized. This allows a MinIO distributed cluster to get
started when running on orchestration platforms like Docker Swarm.
This PR also updates the sample Swarm yaml files to use correct values
for healthcheck fields.
Fixes #8140
This commit adds an admin API route and handler for
requesting status information about a KMS key.
Therefore, the client specifies the KMS key ID (when
empty / not set the server takes the currently configured
default key-ID) and the server tries to perform a dummy encryption,
re-wrap and decryption operation. If all three succeed we know that
the server can access the KMS and has permissions to generate, re-wrap
and decrypt data keys (policy is set correctly).
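A sketch of the probe, with a hypothetical KMS interface (the method
names here are illustrative, not MinIO's actual API):
```
// Probe a KMS key by generating, re-wrapping and unsealing a dummy
// data key; success implies the KMS policy is set correctly.
package main

import "fmt"

type KMS interface {
	GenerateKey(keyID string) (plaintext, sealed []byte, err error)
	UpdateKey(keyID string, sealed []byte) ([]byte, error)
	UnsealKey(keyID string, sealed []byte) ([]byte, error)
}

func checkKMSKey(kms KMS, keyID string) error {
	_, sealed, err := kms.GenerateKey(keyID) // dummy encryption
	if err != nil {
		return fmt.Errorf("cannot generate data keys: %v", err)
	}
	if sealed, err = kms.UpdateKey(keyID, sealed); err != nil { // re-wrap
		return fmt.Errorf("cannot re-wrap data keys: %v", err)
	}
	if _, err = kms.UnsealKey(keyID, sealed); err != nil { // decryption
		return fmt.Errorf("cannot decrypt data keys: %v", err)
	}
	return nil
}

func main() {}
```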
This is a defensive change to avoid any future issues,
from this part of the code. New change also ensures
to populate UUID if present for the right disk.
This commit fixes the web ZIP download handler for
encrypted objects. The decryption logic has moved into
`getObjectNInfo`. So trying to decrypt the (already decrypted)
content again in the ZIP handler obviously causes an error.
This commit fixes this by removing the decryption logic from
the handler.
Fixes #7965
The underlying errors are important for IAM
requirements, and callers should wait appropriately
on them; this allows distributed
setups to run properly and not fail prematurely
during startup.
Also fix the onlineDisk counting.
The change now is to ensure that we take custom URL as
well for updating the deployment, this is required for
hotfix deliveries for certain deployments - other than
the community release.
This commit changes the previous work d65a2c6725
with newer set of requirements.
Also deprecates PeerUptime()
net.Error is very unreliable in providing better error
handling, we need to ensure that we always have a fallback
option in case of network failures.
This fixes an important issue in our distributed server
setups when one of the servers is down, all deployments
out there are recommended to upgrade after this fix is
merged to ensure that availability is not lost.
Fixes #8127, Fixes #8016, Fixes #7964
This API implementation simply behaves like listObjects(),
but returns back a single version for each object; this
implementation should be considered a dummy, only
meant for some applications which rely on it.
This is to avoid a code dependency on the
unsafe.Pointer type for MinIO, which causes
crashes on ARM64 platforms.
Refer to #8005 for a collection of runtime crashes due
to incorrect unsafe.Pointer usage. We have
seen issues like this before when using the
jsoniter library in the past.
This PR hopes to fix this using fastjson
Add better dynamic timeouts for locks, and
add jitter before launching the daily sweep to ensure
that the servers in a distributed setup do not all
try to hold locks to begin the sweep
round at the same time.
Also, add enough delay for incoming requests based
on totalSetCount*totalDriveCount.
A possible fix for #8071
It is observed that when `mc admin trace` is being
used, due to the ResponseWriter wrapper we lose information
about statusCode/statusText for audit logging.
This PR fixes this behavior
This avoids a network call, and also fixes an issue
where, when empty paths are passed, the underlying call
fails with "405 Method Not Allowed".
This is reproducible when you are deleting a
non-existent object.
Fixes #8083
Add API to set policy mapping for a user or group
Contains a breaking Admin APIs change.
- Also enforce all applicable policies
- Removes the previous /set-user-policy API
Bump up peerRESTVersion
Add get user info API to show groups of a user
* Cleanup ui-errors and print proper error messages
Change HELP to HINT instead, handle more error
cases when starting up MinIO. One such is related
to #8048
* Apply suggestions from code review
When a peer client with a higher version sends a request to a peer
server with a lower version, the returned status code is 200 OK instead
of a 405 code. The reason is that the peer client request reaches the
browser handler, which registers itself on the '/minio' route but without
any other constraints. Add filtering by the user agent header to the
browser route so internal requests to old endpoint versions return
a 405 error code.
This is a behavior change from AWS S3, but it is done with
better judgment on our end to allow the listing of buckets only
which user has access to.
The advantage is this declutters the UI for users and only
lists buckets which they have access to.
As a precursor, for this feature to be applicable a policy
must have the following actions
```
s3:ListAllMyBuckets
```
and
```
s3:ListBucket
```
enabled in the policy.
Fixes #7458, Fixes #7573, Fixes #7938, Fixes #6934, Fixes #6265, Fixes #6630
This will allow the cache to consistently work for
server and gateways. Range GET requests will
be cached in the background after the request
is served from the backend.
- All cached content is automatically bitrot protected.
- Avoid ETag verification if a cache-control header
is set and the cached content is still valid.
- This PR changes the cache backend format, and all existing
content will be migrated to the new format. Until the data is
migrated completely, all content will be served from the backend.
Refactor the Dirent parsing code such that the
offsets we calculate are correct for the platform.
This PR fixes a silent potential crash on the ARM
architecture.
This commit fixes a DoS issue that is caused by an incorrect
SHA-256 content verification during STS requests.
Before that fix clients could write arbitrary many bytes
to the server memory. This commit fixes this by limiting the
request body size.
This change adds admin APIs and IAM subsystem APIs to:
- add or remove members to a group (group addition and deletion is
implicit on add and remove)
- enable/disable a group
- list and fetch group info
When checking if federation is necessary, the code compares
the SRV record stored in etcd against the list of endpoints
that the MinIO server is exposing. If there is an intersection
in this list the request is forwarded.
The SRV record includes both the host and the port, but the
intersection check previously only looked at the IP address. This
would prevent federation from working in situations where the endpoint
IP is the same for multiple MinIO servers. Some examples of where this
can occur are:
- running multiple copies of MinIO on the same host
- using multiple MinIO servers behind a NAT with port-forwarding
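A minimal sketch of the corrected comparison (helper names are
illustrative): the lookup key must be the full host:port pair, so two
instances sharing an IP are still distinguished.
```
package main

import (
	"fmt"
	"net"
)

// hostPortSet builds a lookup of "host:port" keys from endpoint pairs.
func hostPortSet(addrs [][2]string) map[string]struct{} {
	set := make(map[string]struct{}, len(addrs))
	for _, a := range addrs {
		set[net.JoinHostPort(a[0], a[1])] = struct{}{}
	}
	return set
}

// shouldForward is true when the SRV target is NOT one of our own
// endpoints; host AND port must match for the request to be local.
func shouldForward(srvHost, srvPort string, local map[string]struct{}) bool {
	_, ok := local[net.JoinHostPort(srvHost, srvPort)]
	return !ok
}

func main() {
	local := hostPortSet([][2]string{{"10.0.0.5", "9000"}})
	fmt.Println(shouldForward("10.0.0.5", "9001", local)) // true: second instance on same IP
	fmt.Println(shouldForward("10.0.0.5", "9000", local)) // false: serve locally
}
```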
This commit adds a new method `UpdateKey` to the KMS
interface.
The purpose of `UpdateKey` is to re-wrap an encrypted
data key (the key generated & encrypted with a master key by e.g.
Vault).
For example, consider Vault with a master key ID: `master-key-1`
and an encrypted data key `E(dk)` for a particular object. The
data key `dk` has been generated randomly when the object was created.
Now, the KMS operator may "rotate" the master key `master-key-1`.
However, the KMS cannot forget the "old" value of that master key
since there is still an object that requires `dk`, and therefore,
the `D(E(dk))`.
With the `UpdateKey` method call MinIO can ask the KMS to decrypt
`E(dk)` with the old key (internally) and re-encrypt `dk` with
the new master key value: `E'(dk)`.
However, this operation only works for the same master key ID.
When rotating the data key (replacing it with a new one) then
we perform a `UnsealKey` operation with the 1st master key ID
and then a `GenerateKey` operation with the 2nd master key ID.
This commit also updates the KMS documentation, removes
the `encrypt` policy entry (we don't use `encrypt`) and
adds a policy entry for `rewrap`.
After some extensive refactors, it turned out empty directories
are not healed and heal status is also not reported correctly.
This commit fixes it and adds the appropriate unit tests
This PR fixes the practice of relying on r.Context().Done()
by setting the
```
Connection: "close"
```
HTTP header, which has detrimental effects on
client-side connection pooling: the
header explicitly tells clients to turn off
connection pooling, causing proactive
connections to be closed and leaving many conn's
in TIME_WAIT state. This can be observed with
`mc admin trace -a` when running a distributed
setup.
This PR also fixes a tracing filtering issue:
when bucket names have `minio` as a prefix,
trace was erroneously ignoring them.
Golang proactively prints the error
`http: proxy error: context canceled`
when a request arrives at the current deployment and
is redirected to another deployment in a federated setup.
Since this error can confuse users, this commit will
just hide it.
Allow renaming/editing a notification config by replying with
a successful GetBucketNotification response, without checking
for any missing config ARN in the targetList.
Fixes #7650
Related to #7982, this PR refactors the code
such that we validate the OPA or JWKS in a
common place.
This is also a refactor which is already done
in the new config migration change. Attempt
to avoid any network I/O during Unmarshal of
JSON from disk, instead do it later when
updating the in-memory data structure.
In situations where a client uploading data
prematurely disconnects from the server, such as pressing
ctrl-c before uploading all the data, in a distributed
setup we prematurely disconnect disks, causing a reconnect
loop. This has an adverse effect: we end up leaving a lot
of files in the temporary location which ideally should
have been cleaned up when Put() prematurely fails.
This is also a regression which got introduced in #7610
- Policy mapping is now at `config/iam/policydb/users/myuser1.json`
and includes version.
- User identity file is now versioned.
- Migrate old data to the new format.
Calling ListMultipartUploads fails if an upload is aborted while a
part is being uploaded because the directory for the upload exists
(since fsRenameFile ends up calling os.MkdirAll) but the meta JSON file
doesn't. To fix this we make sure an upload hasn't been aborted during
PutObjectPart by checking the existence of the directory for the upload
while moving the temporary part file into it.
This PR is based off @sinhaashish's PR for object lifecycle
management, which includes support only for
- Expiration of objects
- Filtering using object prefix (_not_ object tags)
N.B. the code for actual expiration of objects will be included in a
subsequent PR.
Current code already allows users to GetPolicy/SetPolicy;
code was missing in ListAllBucketPolicies to allow
access. This fixes that behavior.
Fixes #7913
The fix in #7646 introduced a regression which
was left unnoticed: the fix didn't work for
sub-commands unfortunately. This fixes it
by moving to v1.21.0 of the minio/cli
package.
Fixes #7924
posix.VerifyFile() doesn't know how to check whether a file
is corrupted if that file is empty. We do have the part
size in xl.json, so we pass it to VerifyFile to return
an error, so healing empty parts can work properly.
When MinIO is behind a proxy, proxies end up killing
clients when no data is seen on the connection, adding
the right content-type ensures that proxies do not come
in the way.
The update prompt shows some weird characters under Windows. The reason
is that go-prompt is used to show a yes/no prompt; since go-prompt
does not seem to have a way to support color/fatih, this PR
implements its own yes/no prompt with the correct text coloration.
This allows for canonicalization of the strings
throughout our code and provides a common space
for all these constants to reside.
This list is rather non-exhaustive but captures
all the headers used in AWS S3 API operations
r.URL.Path is empty when a HEAD bucket request with virtual
DNS comes in, since the bucket is now part of
r.Host; we should use our domain names and fetch
the right bucket/object names.
This fixes a really old issue in our federation
setups.
This API returns the information related to the self healing routine.
For the moment, it returns:
- The total number of objects that are scanned
- The last time when an item was scanned
- return all objects in web-handlers listObjects response
- added local pagination to object list ui
- also fixed infinite loader and removed unused fields
While checking for port availability, the ip address should be included.
When a machine has multiple ip addresses, multiple minio instances
or other applications can run on the same port but different
ip addresses.
Fixes #7685
This PR adds support for adding session policies
for further restrictions on STS credentials, useful
in situations when applications want to generate
creds for multiple interested parties with different
set of policy restrictions.
This session policy is not mandatory, but optional.
Fixes #7732
This commit relaxes the restriction that the MinIO gateway
does not accept SSE-KMS headers. Now, the S3 gateway allows
SSE-KMS headers for PUT and MULTIPART PUT requests and forwards them
to the S3 gateway backend (AWS). This is considered SSE pass-through
mode.
Fixes #7753
Previously the read/write lock applied both to gateway use cases as
well as the object store use case. Nothing from sys is touched or looked
at in the gateway use case though, so we don't need to lock. Not locking
makes gateway policy fetching a little more efficient, particularly
as the place this is called from (checkRequestAuthType) is quite common.
etcd, when used in federated setups, currently
mandates that all clusters have the same
config.json, which is too restrictive and makes
federation an inflexible environment.
This change makes it apparent that each cluster
needs to be independently managed if necessary
from `mc admin info` command line.
Each cluster within the federation can have its
own root credentials as well as separate
regions. This way buckets get further restrictions
and root creds need not be common
across clusters/data centers.
Existing data in etcd gets migrated to the backend
on each cluster upon start. Once done,
users can change their config entries
independently.
- The background heal routine receives heal requests from a channel, either to
heal format, buckets or objects
- The daily sweeper lists all objects in all buckets; these objects
don't necessarily have read quorum, so they can be removed if
they are unhealable
- Heal daily ops receives objects from the daily sweeper
and sends them to the heal routine.
Inconsistencies can arise after applying bucket policies in
gateway mode, since all gateway instances do not share a
common shared state. This is by design to keep gateway as
shared nothing architecture.
This PR fixes such inconsistencies by reloading policy
if any from the backend.
Fixes #7723
Consider errors returned by httpClient.Do() as network errors. This is because
the http client returns different types of errors and it is hard to catch
all the error types.
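A sketch of such a coarse classification (not MinIO's actual helper):
```
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// isNetworkError classifies an error returned by httpClient.Do().
func isNetworkError(err error) bool {
	if err == nil {
		return false
	}
	if _, ok := err.(net.Error); ok {
		return true // timeouts, refused connections, DNS failures
	}
	// http.Client wraps transport failures in *url.Error; treat the
	// remaining Do() failures as network errors too, since the client
	// returns too many error types to enumerate.
	_, ok := err.(*url.Error)
	return ok
}

func main() {
	err := &url.Error{Op: "Get", URL: "http://peer:9000", Err: errors.New("connection refused")}
	fmt.Println(isNetworkError(err)) // true
}
```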
With these changes we now reach peak performance
for all Write() operations across HDD and NVMe disks.
Also adds readahead for disk reads, which increases
read performance by 3x.
IsTruncated should not be set to true if there are no further
possible entries beyond maxKeys.
This commit will also move wide testing on object API from xl
to xl sets.
The problem in the current code was that we were removing
an entry from the lockerMap without considering
the fact that a different entry for the same resource is
a possibility, due to the nature of locks that can be
acquired in parallel before we decide if the lock
is considered stale
A sequence of events is as follows
- Lock("resource")
- lockMaintenance(finds a long lived lock in this "resource")
- Owner node rebooted which now returns Expired() as true for
this "resource"
- Unlock("resource") which succeeded in quorum
- Now by this time application retried and acquired a new
Lock() on the same "resource"
- Now that we have Expired() as true from the previous call,
we proceed to purge the entry from the local lockMap, but the
local lockMap reports a different entry for the expired
UID, which results in a spurious log entry.
This PR removes this logging as this situation is an
expected scenario.
This will allow cache to consistently work for
server and gateways. Range GET requests will
be cached in the background after the request
is served from the backend.
Fixes: #7458, #7573, #6265, #6630
This is necessary to avoid unexpected connection build-up between servers,
for example in a situation where 16 servers are
talking to each other and one server allows a maximum of
15*4096 = 61440 idle connections
to be kept in the pool. Such a large pool is inefficient for
many reasons and also affects overall system resources.
This PR also reduces the idleConnection timeout from 120 secs to 60 secs.
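A sketch of the kind of transport tuning described, with illustrative
numbers:
```
package main

import (
	"net/http"
	"time"
)

// newInternodeTransport caps the idle pool per peer instead of
// letting it grow with cluster size.
func newInternodeTransport() *http.Transport {
	return &http.Transport{
		// Without a per-host cap, 16 servers each allowing 4096
		// idle conns would let ~15*4096 = 61440 sockets sit idle.
		MaxIdleConnsPerHost: 16,
		IdleConnTimeout:     60 * time.Second, // down from 120s
	}
}

func main() {
	client := &http.Client{Transport: newInternodeTransport()}
	_ = client
}
```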
errs was passed to many goroutines and all of them were allowed
to update errs if any error happened during deletion, which
can cause a data race.
This commit avoids issuing bulk delete operations in parallel
to avoid the race warning.
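For reference, the usual race-free way to keep such deletes parallel is
to give each goroutine its own slot in the errs slice, so no two
goroutines ever write the same element (a sketch, not the approach this
commit took, which serializes the deletes):
```
package main

import (
	"fmt"
	"sync"
)

func deleteAll(objects []string, del func(string) error) []error {
	errs := make([]error, len(objects)) // one slot per object
	var wg sync.WaitGroup
	for i, obj := range objects {
		wg.Add(1)
		go func(i int, obj string) {
			defer wg.Done()
			errs[i] = del(obj) // this goroutine writes only errs[i]
		}(i, obj)
	}
	wg.Wait()
	return errs
}

func main() {
	errs := deleteAll([]string{"a", "b"}, func(string) error { return nil })
	fmt.Println(errs)
}
```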
We broke files into parts previously because we had a checksum for the
entire file, to tax memory less and to have better TTFB. We don't need
to now, after the streaming-bitrot change.
Bulk delete at the storage level in the Multiple Delete Objects API
In order to accelerate bulk delete in the Multiple Delete Objects API,
a new bulk delete is introduced in the storage layer, which accepts
a list of objects to delete rather than only one. Consequently,
a new API also needs to be added to the Object API.
PR #7595 fixed part of the regression, but did not handle the
scenario, where in docker, the internal port is different from
the port on the host.
This PR modifies the regular expression such that all the
scenarios are handled.
Fixes #7619
After recent listing refactor, recursive list doesn't return empty
directories, this commit will fix the behavior and add unit tests
so it won't happen again.
When the size is unknown, auto encryption is enabled,
and compression is set to true, the putobject API was failing.
Move adding the SSE-S3 header as part of the request to before
checking if compression can be done; otherwise the size is set to -1,
which causes problems.
Currently we reload users every five minutes,
regardless of whether etcd is configured or not. But with etcd
configured we can do this asynchronously and trigger
a refresh by using the watch API.
Fixes #7515
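A sketch of watching the IAM prefix instead of polling, assuming the
go.etcd.io/etcd/clientv3 package and a hypothetical reload callback:
```
package main

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// watchIAM reloads IAM state whenever a key under the IAM prefix
// changes, instead of waking up on a five-minute timer.
func watchIAM(cli *clientv3.Client, reload func(key string)) {
	for resp := range cli.Watch(context.Background(), "config/iam/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			reload(string(ev.Kv.Key)) // PUT or DELETE under the prefix
		}
	}
}

func main() {}
```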
One user has seen this following error log:
API: CompleteMultipartUpload(bucket=vertica, object=perf-dss-v03/cc2/02596813aecd4e476d810148586c2a3300d00000013557ef_0.gt)
Time: 15:44:07 UTC 04/11/2019
RequestID: 159475EFF4DEDFFB
RemoteHost: 172.26.87.184
UserAgent: vertica-v9.1.1-5
Error: open /data/.minio.sys/tmp/100bb3ec-6c0d-4a37-8b36-65241050eb02/xl.json: file exists
1: cmd/xl-v1-metadata.go:448:cmd.writeXLMetadata()
2: cmd/xl-v1-metadata.go:501:cmd.writeUniqueXLMetadata.func1()
This can happen when CompleteMultipartUpload fails with write quorum;
the S3 client will retry (since a write quorum failure is a 500 http response),
however the second call of CompleteMultipartUpload will fail because
it doesn't truly use a random uuid under the .minio.sys/tmp/
directory but picks the upload id.
This commit fixes the behavior to choose a random uuid for generating
xl.json
Since the AssumeRole API was introduced we have had a wrong route
match which results in certain clients failing to upload objects
using multipart, because the multipart POST conflicts with the STS
POST AssumeRole API.
Write a proper matcher function which verifies the route more
appropriately such that both can co-exist.
Other listing optimizations include
- remove double sorting while filtering object entries
- improve error message when upload-id is not in quorum
- use jsoniter for full unmarshal json, instead of gjson
- remove unused code
Allow the server to start if one of the local nodes in a docker/kubernetes setup is successfully resolved
- The rule is that we need at least one local node to work. We don't need to resolve the
rest at that point.
- In a non-orchestrated setup, we fail if we do not have at least one local node up
and running.
- In an orchestrated setup (docker-swarm and kubernetes), we retry with a sleep of 5
seconds until any one local node shows up.
Fixes #6995
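A sketch of the retry loop under these assumptions (names are
illustrative):
```
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForLocalNode returns once any local host resolves; in an
// orchestrated setup it retries forever, otherwise it fails fast.
func waitForLocalNode(hosts []string, orchestrated bool) error {
	for {
		for _, h := range hosts {
			if _, err := net.LookupHost(h); err == nil {
				return nil // one resolvable local node is enough
			}
		}
		if !orchestrated {
			return fmt.Errorf("no local node could be resolved")
		}
		time.Sleep(5 * time.Second) // docker-swarm/kubernetes: retry
	}
}

func main() {
	_ = waitForLocalNode([]string{"localhost"}, false)
}
```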
In distributed mode, use REST API to acquire and manage locks instead
of RPC.
RPC has been completely removed from MinIO source.
Since we are moving from RPC to REST, we cannot use rolling upgrades as the
nodes that have not yet been upgraded cannot talk to the ones that have
been upgraded.
We expect all minio processes on all nodes to be stopped and then the
upgrade process to be completed.
Also force http1.1 for inter-node communication
Common prefixes in bucket names, if already created,
are disallowed when etcd is configured, due to the
prefix matching issue. Make sure that when we look
for a bucket we are only interested in the exact bucket
name, not the prefix.
- [x] Support bucket and regular object operations
- [x] Supports Select API on HDFS
- [x] Implement multipart API support
- [x] Completion of ListObjects support
There is no written specification about how to encode key names
when the url encoding type is passed.
However, this change will encode URLs as url.QueryEscape() does
while considering AWS S3 exceptions.
This commit adds a unit test for the vault
config verification (which covers also `IsEmpty()`).
Vault-related code is hard to test with unit tests
since a Vault service would be necessary. Therefore
this commit only adds tests for a fraction of the code.
Fixes #7409
Most hadoop distributions - hortonworks, cloudera - all
depend on aws-sdk-java 1.7.x to 1.10.x, the releases
which have bugs related to a case-sensitive check for the
ETag header. Go changes the case of the headers set
to be canonical, but only preserves them when set
through a direct map.
This fixes most compatibility issues we have had
in the past supporting older hadoop distributions.
This commit fixes a privilege escalation issue against
the S3 and web handlers. An authenticated IAM user
can:
- Read from or write to the internal '.minio.sys'
bucket by simply sending a properly signed
S3 GET or PUT request. Further, the user can
- Read from or write to the internal '.minio.sys'
bucket using the 'Upload'/'Download'/'DownloadZIP'
API by sending a "browser" request authenticated
with its JWT token.
This commit fixes another privilege escalation issue
abusing the inter-node communication of distributed
servers to obtain/modify the server configuration.
The inter-node communication is authenticated using
JWT-Tokens. Further, IAM users accessing the cluster
via the web UI also get a JWT token and the browser
will add this "user" JWT token to each request.
Now, a user can extract that JWT token and can craft
HTTP POST requests for the inter-node communication
API endpoint. Since the server accepts ANY valid
JWT token it also accepts inter-node commands from
an authenticated user such that the user can execute
arbitrary commands bypassing the IAM policy engine
and impersonate other users, change its own IAM policy
or extract the admin access/secret key.
This is fixed by only accepting "admin" JWT tokens
(tokens containing the admin access key - and therefore
were generated with the admin secret key). Consequently,
only the admin user can execute such inter-node commands.
Simplify the cmd/http package overall by removing
custom plain text v/s tls connection detection, by
migrating to go1.12 and choose minimum version
to be go1.12
Also remove all the vendored deps, since they
are not useful anymore.
A race was detected between a bytes.Buffer generated with cmd/rpc.Pool
and the http2 module. An issue has been raised with golang (https://github.com/golang/go/issues/31192).
Meanwhile, this commit disables the Pool in the RPC code and generates a
new 1kb bytes.Buffer for each RPC call.
Before this commit, nodes wait indefinitely without showing any
indicative error message when a node is started with different access
and secret keys.
This PR will show '401 Unauthorized' in this case.
Currently the message is set to the error type value.
The Message field is not used in error logs; it is used only in the case of info logs.
This PR sets the error message field to store the error type correctly.
Copying an encrypted SSEC object uploaded using the
multipart mechanism was failing, because the ETag in case of encrypted
multipart upload is not encrypted.
This PR fixes the behavior.
This fixes varying pids for server-respawns. And avoids duplicate process
creating multiple pids when the server restart signal is triggered with
service restart enabled.
Fixes #7350
It is required to set the environment variable in the case of distributed
minio. LoadCredentials is used to notify peers of the change; it will not work if
the environment variable is set, so this function will never be called.
In scenario 1
```
- bucket/object-prefix
- bucket/object-prefix/object
```
Server responds with `XMinioParentIsObject`
In scenario 2
```
- bucket/object-prefix/object
- bucket/object-prefix
```
Server responds with `XMinioObjectExistsAsDirectory`
Fixes #6566
Healing scan used to read all object parts to check the bitrot
checksum. This commit adds a quicker healing scan mode
by only checking whether parts are actually present on disks or not.
We should internally handle it when the http2 input stream has smaller
content than its content-length header.
Upstream issue reported: https://github.com/golang/go/issues/30648
This is a change which we need to handle internally until Go fixes it
correctly; until now our code didn't expect a custom error to be returned.
Move CopyObject precondition checks into GetObjectReader
in order to perform SSE-C pre-condition checks using the
last 32 bytes of the encrypted ETag rather than the decrypted
ETag.
This also necessitates moving precondition checks for
gateways to the gateway layer rather than the object handler check.
If a bucket with `Capitalized letters` is created, an `InvalidBucketName` error
will be returned.
In the case of pre-existing buckets, they will be listed.
Fixes #6938
Prevents deferred close functions from being called while still
attempting to copy reader to snappyWriter.
Reduces code duplication when compressing objects.
This change allows indefinitely running go-routines to cleanup
gracefully.
This channel is now closed at the beginning of each test so that
long-running go-routines quit and a new one is assigned.
The side effect of this change is a memory
increase, but this is a trade-off between
performance and actual memory usage.
For all practical scenarios this should be
an adequate change.
- The events will be persisted in the queueStore if `queueDir` is set.
- Else, if queueDir is not set, events are persisted in memory.
The events are replayed back when the mqtt broker is back online.
Clients like AWS SDK Java and AWS cli XML parsers are
unable to handle `\r\n` characters; to avoid these
errors send the XML header first and write white space characters
instead.
Also handle cases to avoid double WriteHeader calls
- The current implementation was spawning renewer goroutines
without waiting for the lease duration to end. Remove the vault renewer
and call vault.RenewToken directly, managing reauthentication if
the lease expired.
We should change the logic for both isObject()
and isObjectDir() leaf detection to be done
with quorum, due to how our directory navigation
works - this allows for properly deleting all
the dangling directories or objects if any.
This commit fixes a nil pointer dereference issue
that can occur when the Vault KMS returns e.g. a 404
with an empty HTTP response. The Vault client SDK
does not treat that as error and returns nil for
the error and the secret.
Further it simplifies the token renewal and
re-authentication mechanism by using a single
background go-routine.
The control-flow of Vault authentications looks
like this:
1. `authenticate()`: Initial login and start of background job
2. Background job starts a `vault.Renewer` to renew the token
3. a) If this succeeds the token gets updated
b) If this fails the background job tries to login again
4. If the login in 3b. succeeded goto 2. If it fails
goto 3b.
Currently, we were sending errors in the Select binary format,
which is incompatible with AWS S3 behavior; errors in binary
are sent after the HTTP status code is already 200 OK - i.e. it
happens during the evaluation of the record reader.
This commit increases storage REST request timeouts to 5 minutes; this includes
opening the TCP connection and sending/receiving data. This will reduce
clients receiving errors when the server is under high load.
Different gateway implementations, due to different backend
API errors, might return different unsupported errors at
our handler layer. The current code posed a problem for us because
this information was lost and we would convert it to InternalError;
in this situation all S3 clients end up retrying the request.
To avoid this unexpected situation, implement a way to support
this cleanly such that the underlying information returned
by the gateway is not lost.
Bucket metadata healing in the current code was executed multiple
times, each time for a given set. Bucket metadata, just like
objects, is hashed in accordance with its name on any given set;
to allow hashing to play a role we should let the top-level
code decide where to navigate.
Current code also had 3 bucket metadata files hardcoded, whereas
we should make it generic by listing and navigating .minio.sys
to heal such objects.
We also had another bug where, due to the isObjectDangling changes,
without pre-existing bucket metadata files we were erroneously
reporting them as grey/corrupted objects.
This PR fixes all of the above items.
This PR also adds some comments and simplifies
the code. Primary handling is done to ensure
that we make sure to honor cached buffer.
Added unit tests as well
Fixes #7141
foo.CORRUPTED should never be created, because when
multiple sets are involved we would hash the file
to a wrong location; this PR removes that code.
But it allows DeleteBucket() to work properly to delete
dangling buckets/objects. It also adds another option
to healing where a user needs to specify `--remove`
such that all dangling objects will be deleted with
user confirmation.
ListObjectParts uses xl.readXLMetaParts, which picks the first
xl meta found on any disk - an inconsistent source of information.
E.g.: in the middle of a multipart upload, one node can go offline
and come back later with outdated multipart information.
This commit fixes the computation of the Before/After healing state
for empty directories.
Issues before the commit:
- the Before state doesn't reflect the real status (no StatVol() called)
- for any MakeVol() error, healObjectDir exits directly, which is
wrong.
Currently during a heal of a bucket, if one disk is offline an empty endpoint entry is added.
Then another entry with the missing endpoint is also added.
This results in more entries than disks being added.
The code that adds the empty endpoint has been removed.
Collect historic cpu and mem stats. Also, use actual values
instead of formatted strings while returning to the client. The string
formatting prevents values from being processed by the server or
by the client without parsing them.
This change will allow the values to be processed (e.g.
computing a rolling average over the lifetime of the minio server)
and offloads the formatting to the client.
We made a change previously in #7111 which moved support
for AWS envs only to the AWS S3 endpoint. Some users requested
that this be added back for non-AWS endpoints as well, as
they require separate credentials for backend authentication
from a security point of view.
More than one client can't use the same clientID for an MQTT connection.
This causes problems in distributed deployments where the config is shared
across nodes, as each Minio instance tries to connect to MQTT using the
same clientID.
This commit removes the clientID field in config, and allows the
MQTT client to create a random clientID for each node.
- New parser written from scratch, allows easier and complete parsing
of the full S3 Select SQL syntax. Parser definition is directly
provided by the AST defined for the SQL grammar.
- Bring support to parse and interpret SQL involving JSON path
expressions; evaluation of JSON path expressions will be
subsequently added.
- Bring automatic type inference and conversion for untyped
values (e.g. CSV data).
This situation happens only in gateway nas, which supports an
etcd based `config.json` to support all FS mode features.
The issue was that we would try to migrate something which doesn't
exist when etcd is configured, which leads to inconsistent
server configs in memory.
This PR fixes this situation by properly loading the config after
initialization, performing backend disk config migration
only if etcd is not configured.
If it does happen that we have a lot of files in '.minio.sys/tmp',
minio startup might block deleting this folder. Rename and
delete in the background instead, to allow Minio to start serving
requests.
To avoid a large number of concurrent connections between minio
servers and to reduce CPU pressure, it is better to limit the number
of objects healed in parallel to number_of_CPUs.
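A sketch of bounding heal concurrency with a buffered-channel semaphore
(names are illustrative):
```
package main

import (
	"runtime"
	"sync"
)

// healObjects runs healOne on each object with at most NumCPU
// heals in flight at any moment.
func healObjects(objects []string, healOne func(string)) {
	sem := make(chan struct{}, runtime.NumCPU())
	var wg sync.WaitGroup
	for _, obj := range objects {
		wg.Add(1)
		sem <- struct{}{} // blocks once NumCPU heals are running
		go func(obj string) {
			defer wg.Done()
			defer func() { <-sem }()
			healOne(obj)
		}(obj)
	}
	wg.Wait()
}

func main() {
	healObjects([]string{"bucket/a", "bucket/b"}, func(string) {})
}
```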
Requirements like being able to run minio gateway in ec2
pointing to a Minio deployment wouldn't work properly
because IAM creds take precedence on ec2.
Add checks such that we only enable AWS-specific features
if our backend URL points to actual AWS S3, not S3-compatible
endpoints.
Returning unexpected errors can cause problems for config handling,
which is what led gateway deployments with etcd to misbehave and
stop working properly.
The deployment ID is not copied into new formats after healing the format. Although
this is not critical, since a new deployment ID will be generated and set on the
next cluster restart, it is still much better if we don't change the deployment
id of a cluster, for better tracking.
Fix the regexp matcher for special browser assets so it clashes with
less of the object namespace.
Assets should now be loaded with the /minio/ prefix. Previously,
favicon.ico (and others) could be loaded at any path matching
/minio/*/favicon.ico. This clashes with a large part of the object
namespace. With this change, /minio/favicon.ico will serve the favicon
but not /minio/mybucket/favicon.ico
Fixes #7077
This change causes `getRootCAs` to always load system-wide CAs.
Any additional custom CAs (at `certs/CA/`) are added to the certificate pool
of system CAs.
The previous behavior was incorrect since no system-wide CAs were
loaded if either there were CAs under `certs/CA` or the `certs/CA`
directory didn't exist at all.
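A sketch of the corrected loading order (error handling simplified):
```
package main

import (
	"crypto/x509"
	"io/ioutil"
	"path/filepath"
)

// getRootCAs starts from the system pool and appends custom CAs.
func getRootCAs(caDir string) (*x509.CertPool, error) {
	pool, err := x509.SystemCertPool() // system-wide CAs, always
	if err != nil {
		pool = x509.NewCertPool() // platforms without system pool support
	}
	files, err := ioutil.ReadDir(caDir)
	if err != nil {
		return pool, nil // no certs/CA directory: system CAs only
	}
	for _, f := range files {
		pem, err := ioutil.ReadFile(filepath.Join(caDir, f.Name()))
		if err != nil {
			continue
		}
		pool.AppendCertsFromPEM(pem) // add custom CA on top
	}
	return pool, nil
}

func main() {
	_, _ = getRootCAs("certs/CA")
}
```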
Also add a cross-compile script to always test cross
compilation for some well-known platforms and architectures;
we support out-of-box compilation of these platforms even
if we don't make an official release build.
This script is to avoid regressions in this area when we
add platform-dependent code.
This PR supports iam and bucket policies to have
policy variable replacements in resource and
condition key values.
For example
- ${aws:username}
- ${aws:userid}
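A minimal sketch of the substitution step (the real implementation also
has to pull these values from the request's condition context):
```
package main

import (
	"fmt"
	"strings"
)

// substitutePolicyVars replaces ${key} placeholders in a resource
// ARN with per-request values before resource matching.
func substitutePolicyVars(resource string, values map[string]string) string {
	for key, val := range values {
		resource = strings.Replace(resource, "${"+key+"}", val, -1)
	}
	return resource
}

func main() {
	res := substitutePolicyVars(
		"arn:aws:s3:::mybucket/${aws:username}/*",
		map[string]string{"aws:username": "johndoe"},
	)
	fmt.Println(res) // arn:aws:s3:::mybucket/johndoe/*
}
```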
Health checking programs very frequently use /minio/health/live
to check health, hence we can avoid doing StorageInfo() and
ListBuckets() for FS/Erasure backend.
* Use 0-byte file for bitrot verification of whole-file-bitrot files
Also pass the right checksum information for bitrot verification
* Copy xlMeta info from latest meta except []checksums and []Parts while healing
Simplify parallelReader.Read(), which also fixes the previous
implementation where it returned before all the parallel
reading go-routines had terminated, which caused race conditions.
When auto-encryption is turned on, we pro-actively add SSEHeader
for all PUT, POST operations. This is unusual for V2 signature
calculation because V2 signature doesn't have a pre-defined set
of signed headers in the request like V4 signature. According to
V2 we should canonicalize all incoming supported HTTP headers.
Make sure to validate signatures before we mutate http headers
Deprecate the use of Admin Peers concept and migrate all peer
communication to Notification subsystem. This finally allows
for a common subsystem for all peer notification in case of
distributed server deployments.
Before this change the CopyObjectHandler and the CopyObjectPartHandler
both looked for a `versionId` parameter on the `X-Amz-Copy-Source` URL
for the version of the object to be copied on the URL unescaped version
of the header. This meant that object names that had question marks
in them were truncated after the question mark, so that files with `?` in their
names could not be server-side copied.
After this change the URL unescaping is done during the parsing of the
`versionId` parameter which fixes the problem.
This change also introduces the same logic for the
`X-Amz-Copy-Source-Version-Id` header field which was previously
ignored, namely returning an error if it is present and not `null`
since minio does not currently support versions.
S3 Docs:
- https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html
- https://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html
When source is encrypted multipart object and the parts are not
evenly divisible by DARE package block size, target encrypted size
will not necessarily be the same as encrypted source object.
This commit removes old code preventing PUT requests with '/' as a path,
because this is not needed anymore after the introduction of the virtual
host style in Minio server code.
'PUT /' when global domain is not configured already returns 405 Method
Not Allowed http error.
This PR adds pass-through, single encryption at gateway and double
encryption support (gateway encryption with pass through of SSE
headers to backend).
If KMS is set up (either with Vault as KMS or using
MINIO_SSE_MASTER_KEY), the gateway will automatically perform
single encryption. If MINIO_GATEWAY_SSE is set up in addition to
Vault KMS, double encryption is performed. When neither KMS nor
MINIO_GATEWAY_SSE is set, do a pass-through to the backend.
When double encryption is specified, MINIO_GATEWAY_SSE can be set to
"C" for SSE-C encryption at gateway and backend, "S3" for SSE-S3
encryption at gateway/backend or both to support more than one option.
Fixes #6323, #6696
This is part of implementation for mc admin health command. The
ServerDrivesPerfInfo() admin API returns read and write speed
information for all the drives (local and remote) in a given Minio
server deployment.
Part of minio/mc#2606
It can happen with erroneous clients which do not send the `Host:`
header until 4k worth of header bytes have been read. This can lead
to the Peek() method of bufio failing with ErrBufferFull.
To avoid this we should make sure that the Peek buffer is as large as
our maxHeaderBytes count.
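A sketch of sizing the buffered reader to the header limit (the
constant here is illustrative, not the server's actual value):
```
package main

import (
	"bufio"
	"net"
)

const maxHeaderBytes = 1 << 20 // illustrative; match the server's limit

// newPeekableConn wraps a connection so Peek() can always see a
// full header block; a default 4KiB buffer fails with
// ErrBufferFull when clients send more than 4KiB of headers.
func newPeekableConn(c net.Conn) *bufio.Reader {
	return bufio.NewReaderSize(c, maxHeaderBytes)
}

func main() {
	server, client := net.Pipe()
	defer server.Close()
	defer client.Close()
	_ = newPeekableConn(server)
}
```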
minio-java tests were failing in multiple places when
auto encryption was turned on; handle all the cases properly.
This PR fixes
- CopyObject should decrypt ETag before it does if-match
- CopyObject should not try to preserve metadata of source
when rotating keys, unless explicitly asked by the user.
- We should not try to decrypt Compressed object etag, the
potential case was if user sets encryption headers along
with compression enabled.
Especially in gateway, IAM admin APIs are not enabled
if etcd is not enabled; we should enable the admin API regardless,
but only enable the IAM and Config APIs with etcd configured.
By default when we listen on all interfaces, we print all the
endpoints that are local to all interfaces, including IPv6
addresses. Remove IPv6 addresses from the endpoint list to be
printed unless explicitly specified with '--address'.
This commit adds an auto-encryption feature which allows
the Minio operator to ensure that uploaded objects are
always encrypted.
This change adds the `autoEncryption` configuration option
as part of the KMS configuration and the ENV. variable
`MINIO_SSE_AUTO_ENCRYPTION:{on,off}`.
It also updates the KMS documentation according to the
changes.
Fixes #6502
Currently we use GetObject to check if we are allowed to list;
this might be a security problem, since there are many users now
who actively disable a publicly readable listing, and anyone who
can guess the browser URL can list the objects.
This PR turns off this behavior and provides a more expected way
based on the policies.
This PR also additionally improves the Download() object
implementation to use a more streamlined code.
These are precursor changes to facilitate federation and web
identity support in browser.
One user reported having discovered the following error:
API: SYSTEM()
Time: 20:06:17 UTC 12/06/2018
Error: xml: encoding "US-ASCII" declared but Decoder.CharsetReader is nil
1: cmd/handler-utils.go:43:cmd.parseLocationConstraint()
2: cmd/auth-handler.go:250:cmd.checkRequestAuthType()
3: cmd/bucket-handlers.go:411:cmd.objectAPIHandlers.PutBucketHandler()
4: cmd/api-router.go:100:cmd.(objectAPIHandlers).PutBucketHandler-fm()
5: net/http/server.go:1947:http.HandlerFunc.ServeHTTP()
Hence, adding support for different xml encodings. Although there
is no clear specification about it, even setting "GARBAGE" as an xml
encoding won't change the behavior of AWS; hence the encoding seems
to be ignored.
This commit will follow that behavior and will ignore the encoding field,
considering all xml as utf8 encoded.
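A sketch of how the standard library supports this: an xml.Decoder with
a CharsetReader that hands back the input untouched, so any declared
charset is read as UTF-8 (the struct here is illustrative):
```
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

type locationConstraint struct {
	Location string `xml:",chardata"`
}

func parseLocation(r io.Reader) (string, error) {
	d := xml.NewDecoder(r)
	// Without this hook, a non-UTF-8 declaration such as US-ASCII
	// fails with "Decoder.CharsetReader is nil".
	d.CharsetReader = func(charset string, input io.Reader) (io.Reader, error) {
		return input, nil // ignore the label, assume UTF-8
	}
	var lc locationConstraint
	if err := d.Decode(&lc); err != nil {
		return "", err
	}
	return lc.Location, nil
}

func main() {
	body := `<?xml version="1.0" encoding="US-ASCII"?><LocationConstraint>us-east-1</LocationConstraint>`
	fmt.Println(parseLocation(strings.NewReader(body)))
}
```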
This refactors the vault configuration by moving the
vault-related environment variables to `environment.go`
(Other ENV should follow in the future to have a central
place for adding / handling ENV instead of magic constants
and handling across different files)
Further this commit adds master-key SSE-S3 support.
The operator can specify a SSE-S3 master key using
`MINIO_SSE_MASTER_KEY` which will be used as master key
to derive and encrypt per-object keys for SSE-S3
requests.
This commit is also a pre-condition for SSE-S3
auto-encryption support.
Fixes #6329
This PR implements one of the pending items in issue #6286:
in the S3 API a user can request CSV output for a JSON document
and JSON output for a CSV document. This PR refactors
the code a little bit to bring in this feature.
guessIsRPCReq() considers all POST requests as RPC but doesn't
check if it is an object operation API or not, which actually
confuses the bucket forwarder handler when it receives a new multipart
upload API call, which is a POST http request.
Due to this bug, users with a federated setup are not able to
upload a multipart object using an endpoint which doesn't actually
contain the specified bucket that will store the object.
Hence this commit fixes the described issue.
Registering the notFound handler more than once causes
gorilla mux to return an error for all registered paths
greater than 8. This might be a bug in gorilla/mux,
but we shouldn't be using it this way. The NotFound handler
should only be registered once per root router.
Fixes #6915
When MINIO_PUBLIC_IPS is not specified and no endpoints are passed
as arguments, fall back to the addresses of non loop-back interfaces.
This is useful so users can avoid setting MINIO_PUBLIC_IPS in docker
or orchestration scripts, since users naturally set up an internal
network that connects all instances.
This commit renames the env variable for vault namespaces
such that it begins with `MINIO_SSE_`. This is the prefix
for all Minio SSE related env. variables (like KMS).
clientID must be a unique `UUID` for each connection. Now, the
server generates it rather than taking it from the config.
Removing it as it is non-beneficial right now.
Fixes #6364
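A sketch of generating such a per-connection ID with the standard
library only (a hand-rolled v4 UUID; a uuid package would work equally
well):
```
package main

import (
	"crypto/rand"
	"fmt"
)

// newClientID returns a random RFC 4122 version-4 UUID string,
// unique per node/connection instead of shared via config.
func newClientID() string {
	var u [16]byte
	rand.Read(u[:])
	u[6] = (u[6] & 0x0f) | 0x40 // version 4
	u[8] = (u[8] & 0x3f) | 0x80 // variant 10
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	fmt.Println(newClientID())
}
```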
When migrating configs it happens often that some
servers fail to start due to version mismatch etc.
Hold a transaction lock such that all servers get
serialized.
This can create inconsistencies, i.e. Parts might have
a lesser number of parts than ChecksumInfos. This will
result in the object being unreadable.
This PR also allows for deleting previously created
corrupted objects.
globalMinioPort is used in federation, which stores the address
and the port number of the server hosting the specified bucket;
the lookup uses globalMinioPort, but it is not set at
startup in gateway mode.
This commit fixes the behavior.
Calling /minio/prometheus/metrics calls xlSets.StorageInfo(), which creates a new
storage REST client and closes it. However, currently, closing does nothing
to the underlying opened http client.
This commit introduces closing behavior by calling CloseIdleConnections,
provided by http.Transport, when the REST client is closed.
This refactor brings a change which allows
targets to be added in a cleaner way and also
audit is now moved out.
This PR also simplifies logger dependency for auditing
The current code triggers a timeout to clean up a heal seq from
healSeqMap, but we don't know whether or not the user launched a new
healing sequence with the same path.
Add endTime to the healSequence struct and add a periodic heal-sequence
cleaner to remove heal sequences only if they are older than
10 minutes.
Rolling update doesn't work properly because the Storage REST API has
a new API WriteAll() but without an API version number increase.
Also be sure to return 404 for unknown http paths.
To conform with the AWS S3 spec on ETag for SSE-S3 encrypted objects,
encrypt the client-sent MD5Sum and store it on the backend as ETag. Extend
this behavior to SSE-C encrypted objects.
This improves the performance of certain queries dramatically,
such as 'count(*)' etc.
Without this PR
```
~ time mc select --query "select count(*) from S3Object" myminio/sjm-airlines/star2000.csv.gz
2173762
real 0m42.464s
user 0m0.071s
sys 0m0.010s
```
With this PR
```
~ time mc select --query "select count(*) from S3Object" myminio/sjm-airlines/star2000.csv.gz
2173762
real 0m17.603s
user 0m0.093s
sys 0m0.008s
```
Almost a 250% improvement in performance. This PR avoids a lot of type
conversions and instead relies on raw sequences of data and interprets
them lazily.
```
benchcmp old new
benchmark old ns/op new ns/op delta
BenchmarkSQLAggregate_100K-4 551213 259782 -52.87%
BenchmarkSQLAggregate_1M-4 6981901985 2432413729 -65.16%
BenchmarkSQLAggregate_2M-4 13511978488 4536903552 -66.42%
BenchmarkSQLAggregate_10M-4 68427084908 23266283336 -66.00%
benchmark old allocs new allocs delta
BenchmarkSQLAggregate_100K-4 2366 485 -79.50%
BenchmarkSQLAggregate_1M-4 47455492 21462860 -54.77%
BenchmarkSQLAggregate_2M-4 95163637 43110771 -54.70%
BenchmarkSQLAggregate_10M-4 476959550 216906510 -54.52%
benchmark old bytes new bytes delta
BenchmarkSQLAggregate_100K-4 1233079 1086024 -11.93%
BenchmarkSQLAggregate_1M-4 2607984120 557038536 -78.64%
BenchmarkSQLAggregate_2M-4 5254103616 1128149168 -78.53%
BenchmarkSQLAggregate_10M-4 26443524872 5722715992 -78.36%
```
xl.json is the source of truth for all erasure
coded objects, without which we won't be able to
read the objects properly. This PR enables sync
mode for writing `xl.json` such that all writes hit
the disk and are persistent under situations such
as abrupt power failures on servers running Minio.
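A sketch of what sync-mode writing amounts to at the os level (the path
and helper name are illustrative, and the exact flag choice is an
assumption):
```
package main

import "os"

// writeXLMetaSync writes metadata with O_SYNC so the data reaches
// stable storage before Write returns, surviving abrupt power loss.
func writeXLMetaSync(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC|os.O_SYNC, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(data)
	return err
}

func main() {
	_ = writeXLMetaSync("/tmp/xl.json", []byte(`{"version":"1.0.1"}`))
}
```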
This change will allow users to enter the endpoint of the
storage account if it belongs to a different Azure
cloud environment, such as US gov cloud.
e.g:
`MINIO_ACCESS_KEY=testaccount \
MINIO_SECRET_KEY=accountsecretkey \
minio gateway azure https://testaccount.blob.usgovcloudapi.net`
In many situations while testing we encounter
ErrInternalError; to reduce logging we have
removed logging from quite a few places, which
is acceptable, but when ErrInternalError occurs
we should have a facility to log the corresponding
error. This helps to debug the Minio server.
Multipart object final size is not a contiguous
encrypted object representation, so trying to
decrypt this size will lead to an error in some
cases. The multipart object should be detected first
and then decoded with its respective parts instead.
This PR handles this situation properly, added a
test as well to detect these in the future.
This commit adds key-rotation for SSE-S3 objects.
To execute a key-rotation a SSE-S3 client must
- specify the `X-Amz-Server-Side-Encryption: AES256` header
for the destination
- The source == destination for the COPY operation.
Fixes #6754
This PR adds support
- Request query params
- Request headers
- Response headers
AuditLogEntry is exported and versioned as well
starting with this PR.
On Windows, an erasure coding setup like
```
~ minio server V:\ W:\ X:\ Z:\
```
is not possible due to NTFS creating a couple of
hidden folders; this PR allows minio to use
the entire drive.
The Execute method in the s3Select package makes a response.WriteHeader call.
Don't call it again in the SelectObjectContentHandler function in case of
an error in the s3Select.Execute call.
On a heavily loaded server, getBucketInfo() becomes slow;
one can easily observe that deleting an object causes many
additional network calls.
This PR is to let the underlying call return the actual
error and write it back to the client.
This PR supports two models for etcd certs
- Client-to-server transport security with HTTPS
- Client-to-server authentication with HTTPS client certificates
Comparing endpoints blindly, without checking whether they
are local, is wrong, because the actual drive
for a local disk is always going to provide just
the path without the HTTP endpoint.
Add code such that this is taken care of properly in
all situations. Without this PR, HealBucket() would
wrongly conclude that the healing doesn't have quorum
when a larger number of local disks is involved.
Fixes #6703
Current master didn't support CopyObjectPart when the source
was encrypted; this PR fixes this by allowing range
CopySource decryption at different sequence numbers.
Fixes #6698
This PR fixes
- The target object should be compressed even if the
source object is not compressed.
- The actual size for an encrypted object should be the
`decryptedSize`
This commit fixes a wrong assignment to `actualPartSize`.
The `actualPartSize` for an encrypted src object is not `srcInfo.Size`
because that's the encrypted object size which is larger than the
actual object size. So the actual part size for an encrypted
object is the decrypted size of `srcInfo.Size`.
Without this fix we have room for two different types of
errors.
- Source is encrypted and we didn't provide any source encryption keys:
this results in an IncompleteBody error being returned to the client,
since the source is encrypted and we gave the reader as-is to the object
layer, which was of a decrypted value, leading to "IncompleteBody".
- Source is not encrypted and we provided source encryption keys:
this results in a corrupted object on the destination, which is
considered encrypted but cannot be read by the server and returns
the following error.
```
<Error><Code>XMinioObjectTampered</Code><Message>The requested object
was modified and may be compromised</Message><Resource>/id-platform-gamma/
</Resource><RequestId>155EDC3E86BFD4DA</RequestId><HostId>3L137</HostId>
</Error>
```
This commit fixes a regression introduced in f187a16962;
the regression returned AccessDenied when a client is trying to create an empty
directory on an existing prefix, though it should return 200 OK to be as close as
possible to the S3 specification.
Since refactoring to the GetObjectNInfo style, there are many cases
where an 'i/o closed pipe' error is printed, like downloading an object
with a wrong encryption key. This PR removes the log.
This commit moves the check that SSE-C requests
must be made over TLS into a generic HTTP handler.
Since the HTTP server uses custom TCP connection handling
it is not possible to use `http.Request.TLS` to check
for TLS connections. So using `globalIsSSL` is the only
option to detect whether the request is made over TLS.
By extracting this check into a separate handler it's possible
to refactor other parts of the SSE handling code further.
This commit adds two functions for sealing/unsealing the
etag (a.k.a. content MD5) in case of SSE single-part upload.
Sealing the ETag is necessary in case of SSE-S3 to preserve
the security guarantees. In case of SSE-S3 AWS returns the
content-MD5 of the plaintext object as ETag. However, we
must not store the MD5 of the plaintext for encrypted objects.
Otherwise it becomes possible for an attacker to detect
equal/non-equal encrypted objects. Therefore we encrypt
the ETag before storing on the backend. But we only need
to encrypt the ETag (content-MD5) if the client sent it -
otherwise the client cannot verify it anyway.
The CopyObject handler forgot to remove the multipart encryption flag in metadata
when the source is an encrypted multipart object and the target is also encrypted
but a single-part object.
This PR also simplifies the code to facilitate review.
This PR brings an additional logger implementation
called AuditLog which logs to http targets
The intention is to use AuditLog to log all incoming
requests, this is used as a mechanism by external log
collection entities for processing Minio requests.
This PR introduces two new features
- AWS STS compatible STS API named AssumeRoleWithClientGrants
```
POST /?Action=AssumeRoleWithClientGrants&Token=<jwt>
```
This API endpoint returns temporary access credentials, access
tokens signature types supported by this API
- RSA keys
- ECDSA keys
Fetches the required public key from the JWKS endpoints, provides
them as rsa or ecdsa public keys.
- External policy engine support, in this case OPA policy engine
- Credentials are stored on disks
- Only require len(disks)/2 to initialize the cluster
- Fix checking of read/write quorum in subsystems init
- Add retry mechanism in policy and notification to avoid aborting in case of read/write quorums errors
In the xl.PutObjectPart call, prepareFile detected an error when the storage
is exhausted, but we were returning the wrong error.
With this commit, users can see the correct error message when their disks
become full.
Simplify the logic of using rename() in xl. Currently, renaming
doesn't require the source object/dir to exist on at least
read quorum disks; since there is no apparent reason for it
to do so as a general rule, this commit just simplifies the
logic to avoid possible inconsistency in the backend in the future.
This is a major regression introduced in this commit:
```
ce02ab613d is the first bad commit
commit ce02ab613d
Author: Krishna Srinivas <634494+krishnasrinivas@users.noreply.github.com>
Date: Mon Aug 6 15:14:08 2018 -0700
Simplify erasure code by separating bitrot from erasure code (#5959)
:040000 040000 794f58d82ad2201ebfc8 M cmd
```
This affects all distributed server deployments since this commit.
All the following releases are affected
- RELEASE.2018-09-25T21-34-43Z
- RELEASE.2018-09-12T18-49-56Z
- RELEASE.2018-09-11T01-39-21Z
- RELEASE.2018-09-01T00-38-25Z
- RELEASE.2018-08-25T01-56-38Z
- RELEASE.2018-08-21T00-37-20Z
- RELEASE.2018-08-18T03-49-57Z
Thanks to Anis for reproducing the issue
Without this PR the minio server writes an erroneous
response to clients on idle connections, which ends
up printing the following message
```
Unsolicited response received on idle HTTP channel
```
This PR avoids sending responses on idle connections,
i.e. routine network errors.
In XL PutObject & CompleteMultipartUpload, the existing object is renamed
to the temporary directory before checking if worm is enabled or not.
Most of the time, this doesn't cause an issue unless two uploads to the
same location occurs at the same time. Since there is no locking in object
handlers, both uploads will reach XL layer. The second client acquiring
write lock in put object or complete upload in XL will rename the object
to the temporary directory before doing the check and returning the error (wrong!).
This commit fixes the behavior: no rename to the temporary directory if
worm is enabled.
Current implementation simply uses all the memory locally
and crashes when a large upload is initiated using Minio
browser UI.
This PR uploads stream in blocks and finally commits the blocks
allowing for low memory footprint while uploading large objects
through Minio browser UI.
This PR also adds ETag compatibility for single PUT operations.
Fixes#6542 Fixes#6550
This is to ensure that we heal all entries in the config/
prefix; we will have IAM and STS related files which
are being introduced in the #6168 PR.
This is a change to ensure that we heal all of them
properly, not just `config.json`
go test shows the following warning:
```
WARNING: DATA RACE
Write at 0x000002909e18 by goroutine 276:
github.com/minio/minio/cmd.testAdminCmdRunnerSignalService()
/home/travis/gopath/src/github.com/minio/minio/cmd/admin-rpc_test.go:44 +0x94
Previous read at 0x000002909e18 by goroutine 194:
github.com/minio/minio/cmd.testServiceSignalReceiver()
/home/travis/gopath/src/github.com/minio/minio/cmd/admin-handlers_test.go:467 +0x70
```
The reason for this data race is that some admin tests do not wait for the goroutines
they created to exit properly, which triggers the race detector.
When download profiling data API fails to gather profiling data
from all nodes for any reason (including profiler not enabled),
return 400 http code with the appropriate json message.
This commit adds two functions for removing
confidential information - like SSE-C keys -
from HTTP headers / object metadata.
This creates a central point grouping all
headers/entries which must be filtered / removed.
See also https://github.com/minio/minio/pull/6489#discussion_r219797993
of #6489
The new call combines GetObjectInfo and GetObject, and returns an
object with a ReadCloser interface.
Also adds a number of end-to-end encryption tests at the handler
level.
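As a rough sketch (the type and field names here are illustrative, not the server's exact definitions), such a combined call can return the body bundled with the metadata:
```
package example

import "io"

// ObjectInfo is a stand-in for the object metadata returned by the call;
// the real struct carries many more fields (etag, mod-time, metadata, ...).
type ObjectInfo struct {
	Bucket string
	Name   string
	Size   int64
}

// GetObjectReader bundles the metadata with a streaming body so both are
// produced under a single lock; the caller must Close the body when done.
type GetObjectReader struct {
	io.ReadCloser
	ObjInfo ObjectInfo
}
```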
Allow minio s3 gateway to use aws environment credentials,
IAM instance credentials, or AWS file credentials.
If AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set,
or minio is running on an EC2 instance with IAM instance credentials,
or there is a file $HOME/.aws/credentials, minio running as an S3
gateway will authenticate with AWS S3 using one of those credentials.
The lookup order:
1. AWS environment variables
2. IAM instance credentials
3. $HOME/.aws/credentials
4. minio environment variables
To authenticate with the minio gateway, you will always use the
minio environment variables MINIO_ACCESS_KEY and MINIO_SECRET_KEY.
Two handlers are added to the admin API to enable profiling and disable
profiling of a server in standalone mode, or all nodes in
distributed mode.
/minio/admin/profiling/start/{cpu,block,mem}:
- Start profiling and return starting JSON results, e.g. one
node is offline.
/minio/admin/profiling/download:
- Stop the on-going profiling task
- Stream a zip file which contains all profiling files that can
be later inspected by go tool pprof
The test TestServerTLSCiphers seems to fail sometimes for
no obvious reason. Actually the test is not needed
(as unit test) since minio/mint tests the server's TLS ciphers
as part of its security tests.
Fixes#5977
Currently, one node in a cluster can fail to boot with the following error message:
```
ERROR Unable to initialize config system: Storage resources are insufficient for the write operation
```
This happens when disks are formatted, read quorum is met but write
quorum is not met. In checkServerConfig(), an insufficient read quorum
error is replaced by errConfigNotFound, so the code will generate a
new config json and try to save it, but it will fail because write
quorum is not met.
Replacing read quorum with errConfigNotFound is also wrong because it
can lead, in rare cases, to overwriting the config set by the user.
So, this commit adds a retry mechanism in configuration initialization
to retry only with read or write quorum errors.
This commit will also fix the following cases:
- Read quorum is lost just after the initialization of the object layer.
- Write quorum not met when upgrading configuration version.
ReadFile RPC input argument has been changed in commit a8f5939452959d27674560c6b803daa9,
however, RPC doesn't detect such a change when it calls other nodes with older versions.
Hence, bumping RPC version.
Fixes#6458
It was expected that in gateway mode we do not know
the backend types, but the NAS gateway is
an extension of FS mode (standalone). This leads to
an issue in LivenessCheckHandler(), which would perpetually
return 503 and affect all Kubernetes and OpenShift
deployments of the NAS gateway.
This commit fixes an AWS S3 incompatibility issue.
The AccessKeyID may contain one or more `/` which caused
the server to interpret parts of the AccessKeyID as
other `X-Amz-Credential` parameters (like date, region, ...)
This commit fixes this by allowing 5 or more
`X-Amz-Credential` parameter strings and only interpreting
the last 5.
Fixes#6443
This commit will print connection failures to disks on other nodes
after 5 retries. It is useful for users to understand why the
distributed cluster fails to boot up.
Slightly enhance the error message that is shown
when access & secret keys are not specified in the
environment when running Minio in gateway and server mode.
This commit also removes a redundant check of access/secret keys.
This commit fixes the Manta gateway client creation flow. We now affix
the endpoint scheme with endpoint URL while creating the Manta client
for gateway.
Also add steps in Manta gateway docs on how to run with custom Manta
endpoint.
Fixes#6408
This commit fixes a regression in the server regarding
handling SSE requests with wrong SSE-C keys.
The server now returns an AWS S3 compatible API error (access denied)
in case the SSE-C key does not match the key used during upload.
Fixes#6431
This PR adds two new admin APIs in Minio server and madmin package:
- GetConfigKeys(keys []string) ([]byte, error)
- SetConfigKeys(params map[string]string) (err error)
A key is a path in the Minio configuration file (e.g. notify.webhook.1).
The user will always send a string value when setting it in the config file;
the API will know how to convert the value to the appropriate type. The user
is also able to set raw JSON.
Before setting a new config, Minio will validate all fields and try to connect
to notification targets if available.
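A hedged usage sketch with the madmin package; the endpoint, credentials and config key are placeholders, and the constructor signature is assumed from the madmin package of that era:
```
package main

import (
	"fmt"
	"log"

	"github.com/minio/minio/pkg/madmin"
)

func main() {
	// Connect to the Minio admin API (endpoint and credentials are placeholders).
	admClient, err := madmin.New("localhost:9000", "ACCESS-KEY", "SECRET-KEY", false)
	if err != nil {
		log.Fatalln(err)
	}

	// Fetch only the webhook notification sub-tree of the config.
	buf, err := admClient.GetConfigKeys([]string{"notify.webhook.1"})
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println(string(buf))

	// Update a single leaf value; the server converts the string
	// to the appropriate type before saving.
	if err := admClient.SetConfigKeys(map[string]string{
		"notify.webhook.1.enable": "true",
	}); err != nil {
		log.Fatalln(err)
	}
}
```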
Currently the Go http connection pool was not being properly
utilized, leading to degraded performance as the number
of concurrent requests increased.
As recommended by the Go implementation, we have to drain the
response body and close it.
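A minimal sketch of that drain-then-close pattern (the helper name is illustrative):
```
package example

import (
	"io"
	"io/ioutil"
	"net/http"
)

// fetch drains and closes the response body so the underlying TCP
// connection can go back into the http.Transport idle pool for re-use
// instead of being torn down.
func fetch(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Draining to completion is what makes the connection reusable.
	_, err = io.Copy(ioutil.Discard, resp.Body)
	return err
}
```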
Removing an empty directory was not working because xl.DeleteObject()
only checked whether the passed prefix is an actual object; it
should also check whether it is an empty directory.
The soMaxConn value is 128 on almost all Linux systems.
This value is too low for Minio when used
against large concurrent workloads, e.g. Spark applications;
this causes a sort of SYN flooding observed by the kernel.
To allow for a larger backlog, increase this value to 2048.
With this value we no longer see SYN flooding
kernel messages.
ListMultipartUploads implementation is meant for docker-registry
use-case only. It lists only the first upload with a prefix matching
the object being uploaded.
* Revert "Encrypted reader wrapped in NewGetObjectReader should be closed (#6383)"
This reverts commit 53a0bbeb5b.
* Revert "Change SelectAPI to use new GetObjectNInfo API (#6373)"
This reverts commit 5b05df215a.
* Revert "Implement GetObjectNInfo object layer call (#6290)"
This reverts commit e6d740ce09.
An issue was reproduced when the minio-js client functional
tests set lower case http headers; in our current
master branch we specifically look for the canonical Host header,
which is not necessarily what all http clients send.
This leads to a perpetual hang on the *net.Conn*.
This PR fixes the regression caused by #6206 by handling the
headers case-insensitively.
This combines calling GetObjectInfo and GetObject while returning an
io.ReadCloser for the object's body. This allows the two operations to
be under a single lock, fixing a race between getting object info and
reading the object body.
One typo introduced in a recent commit miscalculates if worm and browser
are enabled or not. A simple test is also added to detect this issue
in the future if it ever happens again.
In current master when you do `mc watch` you can see a
dynamic ARN being listed which exposes the remote IP as well
```
mc watch play/airlines
```
On another terminal
```
mc admin info play
● play.minio.io:9000
Uptime : online since 11 hours ago
Version : 2018-08-22T07:50:45Z
Region :
SQS ARNs : arn:minio:sqs::httpclient+51c39c3f-131d-42d9-b212-c5eb1450b9ee+73.222.245.195:33408
Stats : Incoming 30GiB, Outgoing 7.6GiB
Storage : Used 7.7GiB
```
SQS ARNs listed as part of ServerInfo should be only external targets,
since listing an ARN here is not useful and it cannot be re-purposed in
any manner.
This PR fixes this issue by filtering out httpclient from the ARN list.
This is a regression introduced in #5294 (commit 0e4431725c).
This commit adds error handling for SSE-KMS requests to
HEAD, GET, PUT and COPY operations. The server responds
with `not implemented` if a client sends a SSE-KMS
request.
This package provides a customizable TCP net.Listener with various
performance-related options:
* SO_REUSEPORT. This option allows linear scaling server performance
on multi-CPU servers.
See https://www.nginx.com/blog/socket-sharding-nginx-release-1-9-1/ for details.
* TCP_DEFER_ACCEPT. This option expects the server to read from the accepted
connection before writing to it.
* TCP_FASTOPEN. See https://lwn.net/Articles/508865/ for details.
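A minimal sketch of how SO_REUSEPORT can be applied to a TCP listener via net.ListenConfig and golang.org/x/sys/unix on Linux; the real package wires up more options than shown here:
```
package example

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort opens a TCP listener with SO_REUSEPORT set, so multiple
// listeners can bind the same port and the kernel load-balances accepted
// connections between them.
func listenReusePort(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			err := c.Control(func(fd uintptr) {
				serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			})
			if err != nil {
				return err
			}
			return serr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}
```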
Add support for sse-s3 encryption with vault as KMS.
Also refactoring code to make use of headers and functions defined in
crypto package and clean up duplicated code.
This PR fixes a regression introduced in 8eb838bf91
where a hashing technique was used on prefixes to get the right set
for the operation. This is not correct, since prefixes and
their corresponding keys might hash to different values depending
on the key length.
For prefixes/directories we should look everywhere to support proper
quorum based listing.
Fixes#6293
This PR is the first set of changes to move the config
to the backend. The changes use the existing `config.json` and
allow it to be migrated such that we can save it on the
backend disks.
In future releases, we will slowly migrate out of the
current architecture.
Fixes#6182
Modified the LogIf function to log only if the error passed
is not on the ignored errors list.
Currently, only disk not found error is added to the list.
Added a new function in logger package called LogAlwaysIf,
which will print on any error.
Fixes#5997
This commit adds support for detecting SSE-KMS headers.
The server should be able to detect SSE-KMS headers to
at least fail such S3 requests with not implemented.
When an S3 client sends a GET Object request with a range header, a 206 http
code is returned indicating success; however, the call to the object
layer's GetObject() inside the handler can return an error, which would lead
to writing an XML error message. That is obviously wrong, since
we already sent the 206 http code. So in that case we just stop sending
data to the S3 client; the client can still detect the
error by comparing the received data with the Content-Length header
in the GET Object response.
ANSI colors do not work on dumb terminals, in situations
when minio is running as a service under systemd.
This PR ensures we turn off color in those situations.
This is to avoid serializing RPC contention on ongoing
parallel operations, the blocking profile indicating
all calls were being serialized through setRetryTicker.
This commit adds the crypto.* errors to the
`toAPIErrorCode` switch. Further this commit adds an S3
API error code returned whenever the client specifies an
SSE-S3 request with an invalid algorithm parameter.
Fixes#6238
globalPolicySys used to be initialized in fs/xl layer. The referenced
commit moved this logic to server/gateway initialization, but a check
to avoid double initialization prevented globalPolicySys from being loaded
from disk for NAS.
Fixes a regression from commit be1700f595.
No locks are ever left in memory, we also
have a periodic interval of clearing stale locks
anyways. The lock instrumentation was not complete
and was seldom used.
Deprecate this for now and bring it back later if
it is really needed. This also in-turn seems to improve
performance slightly.
The POST mime/multipart upload style can have an optional filename value,
which leads to implementation issues in Go releases in their
standard mime/multipart library.
When `filename` doesn't exist Go doesn't update `form.File` which
we rely on to extract the incoming file data, strangely when `filename`
is not specified this data is buffered in memory and is now part of
`form.Value` instead of `form.File` which creates an inconsistent
behavior.
This PR tries to fix this in our code for the time being, but ideal PR
would be to fix the upstream mime/multipart library to handle the
above situation consistently.
This commit adds a `fmt.Stringer` implementation for
SSE-S3 and SSE-C. The string representation is the
domain used for object key sealing.
See: `ObjectKey.Seal(...)` and `ObjectKey.Unseal(...)`
Continuing from PR 157ed65c35
Our posix.go implementation did not handle I/O errors
properly on the disks; this led to situations where
top-level callers such as ListObjects might return early
without even verifying all the available disks.
This commit tries to address this for Kubernetes and drbd/nbd based
persistent volumes, which can disconnect under load and
result in situations where disks return I/O errors.
This commit also simplifies the listing operation: listing
never returns an error. We can do this since we pretty
much ignore most of the errors anyway. When objects are
accessed directly we return proper errors.
* crypto: add support for parsing SSE-C/SSE-S3 metadata
This commit adds support for detecting and parsing
SSE-C/SSE-S3 object metadata. With the `IsEncrypted`
functions it is possible to determine whether an object
seems to be encrypted. With the `ParseMetadata` functions
it is possible to validate such metadata and extract the
SSE-C/SSE-S3 related values.
It also fixes some naming issues.
* crypto: add functions for creating SSE object metadata
This commit adds functions for creating SSE-S3 and
SSE-C metadata. It also adds a `CreateMultipartMetadata`
for creating multipart metadata.
For all functions unit tests are included.
Since a `pwrite` like implementation would
require more complex code than the background append
implementation, it is better to keep the current code
as is and not implement `pwrite` based functionality.
Closes#4881
Healthcheck handler in current implementation was
performing ListBuckets() to check for liveness of Minio
service. ListBuckets() implementation on the other hand
doesn't do quorum based listing, and if one of the disks
returned an error, e.g. an I/O error, it would lead to Kubernetes
taking the Minio pod down prematurely even if the disk
is not local to that Minio server.
The reason is ListBuckets() call cannot be trusted to
provide us the valid information that we need, Minio is a
clustered application which is designed to handle disk
failures. Error on one of the disks doesn't mean the pod
should become fully non-operational.
This PR attempts to fix this by only checking for alive
disks which are local to each setup and also by simply
performing a Stat() operation, if the Stat() returned
error on all disks local to a particular server then
we can let kubernetes safely take it down, until then
we should be operational.
The current code for deleting 1000 objects simultaneously
causes significant random I/O, which on slower drives
leads to servers disconnecting in a distributed setup.
Simplify this by serially deleting and reducing the
chattiness of this operation.
Currently, requestid field in logEntry is not populated, as the
requestid field gets set at the very end.
It is now set before regular handler functions. This is also
useful in setting it as part of the XML error response.
Travis build for ppc64le has been quite inconsistent and stays queued
for most of the time. Removing this build as part of Travis.yml for
the time being.
- Add console target logging, enabled by default.
- Add http target logging, which supports an endpoint
with basic authentication (username/password are passed
in the endpoint url itself)
- HTTP target logging is asynchronous and some logs can be
dropped if channel buffer (10000) is full
In a small terminal window, the UI error handling tries to split lines for an
eye-candy error message. However, since we show some docs.minio.io links in some
error messages, these links end up broken and not easily selectable
in an X terminal. This PR changes the behavior and won't split lines
anymore.
This commit adds basic support for SSE-C / SSE-C copy.
This includes functions for determining whether SSE-C
is requested by the S3 client and functions for parsing
such HTTP headers.
All S3 SSE-C parsing errors are exported such that callers
can pattern-match to forward the correct error to S3
clients.
Further the SSE-C related internal metadata entry-keys
are added by this commit.
This commit adds a basic KMS implementation for an
operator-specified SSE-S3 master key. The master key
is wrapped as KMS such that using SSE-S3 with master key
and SSE-S3 with KMS can use the same code.
Bindings for a remote / true KMS (like hashicorp vault)
will be added later on.
This commit updates the key derivation to reflect the
latest change of crypto/doc.go. This includes handling
the insecure legacy KDF.
Since #6064 is fixed, the third test case for object key
generation is enabled again.
With CoreDNS now supporting etcdv3 as the DNS backend, we
can update our federation target to etcdv3. Users will now be
able to use etcdv3 server as the federation backbone.
Minio will update bucket data to etcdv3 and CoreDNS can pick
that data up and serve it as bucket style DNS path.
This commit fixes the size calculation for multipart
objects. The decrypted size of an encrypted multipart
object is the sum of the decrypted part sizes.
Also fixes the key derivation in CopyObjectPart.
Instead of using the same object-encryption-key for each
part, a unique per-part key is now derived.
Updates #6139
Minio server was preventing itself from starting when any notification
target is down and not running. The PR changes the behavior by
avoiding startup abort in that case, so the user will still
be able to access Minio server using mc admin commands after
a restart or set config commands.
This commit fixes a weakness of the key-encryption-key
derivation for SSE-C encrypted objects. Before this
change the key-encryption-key was not bound to / didn't
depend on the object path. This allows an attacker to
replace objects - encrypted with the same
client-key - with each other.
This change fixes this issue by updating the
key-encryption-key derivation to include:
- the domain (in this case SSE-C)
- a canonical object path representation
- the encryption & key derivation algorithm
Changing the object path now causes the KDF to derive a
different key-encryption-key such that the object-key
unsealing fails.
Including the domain (SSE-C) and encryption & key
derivation algorithm is not directly necessary for this
fix. However, both will be included for the SSE-S3 KDF.
So they are included here to avoid updating the KDF
again when we add SSE-S3.
The legacy KDF 'DARE-SHA256' is only used for existing
objects and never for new objects / key rotation.
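An illustrative sketch, not the server's exact KDF (which is specified in crypto/doc.go), of binding the key-encryption-key to domain, canonical object path and algorithm via HMAC-SHA256; the domain and algorithm strings below are placeholders:
```
package example

import (
	"crypto/hmac"
	"crypto/sha256"
	"path"
)

// deriveKEK binds the key-encryption-key to the SSE domain, the canonical
// object path and the algorithm, so moving ciphertext to another path makes
// unsealing fail. Illustrative only - the real KDF lives in crypto/doc.go.
func deriveKEK(clientKey []byte, bucket, object string) []byte {
	mac := hmac.New(sha256.New, clientKey)
	mac.Write([]byte("SSE-C"))                   // domain
	mac.Write([]byte(path.Join(bucket, object))) // canonical object path
	mac.Write([]byte("DAREv2-HMAC-SHA256"))      // encryption & KDF algorithm
	return mac.Sum(nil)
}
```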
This PR simplifies the code to avoid tracking
any running usage events. It also brings
in an upper threshold of up to 1 minute to suspend
the usage function, after which usage
proceeds without waiting any longer.
This commit introduces a new crypto package providing
AWS S3 related cryptographic building blocks to implement
SSE-S3 (master key or KMS) and SSE-C.
This change only adds some basic functionality, esp.
related to SSE-S3, and documents the general approach
for SSE-S3 and SSE-C.
disk usage crawling is not needed when a tenant
is not sharing the same disk with multiple other
tenants. This PR adds an optimization: when we
see that a setup uses the entire disk, we simply rely on
statvfs() to give us total usage.
This PR also additionally adds low priority
scheduling for usage check routine, such that
other go-routines blocked will be automatically
unblocked and prioritized before usage.
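A minimal Linux sketch of the statvfs()-style shortcut described above, using golang.org/x/sys/unix; the mount path is a placeholder:
```
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// diskUsage reports total and used bytes for the filesystem backing the
// path, letting us skip a full namespace crawl when a tenant owns the disk.
func diskUsage(diskPath string) (total, used uint64, err error) {
	var st unix.Statfs_t
	if err = unix.Statfs(diskPath, &st); err != nil {
		return 0, 0, err
	}
	total = st.Blocks * uint64(st.Bsize)
	used = (st.Blocks - st.Bfree) * uint64(st.Bsize)
	return total, used, nil
}

func main() {
	t, u, err := diskUsage("/export") // placeholder mount point
	if err == nil {
		fmt.Printf("total=%d used=%d\n", t, u)
	}
}
```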
Minio server returns 403 (access denied) for head requests to prefixes
without trailing "/", this is different from S3 behaviour. S3 returns
404 in such cases.
Fixes#6080
This commit prevents complete server failures caused by
`logger.CriticalIf` calls. Instead of calling `os.Exit(1)`
the function now executes a panic with a special value
indicating that a critical error happened. At the top HTTP
handler layer panics are recovered, and if it's a critical
error the client gets an InternalServerError status code.
Further this allows unit tests to cover critical-error code
paths.
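A sketch of the panic/recover shape described above, with illustrative names (the server's actual sentinel type and handler differ):
```
package example

import "net/http"

// criticalError is an illustrative sentinel thrown by a
// logger.CriticalIf-style helper instead of calling os.Exit(1).
type criticalError struct{ err error }

func criticalIf(err error) {
	if err != nil {
		panic(criticalError{err})
	}
}

// recoverHandler converts a critical-error panic into a 500 response,
// keeping the server alive and making the code path unit-testable.
func recoverHandler(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if v := recover(); v != nil {
				if _, ok := v.(criticalError); ok {
					http.Error(w, "Internal Server Error", http.StatusInternalServerError)
					return
				}
				panic(v) // unrelated panics propagate as usual
			}
		}()
		next.ServeHTTP(w, r)
	})
}
```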
Add compile time GOROOT path to the list of prefix
of file paths to be removed.
Add webhandler function names to the slice that
stores function names to terminate logging.
During startup, until the object layer is initialized, the
logger is disabled to provide for a cleaner UI error
message. CriticalIf is disabled; use FatalIf instead.
Also never call os.Exit(1) on running servers where
you can return error to client in handlers.
This commit limits the amount of memory allocated by the
S3 Multi-Object-Delete-API. The server used to allocate as
many bytes as provided by the client using Content-Length.
S3 specifies that the S3 Multi-Object-Delete-API can delete
at most 1000 objects using a single request.
(See: https://docs.aws.amazon.com/AmazonS3/latest/API/multiobjectdeleteapi.html)
Since the maximum S3 object name is limited to 1024 bytes the
XML body sent by the client can only contain up to 1000 * 1024
bytes (excluding XML format overhead).
This commit limits the size of the parsed XML for the S3
Multi-Object-Delete-API to 2 MB. This fixes a DoS
vulnerability since (auth.) clients, MitM-adversaries
(without TLS) and un-auth. users accessing buckets allowing
multi-delete by policy can kill the server.
This behavior is similar to the AWS-S3 implementation.
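A hedged sketch of the body-capping idea with io.LimitReader; the struct and constant names are illustrative, and only the 2 MB / 1000-object bounds come from the text above:
```
package example

import (
	"encoding/xml"
	"errors"
	"io"
	"net/http"
)

const maxDeleteXMLSize = 2 << 20 // 2 MiB cap for the Delete XML body

type deleteRequest struct {
	XMLName xml.Name `xml:"Delete"`
	Objects []struct {
		Key string `xml:"Key"`
	} `xml:"Object"`
}

// parseDeleteRequest reads at most maxDeleteXMLSize bytes of the body, so
// a client cannot make the server allocate Content-Length bytes up front.
func parseDeleteRequest(r *http.Request) (*deleteRequest, error) {
	var dr deleteRequest
	limited := io.LimitReader(r.Body, maxDeleteXMLSize)
	if err := xml.NewDecoder(limited).Decode(&dr); err != nil {
		return nil, err
	}
	if len(dr.Objects) > 1000 {
		return nil, errors.New("too many objects in delete request")
	}
	return &dr, nil
}
```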
This PR adds CopyObject support for objects residing in buckets
in different Minio instances (where Minio instances are part of
a federated setup).
Also, added support for multiple Minio domain IPs. This is required
for distributed deployments, where one deployment may have multiple
nodes, each with a different public IP.
Buckets already present on a Minio server before it joins a
bucket federated deployment will now be added to etcd during
startup. In case of a bucket name collision, admin is informed
via Minio server console message.
Added configuration migration for configuration stored in etcd
backend.
Also, environment variables are updated and ListBucket path style
request is no longer forwarded.
Added new RPC support using HTTP POST. RPC
arguments and replies are Gob encoded and sent as HTTP
request/response bodies.
This patch also removes Go RPC based implementation.
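A minimal sketch of the client side of such a transport: Gob-encode the arguments into the POST body and Gob-decode the reply (names are illustrative):
```
package example

import (
	"bytes"
	"encoding/gob"
	"net/http"
)

// call gob-encodes args into the POST body and gob-decodes the reply from
// the response body; reply must be a pointer to the expected type.
func call(client *http.Client, url string, args, reply interface{}) error {
	var body bytes.Buffer
	if err := gob.NewEncoder(&body).Encode(args); err != nil {
		return err
	}
	resp, err := client.Post(url, "application/octet-stream", &body)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return gob.NewDecoder(resp.Body).Decode(reply)
}
```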
With the implementation of dummy GET ACL handlers,
tools like s3cmd perform a few operations which cause
the ACL call to be invoked. Make sure that in our
router configuration GET?acl comes before the actual
GET call to facilitate this dummy call.
tests were written in a manner that edits internal
variables of fsObjects to mimic certain behavior from
APIs, but this is racy when an active goroutine is
reading from the same variable.
Make sure to terminate the goroutine if possible for
these tests.
The current problem is that when you invoke
```
mc admin info myminio | head -1
● localhost:9000
```
This output is incorrect as the expected output should be
```
mc admin info myminio | head -1
● 192.168.1.17:9000
```
This commit adds a check to the server's admin-API such that it only
accepts Admin-API requests with authenticated bodies. Further this
commit updates the `madmin` package to always add the
`X-Amz-Content-Sha256` header.
This change improves the Admin-API security since the server does not
accept unauthenticated request bodies anymore.
After this commit `mc` must be updated to the new `madmin` api because
requests over TLS connections will fail.
This commit fixes a DoS vulnerability for certain APIs using
signature V4 by verifying the content-md5 and/or content-sha256 of
the request body in a streaming mode.
The issue was caused by reading the entire body of the request into
memory to verify the content-md5 or content-sha256 checksum if present.
The vulnerability could be exploited by either replaying a V4 request
(in the 15 min time frame) or sending a V4 presigned request with a
large body.
Removed field minio_http_requests_total as it was redundant with
minio_http_requests_duration_seconds_count
Also removed field minio_server_start_time_seconds as it was
redundant with process_start_time_seconds
GetBucketACL call returns empty for all GET in ACL requests,
the primary purpose of this PR is to provide legacy API support
for legacy applications.
Fixes#5706
Better support of HEAD and listing of zero sized objects with trailing
slash (a.k.a. empty directory). For that, an isLeafDir function is added
to indicate if the specified object is an empty directory or not. Each
backend (xl, fs) has the responsibility to store that information.
Currently, in both of XL & FS, an empty directory is represented by
an empty directory in the backend.
isLeafDir() checks if the given path is an empty directory or not.
Since dir listing is costly if the directory contains too many objects,
readDirN() is added in this PR to list only N entries.
In isLeafDir(), we will only list one entry to check if a directory
is empty or not.
This commit fixes a DoS vulnerability in the
request authentication. The root cause is an 'unlimited'
read-into-RAM from the request body.
Since this read happens before the request authentication
is verified, the vulnerability can be exploited without any
access privileges.
This commit limits the size of the request body to 3 MB.
This is about the same size as AWS. The limit seems to be
between 1.6 and 3.2 MB - depending on the AWS machine which
is handling the request.
This commit ensures that all tickers are stopped using the defer ticker.Stop()
style. This also fixes a bug seen when a client starts to listen for
event notifications; that case resulted in a ticker leak.
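The pattern being enforced, in a small sketch:
```
package example

import "time"

// watch ties the ticker's lifetime to the goroutine with defer, so every
// exit path releases it and no ticker is leaked.
func watch(doneCh <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop() // released regardless of how we return
	for {
		select {
		case <-ticker.C:
			// periodic work goes here
		case <-doneCh:
			return
		}
	}
}
```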
Current healing has an issue where disks are healed
even when they are offline, without knowing if the disk
is unformatted. This can lead to prematurely
removing the disk from the set just because it was
temporarily offline.
`mc admin heal` is increasingly used
on a cron or regular basis. If the healing
code sees a disk as offline it might prematurely take it down,
which causes availability issues.
Fixes#5826
Previously we used to allow bucket policies without
`Version` field to be set to any given value, but
this behavior is inconsistent with AWS S3.
PR #5790 addressed this by making bucket policies
stricter and cleaner, but this causes a breaking
change causing any existing policies perhaps without
`Version` field or the field to be empty to fail upon
server startup.
This PR brings a code to migrate under these scenarios
as a one time operation.
- remove old bucket policy handling
- add new policy handling
- add new policy handling unit tests
This patch brings support for bucket policy to have more control, not
limited to anonymous users. The bucket owner controls allow/deny for any REST
API.
For example, server side encryption can be controlled by allowing
PUT/GET of objects with encryption, including for the bucket owner.
This change disables the non-constant-time implementations of P-384 and P-521.
As a consequence a client using just these curves cannot connect to the server.
This should be no real issue because (all) clients at least support P-256.
Further this change also rejects ECDSA private keys of P-384 and P-521.
While non-constant-time implementations for the ECDHE exchange don't expose an
obvious vulnerability, using P-384 or P-521 keys for the ECDSA signature may allow
practical timing attacks.
Fixes#5844
- getBucketLocation
- headBucket
- deleteBucket
Should return 404 or NoSuchBucket even for invalid bucket names, invalid
bucket names are only validated during MakeBucket operation
This is an effort to remove panic from the source.
Add a new call called CriticalIf, which calls LogIf and exits.
Replace panics with one of CriticalIf, FatalIf and a return of error.
Make sure to apply standard headers such as Content-Type,
Content-Disposition and Content-Language to the correct
GCS object attributes during object upload and copy operations.
Fixes: #5800
As we move to multiple config backends like local disk and etcd,
the config file should not be read from the disk directly; instead the quick
package should load it and verify for duplicate entries.
This change adds some security headers like Content-Security-Policy.
It does not set the HSTS header because Content-Security-Policy prevents
mixed HTTP and HTTPS content and the server does not use cookies.
However it is a header which could be added later on.
It also moves some header added by #5805 from a vendored file
to a generic handler.
Fixes #5813
Also make sure not to modify the underlying errors from
lower layers; we should return the error as-is, and only the object
layer should translate the errors.
Fixes#5797
This PR introduces ReloadFormat API call at objectlayer
to facilitate this. Previously we repurposed HealFormat
but we never ended up updating our reference format on
peers.
Fixes#5700
This change lets the server return the S3 error for a key rotation
if the source key is not valid but equal to the destination key.
This change also fixes the SSE-C error messages since AWS returns error messages
ending with a '.'.
Fixes#5625
This change sets the storage class of the object-info if a storage
class was specified during PUT. The server now replies with the
storage class which was set during uploading the object in FS mode.
Fixes#5777
Default installations of cloned VMs in VMware-like environments
might experience serious problems with time skewing.
Allow for a higher value: instead of 3 seconds we are
moving to 15 minutes, just like the API level skew.
Access to the internet and configuring NTP might not be possible;
in such situations providing at least a 15 minute skew could
cater for the majority of situations.
Since we do not re-use storageDisks after moving
the connections to the object layer, we should close them
appropriately; otherwise we have a lot of connection
leaks, and these can compound as time goes by.
This PR also refactors the initialization code to
re-use storageDisks for given set of endpoints until
we have confirmed a valid reference format.
An issue was reproduced when there are no more inodes
available on an existing setup of 4 disks. Now we
took one of the disks and reformatted it to relinquish
inodes. Now we attempt to bring the fresh disk back
into setup and perform a heal - at this point creating
new `format.json` fails on existing disks since they
do not have more inodes available.
At this point, due to quorum failure, we end up deleting
the existing `format.json` as well; this PR removes the code
which deletes the existing `format.json`, as there is no need
to delete it.
Previous PR 2afd196c83 fixed
the issue of quorum based listing for regular objects; this
PR continues that idea by extending the support to
object directory prefixes as well.
Fixes#5733
Set GOPATH string to empty in build-constants.go
Check for both compile time GOPATH and default GOPATH
while trimming the file path in the stack trace.
Fixes#5741
This PR fixes two different variants of deadlocks in
notification.
- holding write lock on the bucket competing with read lock
- holding competing locks on read/save notification config
This PR adds disk based edge caching support for minio server.
Cache settings can be configured in config.json to take list of disk drives,
cache expiry in days and file patterns to exclude from cache or via environment
variables MINIO_CACHE_DRIVES, MINIO_CACHE_EXCLUDE and MINIO_CACHE_EXPIRY
Design assumes that Atime support is enabled and the list of cache drives is
fixed.
- Objects are cached on both GET and PUT/POST operations.
- Expiry is used as a hint to evict older entries from the cache, or if 80% of
cache capacity is filled.
- When object storage backend is down, GET, LIST and HEAD operations fetch
object seamlessly from cache.
Current Limitations
- Bucket policies are not cached, so anonymous operations are not supported in
offline mode.
- Objects are distributed using deterministic hashing among the list of cache
drives specified. If one or more drives go offline, or the cache drive
configuration is altered, performance could degrade to a linear lookup.
Fixes#4026
This is a trivial fix to support server level WORM. The feature comes
with an environment variable `MINIO_WORM`.
Usage:
```
$ export MINIO_WORM=on
$ minio server endpoint
```
Object deletion should not be possible if quorum is not
available. This PR updates deleteObject() to check for
quorum errors before proceeding with object deletion.
Fixes#5535
This commit adds the bucket delete and bucket policy functionalities
to the browser.
Part of rewriting the browser code to follow best practices and
guidelines of React (issues #5409 and #5410)
The backend code has been modified by @krishnasrinivas to prevent
issue #4498 from occurring. The relevant changes have been made to the
code according to the latest commit and the unit tests in the backend.
This commit also addresses issue #5449.
A migration regression got introduced in 9083bc152e;
adding more unit tests to catch this scenario. We need to fix this by
re-writing the formats after the migration to 'V3'.
This bug only happens when a user is migrating directly from V1 to V3,
not from V1 to V2 and V2 to V3.
Added additional unit tests to cover these situations as well.
Fixes#5667
- Add head method for healthcheck endpoint. Some platforms/users
may use the HTTP Head method to check for health status.
- Add liveness and readiness probe examples in Kubernetes yaml
example docs. Note that readiness probe not added to StatefulSet
example due to https://github.com/kubernetes/kubernetes/issues/27114
With the following changes
- Add SSE and refactor encryption API (#942) <Andreas Auernhammer>
- add copyObject test changing metadata and preserving etag (#944) <Harshavardhana>
- Add SSE-C tests for multipart, copy, get range operations (#941) <Harshavardhana>
- Removing conditional check for notificationInfoCh in api-notication (#940) <Matthew Magaldi>
- Honor prefix parameter in ListBucketPolicies API (#929) <kannappanr>
- test for empty objects uploaded with SSE-C headers (#927) <kannappanr>
- Encryption headers should also be set during initMultipart (#930) <Harshavardhana>
- Add support for Content-Language metadata header (#928) <kannappanr>
- Fix check for duplicate notification configuration entries (#917) <kannappanr>
- allow OS to cleanup sockets in TIME_WAIT (#925) <Harshavardhana>
- Sign V2: Fix signature calculation in virtual host style (#921) <A. Elleuch>
- bucket policy: Support json string in Principal field (#919) <A. Elleuch>
- Fix copyobject failure for empty files (#918) <kannappanr>
- Add new constructor NewWithOptions to SDK (#915) <poornas>
- Support redirect headers to sign again with new Host header. (#829) <Harshavardhana>
- Fail in PutObject if invalid user metadata is passed <Harshavadhana>
- PutObjectOptions Header: Don't include invalid header <Isaac Hess>
- increase max retry count to 10 (#913) <poornas>
- Add new regions for Paris and China west. (#905) <Harshavardhana>
- fix s3signer to use req.Host header (#899) <Bartłomiej Nogaś>
This PR adds readiness and liveness endpoints to probe Minio server
instance health. Endpoints can only be accessed without authentication
and the paths are /minio/health/live and /minio/health/ready for
liveness and readiness respectively.
The new healthcheck liveness endpoint is used for Docker healthcheck
now.
Fixes#5357 Fixes#5514
In Kubernetes StatefulSet-like environments, when secrets
are mounted to pods they have sub-directories; we should
ideally be looking only for regular files here and skip
all others.
Fix a compatibility issue with AWS S3 where, to do key rotation,
we need to replace an existing object's metadata. In such a
scenario the "REPLACE" metadata directive is not necessary.
- Data from disk was being read after bitrot verification to return
data for GetObject. Strictly speaking this does not guarantee bitrot
protection, as disks may return bad data even temporarily.
- This fix reads data from disk, verifies data for bitrot and then
returns data to the client directly.
The current code didn't implement the logic to support
decrypting multiple encrypted parts; this PR fixes that
by supporting copying of encrypted multipart objects.
Currently we send `X-Minio-Internal` values
back to the client for an encrypted object; we should
filter these out and only reply with AWS compatible headers.
*) Add Put/Get support of multipart in encryption
*) Add GET Range support for encryption
*) Add CopyPart encrypted support
*) Support decrypting of large single PUT object
Flags like `json, config-dir, quiet` are now honored even if they are
between minio and gateway in the cli, like, `minio --json gateway s3`.
Fixes#5403
Stable sort is needed when we are sorting based on two or more
distinct elements. When equal elements are indistinguishable,
such as with integers, or more generally, any data where the
entire element is the key like `PartNumber`, stability is not
an issue.
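A small example of the distinction (illustrative types):
```
package main

import (
	"fmt"
	"sort"
)

type part struct {
	PartNumber int
	Size       int64
}

func main() {
	parts := []part{{3, 10}, {1, 5}, {2, 7}}
	// PartNumber is the entire key, so equal elements are indistinguishable
	// and plain sort.Slice is enough; sort.SliceStable would only matter if
	// we sorted on a second field and needed to preserve an earlier order.
	sort.Slice(parts, func(i, j int) bool { return parts[i].PartNumber < parts[j].PartNumber })
	fmt.Println(parts)
}
```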
Refactor such that metadata and etag are
combined to a single argument `srcInfo`.
This is a precursor change for #5544 making
it easier for us to provide encryption/decryption
functions.
Delete & Multi Delete API should not try to remove the directory content.
The only permitted case is with zero size object with a trailing slash
in its name.
MaxIdleConns limits the total number of connections
kept in the pool for re-use. In addition, MaxIdleConnsPerHost
limits the number for a single host. Since minio gateways
usually connect to the same host, setting `MaxIdleConns = 100`
won't really have much of an impact since the idle connection
pool is limited to 2 anyway.
Now, with the pool set to a limit of 2, and when using
the client heavily from 2+ goroutines, the `http.Transport`
will open a connection, use it, then try to return it to
the idle-pool which often fails since there's a limit of 2.
So it's going to close the connection and new ones will be
opened on demand again, many of which get closed soon after
being used. Since those connections/sockets don't disappear
from the OS immediately, use `MaxIdleConnsPerHost = 100`
which fixes this problem.
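A sketch of a transport tuned this way:
```
package example

import (
	"net/http"
	"time"
)

// newGatewayClient raises MaxIdleConnsPerHost from the default of 2 so
// connections to the single backend host are actually re-used under
// concurrent load instead of being opened and closed repeatedly.
func newGatewayClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 100, // the fix: the default is only 2
			IdleConnTimeout:     90 * time.Second,
		},
	}
}
```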
Overwriting files is allowed, but since the introduction of
the object directory, we also need to allow overwriting
an empty directory. Putting the same object directory twice
won't fail with a 403 error anymore.
TestNewWebHookNotify wasn't passing on my local machine. The reason is
that the test expects the POST handler (as a webhook endpoint) to always be
running on port 80, which is not always the case.
This PR implements an object layer which
combines input erasure sets of XL layers
into a unified namespace.
This object layer extends the existing
erasure coded implementation; it is assumed
in this design that providing > 16 disks is
a static configuration as well, i.e. if you started
the setup with 32 disks as 4 sets of 8 disks per
pack, then you would need to provide 4 sets always.
Some design details and restrictions:
- Objects are distributed using consistent ordering
to a unique erasure coded layer.
- Each pack has its own dsync so locks are synchronized
properly at pack (erasure layer).
- Each pack still has a maximum of 16 disks
requirement, you can start with multiple
such sets statically.
- Sets of disks are static and cannot be
changed; there is no elastic expansion allowed.
- Sets of disks are static and cannot be
changed; there is no elastic removal allowed.
- ListObjects() across sets can be noticeably
slower since List happens on all servers,
and is merged at this sets layer.
Fixes#5465 Fixes#5464 Fixes#5461 Fixes#5460 Fixes#5459 Fixes#5458 Fixes#5460 Fixes#5488 Fixes#5489 Fixes#5497 Fixes#5496
Since we do not encrypt directories we don't need to send
errors with encryption headers when the directory doesn't
have encryption metadata.
Continuation PR from 4ca10479b5.
It can happen that one of the disks that was down
returns 'errDiskNotFound', but the err is preserved due to
loop shadowing, which leads to issues when healing the bucket.
This change adds an object size check such that the server does not
encrypt empty objects (typically folders) for SSE-C. The server still
returns SSE-C headers but the object is not encrypted since there is no
point to encrypt such objects.
Fixes#5493
Currently minio master requires 4 servers; we
have decided to run on a minimum of 2 servers
instead - this fixes a regression from previous
releases where 3 server setups were supported.
This PR brings semver capabilities to our RPC layer to
ensure that we can upgrade the servers in a rolling fashion
while keeping I/O in progress. This is only a framework change;
the functionality remains the same as such, and we do not
have any special API changes for now. But in the future when
we bring in API changes we will be able to upgrade servers
without downtime.
An additional change in this PR is to not abort when serverVersions
mismatch in a distributed cluster; instead we wait for quorum and
treat the situation as if the server is down. This allows
the administrator to properly upgrade all the servers in the cluster.
Fixes#5393
in-memory caching cannot be cleanly implemented
without access to the GC, which Go doesn't naturally
provide. At times we have seen that object caching
is more of a hindrance than a boon for
our use cases.
Removing it completely from our implementation
related to #5160 and #5182
This is a generic minimum value. The current reason is to support
Azure blob storage account names whose length is less than 5; 3 is the
minimum length for Azure.
Check if the storage class is set in a
non-XL setup instead of relying on the `globalEndpoints`
value. Also converge the checks for both SS
and RRS parity configuration.
This PR also removes redundant `tt.name` in all
test cases, since each testcase doesn't need to
be numbered explicitly they are numbered implicitly.
* Update the GetConfig admin API to use the latest version of
configuration, along with fixes to the corresponding RPCs.
* Remove mutex inside the configuration struct, and inside
notification struct.
* Use global config mutex where needed.
* Add `serverConfig.ConfigDiff()` that provides a more granular diff
of what is different between two configurations.
- Changes related to moving admin APIs
- admin APIs now have an endpoint under /minio/admin
- admin APIs are now versioned - a new API to serve the version is
added at "GET /minio/admin/version" and all API operations have the
path prefix /minio/admin/v1/<operation>
- new service stop API added
- credentials change API is moved to /minio/admin/v1/config/credential
- credentials change API and configuration get/set API now require TLS
so that credentials are protected
- all API requests now receive JSON
- heal APIs are disabled as they will be changed substantially
- Heal API changes
Heal API is now provided at a single endpoint with the ability for a
client to start a heal sequence on all the data in the server, a
single bucket, or under a prefix within a bucket.
When a heal sequence is started, the server returns a unique token
that needs to be used for subsequent 'status' requests to fetch heal
results.
On each status request from the client, the server returns heal result
records that it has accumulated since the previous status request. The
server accumulates up to 1000 records and pauses healing further
objects until the client requests for status. If the client does not
request any further records for a long time, the server aborts the
heal sequence automatically.
A heal result record is returned for each entity healed on the server,
such as system metadata, object metadata, buckets and objects, and has
information about the before and after states on each disk.
A client may request to force restart a heal sequence - this causes
the running heal sequence to be aborted at the next safe spot and
starts a new heal sequence.
In the current implementation we used as many dsync clients
as the number of endpoints (along with path), which is not
the expected implementation. The implementation of Dsync
was expected to be just for the endpoint Host alone, such
that if you have 4 servers, each with 4 disks, we need
to have only 4 dsync clients and 4 dsync servers. But
we currently had 8 clients and servers, which is in fact
unexpected and should be avoided.
This PR brings the implementation back to its original
intention. This issue was found #5160
This change is a simplification over the existing
code, since it is not required to have a separate
RPCClient structure; keeping authRPCClient can
do the same job.
There is no code which directly uses netRPCClient(),
keeping authRPCClient is better and simpler. This
simplification also allows for removal of multiple
levels of locking code per object.
Observed in #5160
This change adds the HighwayHash256 PRF as the bitrot protection / detection
algorithm. Since HighwayHash256 requires a 256 bit key, we generate a
key from the first 100 decimals of π - see nothing-up-my-sleeve numbers.
This key is fixed forever and tied to the HighwayHash256 bitrot algorithm.
Fixes#5358
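A small example of computing such a checksum with the minio/highwayhash package; the key below is a zero-filled placeholder, not the actual pinned key:
```
package main

import (
	"fmt"

	"github.com/minio/highwayhash"
)

func main() {
	// HighwayHash256 requires a 32-byte key; the server pins one fixed
	// nothing-up-my-sleeve key to the algorithm. Placeholder key here.
	key := make([]byte, 32)
	h, err := highwayhash.New(key)
	if err != nil {
		panic(err)
	}
	h.Write([]byte("erasure coded block"))
	fmt.Printf("%x\n", h.Sum(nil)) // 256-bit bitrot checksum
}
```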
The problem was that after globalServiceDoneCh receives a
message, we cleanly stop the ticker as expected, but the
goroutine where the `select` loop is running never
returns. The stage at which this may occur,
i.e. while the server is being restarted, doesn't seriously affect
server usage. But any build-up like this on the server has
consequences as new functionality comes in the future.
With storage class support, the free and total space
reported in Minio XL startup banner should be based on
totalDisks - standardClassParityDisks, instead of totalDisks/2.
fixes#5416
This change replaces all imports of "crypto/sha256" with
"github.com/minio/sha256-simd". The sha256-simd package
is faster on ARM64 (NEON instructions) and can take advantage
of AVX-512 in certain scenarios.
Fixes#5374
Internally, triton-go, which Manta Minio is built on, changed its internal
error handling. This means we no longer need to unwrap specific error types.
This doesn't change any Manta Minio functionality - it just changes how errors are
handled internally and adds a wrapper for a 404 error.
This change fixes an authentication bypass attack against the
minio Admin-API. Therefore the Admin-API now rejects all types of
requests except valid signature V2 and signature V4 requests - this
includes signature V2/V4 pre-signed requests.
Fixes#5411
This fix removes logrus package dependency and refactors the console
logging as the only logging mechanism by removing file logging support.
It rearranges the log message format and adds stack trace information
whenever trace information is not available in the error structure.
It also adds `--json` flag support for server logging.
When minio server is started with `--json` flag, all log messages are
displayed in json format, with no start-up and informational log
messages.
Fixes#5265#5220#5197
Any concurrent removeObjects in progress
might have removed the parents of the same prefix
for which there is an ongoing putObject request.
An inconsistent situation may arise, as explained
below, even under sufficient locking.
PutObject is almost successful at the last stage when
a temporary file is renamed to its actual namespace
at `a/b/c/object1`. Concurrently a RemoveObject is
also in progress at the same prefix for an `a/b/c/object2`.
To create the object1 at location `a/b/c` PutObject has
to create all the parents recursively.
```
a/b/c - os.MkdirAll loops through has now created
'a/' and 'b/' about to create 'c/'
a/b/c/object2 - at this point 'c/' and 'object2'
are deleted about to delete b/
```
Now for the os.MkdirAll loop the expected situation is
that the top level parent 'a/b/' it created exists,
such that it can create 'c/' - since removeObject
and putObject do not compete for a lock, as they hold
locks on different resources. removeObject proceeds
to delete parent 'b/' since 'c/' is not yet present;
once deleted, 'os.MkdirAll' receives
syscall.ENOENT, which fails the putObject request.
This PR tries to address this issue by implementing
a safer/guarded approach where we would retry an operation
such as `os.MkdirAll` and `os.Rename` if both operations
observe syscall.ENOENT.
Fixes#5254
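A hedged sketch of such a guarded retry (not the repository's exact helper):
```
package example

import (
	"os"
	"syscall"
)

// mkdirAllRetry retries os.MkdirAll when a concurrent delete races away a
// freshly created parent, observed as syscall.ENOENT. Retry count is an
// illustrative choice.
func mkdirAllRetry(dirPath string, mode os.FileMode) error {
	var err error
	for i := 0; i < 5; i++ {
		if err = os.MkdirAll(dirPath, mode); err == nil {
			return nil
		}
		if pe, ok := err.(*os.PathError); !ok || pe.Err != syscall.ENOENT {
			return err // unrelated failure, give up
		}
		// A parent vanished mid-create; try again.
	}
	return err
}
```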
After the addition of Storage Class support, readQuorum
and writeQuorum are decided on a per object basis, instead
of deployment wide static quorums.
This PR updates madmin api to remove readQuorum/writeQuorum
and add Standard storage class and reduced redundancy storage
class parity as return values. Since these parity values are
used to decide the quorum for each object.
Fixes#5378
Since the server performs automatic clean-up of multipart uploads that
have not been resumed for more than a couple of weeks, it was decided
to remove functionality to heal multipart uploads.
If STANDARD storage class is set before starting up Minio server,
but x-amz-storage-class metadata field is not set in a PutObject
request, Minio server defaults to N/2 data and N/2 parity disks.
This PR changes the behaviour to use data and parity disks set in
STANDARD storage class, even if x-amz-storage-class metadata
field is not present in PutObject requests.
- Return error when the config JSON has duplicate keys (fixes#5286)
- Limit size of configuration file provided to 256KiB - this prevents
another form of DoS
Remove the requirement for IssuedAt claims from JWT
for now, since we do not currently have a way to provide
a leeway window for validating the claims. Expiry does
the same checks as IssuedAt with an expiry window.
We do not need it right now since we have clock skew check
in our RPC layer to handle this correctly.
rpc-common.go
```
func isRequestTimeAllowed(requestTime time.Time) bool {
// Check whether request time is within acceptable skew time.
utcNow := UTCNow()
return !(requestTime.Sub(utcNow) > rpcSkewTimeAllowed ||
utcNow.Sub(requestTime) > rpcSkewTimeAllowed)
}
```
Once the PR upstream is merged https://github.com/dgrijalva/jwt-go/pull/139
We can bring in support for leeway later.
Fixes#5237
x-amz-content-sha256 can be optional for any AWS signature v4
requests, make sure to skip sha256 calculation when payload
checksum is not set.
Here is the overall expected behavior
** Signed request **
- X-Amz-Content-Sha256 is set to 'empty' or some 'value' or its
not 'UNSIGNED-PAYLOAD'- use it to validate the incoming payload.
- X-Amz-Content-Sha256 is set to 'UNSIGNED-PAYLOAD' - skip checksum verification
- X-Amz-Content-Sha256 is not set we use emptySHA256
** Presigned request **
- X-Amz-Content-Sha256 is set to 'empty' or some 'value' or its
not 'UNSIGNED-PAYLOAD'- use it to validate the incoming payload
- X-Amz-Content-Sha256 is set to 'UNSIGNED-PAYLOAD' - skip checksum verification
- X-Amz-Content-Sha256 is not set we use 'UNSIGNED-PAYLOAD'
Fixes#5339
This PR updates the behaviour to print a relevant error message
if a storage class is set in config.json for the gateway.
This PR also fixes the case where storage class set via
environment variables is not parsed properly into config.json.
Save the http trace to a file instead of displaying it on the console.
The environment variable MINIO_HTTP_TRACE will be a filepath instead
of a boolean.
This is to handle the scenario where both json and http tracing are
turned on. In that case, both the http trace and json output are displayed
on the screen, making the json not parsable. Logging this trace to
a file helps us avoid that scenario.
Fixes#5263
Manta has the ability to allow users to authenticate with a
username other than the main account. We want to expose
this functionality to minio manta gateway.
This change adds support for password-protected private keys.
If the private key is encrypted the server tries to decrypt
the key with the password provided by the env variable
MINIO_CERT_PASSWD.
Fixes#5302
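A sketch of the decryption step using the standard library PEM helpers available at the time:
```
package example

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"os"
)

// loadPrivateKey decrypts a PEM-encoded private key with the password from
// MINIO_CERT_PASSWD when the key is encrypted; plaintext keys pass through.
func loadPrivateKey(pemBytes []byte) ([]byte, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return nil, errors.New("no PEM block found")
	}
	if !x509.IsEncryptedPEMBlock(block) {
		return block.Bytes, nil // not encrypted, use as-is
	}
	return x509.DecryptPEMBlock(block, []byte(os.Getenv("MINIO_CERT_PASSWD")))
}
```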
- Update startup banner to print storage class in capitals. This
makes it easier to identify different storage classes available.
- Update response metadata to not send STANDARD storage class.
This is in accordance with AWS S3 behaviour.
- Update minio-go library to bring in storage class related
changes. This is needed to make transparent translation of
storage class headers for Minio S3 Gateway.
Currently, browser access information is displayed without checking
if browser enabled flag is turned off in config.json. Fixing it to
hide the information if the flag is turned off.
Fixes#5312
This change replaces the non-constant time comparison of
request signatures with a constant time implementation. This
prevents a timing attack which can be used to learn a valid
signature for a request without knowing the secret key.
Fixes#5334
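The fix boils down to using crypto/subtle for the comparison, e.g.:
```
package example

import "crypto/subtle"

// signaturesMatch compares the computed and presented signatures in
// constant time, so the position of a mismatch cannot be learned from
// response timing.
func signaturesMatch(expected, presented []byte) bool {
	return subtle.ConstantTimeCompare(expected, presented) == 1
}
```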
This commit takes the existing remove bucket functionality written by
brendanashworth, integrates it to the current UI with a dropdown for
each bucket, and fixes small issues that were present, like the dropdown
not disappearing after the user clicks on 'Delete' for certain buckets.
This feature only deletes a bucket that is empty (that has no objects).
Fixes#4166
- Add storage class metadata validation for request header
- Change storage class header values to be consistent with AWS S3
- Refactor internal method to take only the required argument
HealFile() does not process the case where an empty file is lost on
some disks. Since Reed-Solomon erasure doesn't handle restoring empty
data, HealFile will create empty files similarly to CreateFile().
This adds configurable data and parity options on a per object
basis. To use variable parity
- Users can set environment variables to configure variable
parity
- Then add header x-amz-storage-class to putobject requests
with relevant storage class values
Fixes#4997
- Use it to send the Content-MD5 header correctly encoded to S3
Gateway
- Fixes a bug in PutObject (including anonymous PutObject) and
PutObjectPart with S3 Gateway found when testing with Mint.
Manta is an Object Storage by [Joyent](https://www.joyent.com/)
This PR adds initial support for Manta. It is intended as a
non-production-ready preview so that feedback can be obtained.
This PR allows 'minio update' to not only show the update banner
but also perform in-place upgrades.
Updates are done safely by validating the downloaded
sha256 of the binary.
Fixes #4781
This PR handles following situations
- secure endpoints provided, server should fail to start
if TLS is not configured
- insecure endpoints provided, server starts ignoring
if TLS is configured or not.
Fixes #5251
- Adds a metadata argument to the CopyObjectPart API to facilitate
implementing encryption for copying APIs too.
- Update vendored minio-go - this version implements the
CopyObjectPart client API for use with the S3 gateway.
Fixes #4885
This check incorrectly rejects most valid filenames. The only filenames Sia
forbids are those with leading forward slashes or path-traversal characters,
and it's better to simply allow Sia to reject invalid names on its own rather
than try to anticipate errors from Sia:
https://github.com/NebulousLabs/Sia/blob/master/doc/api/Renter.md#path-parameters-4
The problem in existing code was the following line
```
start := int(keyCrc%uint32(cardinality)) | 1
```
For a given cardinality N, the bitwise '|' in the result
always biases the start position toward odd values.
As can be seen from the test cases, this can lead to many
objects being allocated the same set of disks, or at least
an odd-numbered first disk, always. This introduces a
performance problem for the majority of objects under
concurrent load.
Removing `| 1` provides a cleaner distribution, and the
new code becomes:
```
start := int(keyCrc % uint32(cardinality))
```
Thanks to Krishna Srinivas for pointing out the bitwise
situation here.
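A quick way to see the bias is to count start positions under the old expression; with `| 1` every even-numbered first disk is unreachable (a standalone sketch, not test code from this PR):
```
package main

import (
	"fmt"
	"hash/crc32"
)

func main() {
	const cardinality = 16
	counts := make([]int, cardinality)
	for i := 0; i < 10000; i++ {
		keyCrc := crc32.ChecksumIEEE([]byte(fmt.Sprintf("object-%d", i)))
		start := int(keyCrc%uint32(cardinality)) | 1 // the old, biased expression
		counts[start]++
	}
	fmt.Println(counts) // every even index stays at zero
}
```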
This change introduces the following simplified steps to follow
during config migration.
```
// Steps to move from version N to version N+1
// 1. Add new struct serverConfigVN+1 in config-versions.go
// 2. Set configCurrentVersion to "N+1"
// 3. Set serverConfigCurrent to serverConfigVN+1
// 4. Add new migration function (ex. func migrateVNToVN+1()) in config-migrate.go
// 5. Call migrateVNToVN+1() from migrateConfig() in config-migrate.go
// 6. Make changes in config-current_test.go for any test change
```
In the current implementation we faked the MakeBucket operation
to allow S3 clients to behave properly. Instead we can create a
placeholder zero-byte file whose name is a hexadecimal
representation of the bucket name itself.
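A minimal sketch of the placeholder naming, assuming the gateway keeps one zero-byte file per bucket:
```
package main

import (
	"encoding/hex"
	"fmt"
)

func main() {
	bucket := "testbucket"
	// The zero-byte placeholder object is named after the hex
	// encoding of the bucket name.
	fmt.Println(hex.EncodeToString([]byte(bucket))) // 746573746275636b6574
}
```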
The Sia gateway had a bug with uploading that prevented the user's uploads
from reaching the Sia backend. The PutObject function called fsRemoveFile at
the end of the function, which didn't give the Sia backend enough time to
upload the file to the Sia network.
This adds a goroutine that watches the file upload progress and doesn't delete
the file until the upload reaches 100% complete.
Note that this solution has the limitation where if the minio process dies in
the middle of upload, it will leave orphaned files in the SIA_TEMP directory
that the user will need to remove manually.
This PR changes the behavior of DecryptRequest.
Instead of returning `object-tampered` if the client-provided
key is wrong, DecryptRequest will return `access-denied`.
This is AWS S3 behavior.
Fixes #5202
Apache Spark sends getObject requests with trailing "/".
This PR updates the getObjectInfo to stat for files
even if they are sent with trailing "/".
Fixes #2965
Previously ListenBucketNotificationHandler could deadlock with
PutObjectHandler's eventNotify call when a client closes its
connection. This change removes the cyclic dependency between the
channel and map of ARN to channels by using a separate done channel to
signal that the client has quit.
This change brings in public data types such that
we can ask projects to implement gateway projects
externally rather than maintaining them in our repo.
All publicly exported structs are maintained in object-api-datatypes.go
completePart --> CompletePart
uploadMetadata --> MultipartInfo
All other exported errors are at object-api-errors.go
The S3 spec requires that a MethodNotAllowed error be returned if an object
name is part of the URL.
Fix postpolicy related unit tests to not set object name as part of target URL.
Fixes #5141
On windows having a preceding "/" will cause problems, if the
command line already has C:/<export-folder/ in it. Final resulting
path on windows might become C:/C:/ this will cause problems
of starting minio server properly in distributed mode on windows.
As a special case make sure to trim off the separator.
NOTE: It is also perfectly fine for windows users to have a path
without C:/ since at that point we treat it as relative path
and obtain the full filesystem path as well. Providing C:/
style is necessary to provide paths other than C:/,
such as F:/, D:/ etc.
Another additional benefit here is that this style also
supports providing UNC paths as well.
Fixes #5136
This change replaces the current SSE-C key derivation scheme. The 'old'
scheme derives a unique object encryption key from the client-provided key.
This key derivation was not invertible. That means that a client cannot change
its key without changing the object encryption key.
AWS S3 allows users to update their SSE-C keys by executing an SSE-C COPY with
source == destination. AWS probably updates just the metadata (which is a very
cheap operation). The old key derivation scheme would require a complete copy
of the object because the minio server would not be able to derive the same
object encryption key from a different client-provided key (without breaking
the cryptographic hash function).
This change makes the key derivation invertible.
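One way to get an invertible scheme is to seal a random per-object key with the client-provided key, so rotating the client key means unsealing and resealing a few bytes of metadata rather than copying the object. The sketch below is illustrative only (AES-GCM sealing with hypothetical seal/unseal helpers), not the server's actual crypto code:
```
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
)

// seal encrypts a random per-object key with the client key; the sealed
// key and IV live in the object metadata.
func seal(clientKey, objectKey []byte) (sealed, iv []byte, err error) {
	block, err := aes.NewCipher(clientKey) // clientKey is 32 bytes for SSE-C
	if err != nil {
		return nil, nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, nil, err
	}
	iv = make([]byte, gcm.NonceSize())
	if _, err = rand.Read(iv); err != nil {
		return nil, nil, err
	}
	return gcm.Seal(nil, iv, objectKey, nil), iv, nil
}

// unseal inverts seal; a key rotation is unseal(old) followed by
// seal(new) - a metadata-only update, no object copy.
func unseal(clientKey, sealed, iv []byte) ([]byte, error) {
	block, err := aes.NewCipher(clientKey)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return gcm.Open(nil, iv, sealed, nil)
}
```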
This change adds server-side-encryption support for HEAD, GET and PUT
operations. This PR only addresses single-part PUTs and GETs without
HTTP ranges.
Further this change adds the concept of reserved object metadata which is required
to make encrypted objects tamper-proof and provide API compatibility to AWS S3.
This PR adds the following reserved metadata entries:
- X-Minio-Internal-Server-Side-Encryption-Iv ('guarantees' tamper-proof property)
- X-Minio-Internal-Server-Side-Encryption-Kdf (makes Key-MAC computation negotiable in future)
- X-Minio-Internal-Server-Side-Encryption-Key-Mac (provides AWS S3 API compatibility)
The prefix `X-Minio-Internal` specifies an internal metadata entry which must not
be sent to clients. All client requests containing a metadata key starting with
`X-Minio-Internal` must also be rejected. This is implemented by a generic handler.
This PR implements SSE-C separately from client-side encryption (CSE). It cannot
decrypt server-side-encrypted objects on the client side. However, clients can
encrypt the same object with both CSE and SSE-C.
This PR does not address:
- SSE-C Copy and Copy part
- SSE-C GET with HTTP ranges
- SSE-C multipart PUT
- SSE-C Gateway
Each point must be addressed in a separate PR.
Added to vendor dir:
- x/crypto/chacha20poly1305
- x/crypto/poly1305
- github.com/minio/sio
It is possible that x-amz-content-sha256 is set through
the query params in case of presigned PUT calls, make sure
that we validate the incoming x-amz-content-sha256 properly.
The current code simply allows this without honoring the
set x-amz-content-sha256; fix it.
Previously the ID/ETag from the backend service was used as-is, which causes
failures with s3cmd-like tools that use the ETag as a checksum to
validate data. This is fixed by prepending "-1".
Refer minio/mint#193, minio/mint#201
When MINIO_TRACE_DIR is provided, create a new log file and store all
HTTP request + response data; bodies are excluded to reduce memory
consumption. MINIO_HTTP_TRACE=1 enables logging. Uses non-memory-consuming
HTTP request/response recorders; the maximum is about 32k per request.
This logs to STDOUT; body logging is disabled for PutObject, PutObjectPart
and GetObject.
Verify() was being called by the caller after the data had been
successfully read, i.e. after io.EOF. This disconnect opens a race
under concurrent access to such an object. Verification is not
necessary outside of the Read() call; we can simply do checksum
verification right inside Read() at io.EOF.
This approach simplifies the usage.
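A minimal sketch of doing the verification inside Read() at io.EOF, assuming the expected SHA-256 is known up front (illustrative, not the actual XL bitrot reader):
```
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"hash"
	"io"
)

type verifyReader struct {
	r        io.Reader
	h        hash.Hash
	expected []byte
}

func newVerifyReader(r io.Reader, expected []byte) *verifyReader {
	return &verifyReader{r: r, h: sha256.New(), expected: expected}
}

func (v *verifyReader) Read(p []byte) (int, error) {
	n, err := v.r.Read(p)
	v.h.Write(p[:n])
	if err == io.EOF && !bytes.Equal(v.h.Sum(nil), v.expected) {
		return n, errors.New("checksum mismatch") // verified inside Read at EOF
	}
	return n, err
}
```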
In some cases, the cache manager returns an ErrCacheFull error when creating a
new cache buffer, but the code still sends object data to the nil cache buffer.
Don't print the errFileNotFound error, as it is expected that concurrent
complete-multipart-uploads or abort-multipart-uploads would have deleted
the file, so the file may not be found.
Fixes: https://github.com/minio/minio/issues/5056
Every so often we get requirements for creating
directories/prefixes and we end up rejecting
such requirements. This PR implements this and
allows empty directories without any new file
addition to backend.
Existing lower APIs themselves are leveraged to provide
this behavior. Only FS backend supports this for
the time being as desired.
s3cmd CLI fails when trying to upload a file to the Azure gateway.
Previous fixes in Azure to handle client-side encryption alone
did not completely address the problem.
We may need to convert all the x-amz-meta-<name> headers,
i.e. specifically <name> should be converted into a
C# identifier, as mentioned in the docs for `put-blob`.
https://docs.microsoft.com/en-us/rest/api/storageservices/put-blob
```
s3cmd put README.md s3://myanis/
upload: 'README.md' -> 's3://myanis/README.md' [1 of 1]
4598 of 4598 100% in 0s 47.24 kB/s done
upload: 'README.md' -> 's3://myanis/README.md' [1 of 1]
4598 of 4598 100% in 0s 50.47 kB/s done
ERROR: S3 error: 400 (InvalidArgument): Your metadata headers are not supported.
```
There is a separate issue with s3cmd after this fix is applied where
the ETag is wrongly validated: https://github.com/s3tools/s3cmd/issues/880
But that is an upstream s3cmd problem which wrongly interprets the ETag
as the md5sum of the content that was uploaded.
This PR addresses a long standing dependency on
`gopkg.in/check.v1` project used for our tests.
All tests are re-written to use the go default
testing framework instead.
There was no reason for us to use an external
package where Go tools are sufficient for this.
This is done to avoid repeated declaration of not-implemented
functions for each gateway. It also avoids a possible bug in go
https://github.com/golang/go/issues/18468 which is triggered on
our multiple PRs already.
- Add release-time conversion helpers
- Split GetCurrentReleaseTime() into two simpler functions.
- Avoid appending strings when assembling user-agent string.
- Reorder release info URLs to check the newer URLs earlier.
- Remove trivial low-level functions created solely for the purpose of
writing tests.
- Remove some unnecessary tests.
The Amazon S3 API expects every incoming stream to have a content-length
set; it was superfluous for us to support an object layer that accepts
streams of unknown size as well. This PR removes such requirements
and explicitly errors out if the input size is less than zero.
* Enable ListMultipartUploads and ListObjectParts for FS.
Previously we had disabled ListMultipartUploads and ListObjectParts
to see if any clients break. Docker registry broke. This patch
enables ListMultipartUploads and ListObjectParts, however
ListMultipartUploads with prefix based listing is not
supported (which is not used by docker registry anyway).
i.e. ListMultipartUploads will need the exact object name.
Gateway implementation of ListObjectsV1 does not validate maxKeys range.
Raise an InvalidArgument when maxKeys is negative so that ListObjects
call is compatible with S3 on all gateways.
Gateway interface implementations of GetBucketInfo() under the
azure and s3 gateways did not perform any bucket name input
validation, resulting in incorrect responses when the tests
expect InvalidBucketName.
Fixes #4983
When running `make test` in docker, two test cases cause hanging.
This Patch fixes the problem by removing those test cases.
Thanks to @ws141 for identifying the problem.
The reedsolomon library now avoids allocations during reconstruction.
This change exploits that to reduce memory allocations and GC pressure during
healing and reading.
Previously we were wrongly adding `?` as part
of the resource name, add a test case to check
if this is handled properly.
Thanks to @kannappanr for reproducing this.
Without this change presigned URL generated with following
command would fail with signature mismatch.
```
aws s3 presign s3://testbucket/functional-tests.sh
```
It can happen that an incoming PutObject() request might
have inputs of the following form, e.g.:
- bucketName is 'testbucket'
- objectName is '/'
bucketName exists and was previously created, but there
are no other objects in this bucket. In a situation like
this parentDirIsObject() goes into an infinite loop.
Verifying whether '/' is an object fails on both backends,
but `path.Dir('/')` returns `'/'`, which causes
the closure to loop onto itself.
Fixes #4940
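The non-termination is easy to demonstrate: path.Dir("/") returns "/", so any loop walking up parents must compare against its previous value (a standalone sketch of the guard):
```
package main

import (
	"fmt"
	"path"
)

func main() {
	p := "/"
	for {
		parent := path.Dir(p)
		if parent == p { // without this guard the loop never terminates
			break
		}
		p = parent
	}
	fmt.Println("done")
}
```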
This change removes the ReadFileWithVerify function from the
StorageAPI. The ReadFile was basically a redirection to ReadFileWithVerify.
This change removes the redirection and moves the logic of
ReadFileWithVerify directly into ReadFile.
This removes a lot of unnecessary code in all StorageAPI implementations.
Fixes #4946
* review: fix doc and typos
Previously, init multipart upload stored metadata of an object which is
used for complete multipart. This patch makes the azure gateway store the
metadata of the init multipart object in Azure under the name
'minio.sys.tmp/multipart/v1/<UPLOAD-ID>/meta.json' and use this
information on complete multipart.
This change refactors the ObjectLayer PutObject and PutObjectPart
functions. Instead of passing an io.Reader and a size to PUT operations,
ObjectLayer expects a HashReader.
A HashReader verifies the MD5 sum (and SHA256 sum if required) of the object.
This change updates all PutObject(Part) calls and removes unnecessary code
in all ObjectLayer implementations.
Fixes #4923
This is an improvement upon existing implementation
by avoiding transfer of access and secret keys over
the network. This change only exchanges JWT tokens
generated by an rpc client. Even if the JWT can be
traced over the network on a non-TLS connection, this
change makes sure that we never really expose the
secret key over the network.
Previously the minio gateway returned an invalid bucket name error for invalid
metadata. This is fixed by returning BadRequest with 'Unsupported
metadata' in the response.
Fixes #4891
When servers are started simultaneously across multiple
nodes, or when simulating a local setup, it can happen
that one of the servers in the setup observes the following:
- Some servers are formatted
- Some servers are unformatted
- Some servers are offline
The current state machine doesn't handle this correctly. With a
mix of unformatted, formatted and offline disks we do not
decisively know the course of action, so we wait for the offline
disks to change their state.
Once the offline disks change their state to one of the following
states we can decisively move forward:
- nil (formatted disk)
- errUnformattedDisk
- Or any other error such as errCorruptedDisk.
Fixes #4903
The default timeout of 30secs is not enough for high latency
environments; change these values to use 15 minutes instead.
With 30secs, I/O timeouts seem to be quite common, which leads
to pretty much most SDKs and clients reconnecting. This in turn
causes significant performance problems. On a slow
interconnect it can be quite challenging to transfer large
amounts of data. Setting this value to 15 minutes covers
pretty much all known cases.
This PR was tested with `wondershaper <NIC> 20000 20000` by
limiting the network bandwidth to 20Mbit/sec. The default timeout
caused a significant amount of I/O timeouts, leading to
constant retries from the client. This seems to be more common
with tools like rclone and restic, which have high concurrency set
by default. Once the value was fixed to 15 minutes, I/O timeouts
stopped and the client could steadily upload data to the server
even while saturating the network.
Fixes #4670
Previously, if any multipart part of size > 100MiB was uploaded, the azure
gateway returned an error.
This patch fixes the issue by splitting each multipart part into sub-parts
of 100MiB. On complete multipart, it fetches all uploaded
azure block ids for each part and performs completion.
Fixes #4868
- Region handling can now use region endpoints directly.
- All uploads are streaming, no more large buffers needed.
- Major API overhaul for CopyObject(dst, src)
- Fixes bugs present in existing code for copying
- metadata replace directive CopyObject
- PutObjectPart doesn't require md5Sum and sha256
All `net/rpc` requests go to `/minio`, so the existing
generic handler for the reserved bucket check would
erroneously send errors, leading distributed setups to
wait infinitely.
For `net/rpc` requests alone we should skip this check and
allow the reserved bucket name `/minio`.
The current code was just using io.ReadAll() on an fd whose
offset might have moved underneath due to a concurrent
read operation; a subsequent read will result in EOF.
We should always read from the beginning again. pread()
is allowed on all platforms, so use io.SectionReader to
read from the beginning of the file.
Fixes #4842
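A minimal sketch of the pread-style access, assuming a hypothetical helper around io.NewSectionReader; ReadAt calls never depend on or move the shared file offset:
```
package main

import (
	"io"
	"os"
)

// readAllFromStart issues ReadAt-style reads via io.SectionReader, so it
// neither depends on nor moves the file's shared offset.
func readAllFromStart(f *os.File) ([]byte, error) {
	fi, err := f.Stat()
	if err != nil {
		return nil, err
	}
	return io.ReadAll(io.NewSectionReader(f, 0, fi.Size()))
}
```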
Bcrypt is not necessary and was not used properly. This change
replaces the whole bcrypt hash computation with a constant-time
compare and removes bcrypt from the code base.
Fixes #4813
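A minimal sketch of the constant-time compare using the standard library (illustrative helper name):
```
package main

import "crypto/subtle"

// secretKeyMatch reports whether the provided secret matches, taking the
// same amount of time for any pair of equal-length inputs.
func secretKeyMatch(provided, actual string) bool {
	return subtle.ConstantTimeCompare([]byte(provided), []byte(actual)) == 1
}
```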
If a TopicConfiguration element or CloudFunction element is found in
configuration submitted to the PutBucketNotification API, a BadRequest
error is returned.
S3 only allows HTTP headers with a size of 8 KB and user-defined metadata
with a size of 2 KB. This change adds a new API error and returns this
error to clients that send too-large HTTP requests.
Fixes #4634
We don't need to typecast identifiers from
their base type to the same type again. This
is not a bug and the compiler is fine with it,
but it is better to avoid when not needed.
This change provides new implementations of the XL backend operations:
- create file
- read file
- heal file
Further this change adds table based tests for all three operations.
This affects also the bitrot algorithm integration. Algorithms are now
integrated in an idiomatic way (like crypto.Hash).
Fixes #4696, #4649, #4359
Wait for remote hosts to resolve instead of failing on first host
resolution error, when running in Kubernetes or Docker environment.
Note that
- Waiting is based on exponential back-off mechanism
- If run as a binary, server fails if remote host is not resolvable
This is needed because in orchestration platforms like Kubernetes, remote
hosts are started sequentially and all the hosts are not up initially,
though they are expected to come up in a short time frame
It is difficult to identify a cap on the waiting time due to
non-deterministic nature of infrastructure platforms, so the server waits
infinitely for the hosts to come up, while logging the error messages to
the console.
Fixes: https://github.com/minio/minio/issues/4669
Since go1.8, os.RemoveAll and os.MkdirAll both support long
path names, i.e. UNC paths on windows. The code we were carrying
was directly borrowed from the `pkg/os` package and doesn't need
to be in our repo anymore. As a side effect this also
addresses our code coverage issue.
Refer #4658
* Prevent unnecessary verification of parity blocks while reading erasure
coded file.
* Update klauspost/reedsolomon and just only reconstruct data blocks while
reading (prevent unnecessary parity block reconstruction)
* Remove verification of (all) reconstructed data and parity blocks, since
in our case we are protected by bit rot protection. And even if the
verification would fail (essentially impossible) there is no way to
definitively say whether the data is still correct or not, so this call
makes no sense for our use case.
Implement an offline mode for remote storage to cache the
offline status of a node in order to prevent network calls
that are bound to fail. After a time interval an attempt
will be made to restore the connection and mark the node
as online if successful.
Fixes #4183
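A minimal sketch of caching the offline status, assuming a fixed retry interval and hypothetical names (not the actual storage RPC layer):
```
package main

import (
	"errors"
	"sync"
	"time"
)

var errDiskOffline = errors.New("remote storage offline")

type remoteDisk struct {
	mu          sync.Mutex
	offlineTill time.Time
}

// call wraps a network operation and short-circuits while offline.
func (d *remoteDisk) call(op func() error) error {
	d.mu.Lock()
	if time.Now().Before(d.offlineTill) {
		d.mu.Unlock()
		return errDiskOffline // skip the network call that is bound to fail
	}
	d.mu.Unlock()
	if err := op(); err != nil {
		d.mu.Lock()
		d.offlineTill = time.Now().Add(30 * time.Second) // retry after an interval
		d.mu.Unlock()
		return err
	}
	return nil
}
```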
At times, due to a typo, a user intending distributed mode
might end up starting standalone erasure mode, causing confusion.
Add code to check this based on standard heuristic guesswork
and report an error to the user.
Fixes #4686
Under the call flow
```
Readdir
+
|
|
| path-entry
|
|
v
StatDir
```
The existing code was written in a manner where, if, say,
a bucket/top-level directory was indeed deleted
between Readdir() and StatDir(), we would
ignore certain errors. This is not a plausible
situation and might not happen in almost all
practical cases. We do not have to look for
or interpret these errors returned by StatDir();
instead we can just collect the successful
values and return them to the client. We do not
need to prematurely decide on bucket access;
we just let the filesystem decide subsequently for
real I/O operations.
Refer #4658
This is in preparation for updated admin heal API.
* Improve case analysis of healFormatXL() - fixes a case where disks
could have unhandled errors.
* Simplify healFormatXLFreshDisks() and healFormatXLCorruptedDisks()
to share more code and handle fewer cases for improved simplicity
and reduced code repetition.
* Fix test cases.
This commit changes posix's deleteFile() to not upstream errors from
removing parent directories. This fixes a race condition.
The race condition occurs when multiple deleteFile()s are called on the
same parent directory, but different child files. Because deleteFile()
recursively removes parent directories if they are empty, but
deleteFile() errors if the selected deletePath does not exist, there was
an opportunity for a race condition. The two processes would remove the
child directories successfully, then depend on the parent directory
still existing. In some cases this is an invalid assumption, because
other processes can remove the parent directory beforehand. This commit
changes deleteFile() to not upstream an error if one occurs, because the
only required error should be from the immediate deletePath, not from a
parent path.
In the specific bug report, multiple CompleteMultipartUpload requests
would launch multiple deleteFile() requests. Because they chain up on
parent directories, ultimately at the end there would be multiple
removals of the ultimate parent directory,
.minio.sys/multipart/{bucket}. Because only one will succeed and the rest
will fail, an error would be upstreamed saying that the file does not
exist, and the CompleteMultipartUpload code interpreted this as
NoSuchKey, i.e. that the object/part id doesn't exist. This was faulty
behavior and is now fixed.
The added test fails before this change and passes after this change.
Fixes: https://github.com/minio/minio/issues/4727
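A minimal sketch of the fixed behavior, with hypothetical signatures: only the immediate deletePath may report an error, and failures while pruning parents are swallowed:
```
package main

import (
	"os"
	"path/filepath"
)

// deleteFile removes deletePath and then prunes empty parents up to
// basePath. Only the immediate removal may surface an error.
func deleteFile(basePath, deletePath string) error {
	if err := os.Remove(deletePath); err != nil {
		return err // the only error the caller needs to see
	}
	// Races with other deleters on shared parents are harmless, so
	// errors here are deliberately not upstreamed.
	for p := filepath.Dir(deletePath); p != basePath; p = filepath.Dir(p) {
		if os.Remove(p) != nil {
			break // parent not empty, already gone, or hit the root
		}
	}
	return nil
}
```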
This commit adds a new test for isDirEmpty (for code coverage) and
changes around the error conditional. Previously, there was a `return
nil` statement that would only be triggered under a race condition and
would trip up our test coverage for no real reason. With this new error
conditional, there's no awkward 'else'-esque condition, which means test
coverage will not change between runs for no reason in this specific
test. It's also a cleaner read.
This commit makes fsDeleteFile() simply call deleteFile() after calling
the relevant path length checking functions. This DRYs the code base.
This commit removes the Stat() call from deleteFile(). This improves
performance and removes any possibility of a race condition.
This additionally adds tests and a benchmark for said function. The
results aren't very consistent, although I'd expect this commit to make
it faster.
This commit fixes a potential security issue, whereby a full-access
token to the server would be available in the GET URL of a download
request. This fixes that issue by introducing short-expiry tokens, which
are only valid for one minute, and are regenerated for every download
request.
This commit specifically introduces the short-lived tokens, adds tests
for the tokens, adds an RPC call for generating a token given a
full-access token, updates the browser to use the new tokens for
requests where the token is passed as a GET parameter, and adds some
tests with the new temporary tokens.
Refs: https://github.com/minio/minio/pull/4673
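A minimal sketch of a one-minute token, assuming a plain HMAC scheme for illustration (the server's actual tokens may differ):
```
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"strconv"
	"strings"
	"time"
)

// makeToken mints a token valid for one minute.
func makeToken(secret []byte) string {
	expiry := strconv.FormatInt(time.Now().Add(time.Minute).Unix(), 10)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(expiry))
	return expiry + "." + base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// checkToken rejects expired or forged tokens.
func checkToken(secret []byte, token string) bool {
	parts := strings.SplitN(token, ".", 2)
	if len(parts) != 2 {
		return false
	}
	exp, err := strconv.ParseInt(parts[0], 10, 64)
	if err != nil || time.Now().Unix() > exp {
		return false // expired
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(parts[0]))
	want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(want), []byte(parts[1]))
}
```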
This PR fixes the issue of cleaning up in-memory state
properly. Without this PR we can end up in security
situations where a new bucket would inherit wrong
permissions and expose objects erroneously.
Fixes #4714
* Refactor HTTP server to address bugs
* Remove unnecessary goroutine to start multiple TCP listeners.
* HTTP server waits for shutdown up to a maximum of Server.ShutdownTimeout
rather than per serverShutdownPoll.
* Handles new connection errors properly.
* Handles read and write timeout properly.
* Handles error on start of HTTP server properly by exiting minio
process.
Fixes #4494, #4476 & fixed review comments
This PR serves to fix the following things in the GCS gateway.
- fixes leaks in object reader and writer not getting closed
under certain situations, which led to goroutine leaks.
- an apparently confusing issue in the case of complete multipart upload,
where it is currently possible for an entirely different
object name to concatenate parts of a different object name
if you happen to know the upload-id and parts of the object.
This is a very rare scenario but it is possible.
- succinct usage of certain parts of the code base and re-use.
Fixed header-to-metadata extraction. The extractMetadataFromHeader function now
returns an InternalError if the http.Header contains a non-canonicalized key, and
logs the error. The reason is that keys can be set manually (through a map access),
which can lead to ugly bugs.
This is needed to avoid proxies buffering the connection;
it is also the HTTP-standard way to handle the situation
where the server is sending back events asynchronously.
For more details read https://goo.gl/RCML9f
Fixes - https://github.com/minio/minio-go/issues/731
When the browser asks for a GET presigned url, the latter is not
encoded, which can be confusing when the user copy-pastes it somewhere,
especially when the path contains a space.
The current state machine didn't honor a situation
which can arise when there is a combination of
- formatted
- unformatted
- corrupted
disks - this combination invariably goes into a
mode where all servers are waiting perpetually,
thinking we will get quorum in the future.
At this point there is only a distant possibility of
ever getting a quorum, since we don't even have
a quorum number of usable disks.
We should exit and print a proper message per disk
to indicate what went wrong and what was detected
by the server.
Refer #4477
The ETag is constructed from the md5 attribute of the object attributes
returned by the vendor's Composer. The md5 attribute comes back
as nil for large uploads; the CRC32C should be used instead.
Refer to https://cloud.google.com/storage/docs/hashes-etags
Fixes #4397
This implementation is similar to AMQP notifications:
* Notifications are published on a single topic as a JSON feed
* Topic is configurable, as is the QoS. Uses the paho.mqtt.golang
library for the mqtt connection, and supports connections over tcp
and websockets, with optional secure tls support.
* Additionally the minio server configuration has been bumped up
so mqtt configuration can be added.
* Configuration migration code is added with tests.
MQTT is an ISO standard M2M/IoT messaging protocol and was
originally designed for applications on limited-bandwidth
networks. Today its use is growing in the IoT space.
xl.storageDisks is sometimes passed to some low-level XL functions. Some disks in
xl.storageDisks are set to nil when they encounter errors. This means all
elements in xl.storageDisks will be nil after some time, which leads to an unusable XL.
Looks like if we follow pattern such as
```
_ = rlk
```
Go can potentially kick in GC and close the fd when
the reference is lost; the only speculation is that
the cause here is the `SetFinalizer` which is set on
`os.close()` internally in the `os` stdlib.
This is unexpected and unusual behavior for Go, but
we have to make sure the reference is never lost
and always lives as long as the server.
Fixes #4530
This patch also reverts previous changes which were
merged for migration to the newer disk format. We will
be bringing these changes in subsequent releases. But
we wish to add protection in this release such that
future release migrations are protected.
Revert "fs: Migration should handle bucketConfigs as regular objects. (#4482)"
This reverts commit 976870a391.
Revert "fs: Migrate object metadata to objects directory. (#4195)"
This reverts commit 76f4f20609.
isDocker was reading from the `/proc/cgroup` file. But
this file alone is not conclusive evidence. Docker
internally creates `.dockerenv` as a special file, which we should
use instead.
Fixes #4456
The current code failed to anticipate the existence of files
which could have been created to corrupt the namespace, such
as a `policy.json` file created at the bucket top level.
In the current release, creating such a file conflicts
with the namespace for future bucket policy operations.
We implemented migration of the backend format to avoid situations
such as these.
This PR handles this situation and makes sure that such
erroneous files are moved properly.
Fixes #4478
The current code wrongly allowed generating secret keys of length up to 100;
we should only use 100 as a value to validate, but for generating
it should be 40.
Fixes #4470
This makes lock RPCs similar to other RPCs where requests to the local
server bypass the network. Requests to the local lock-subsystem may
bypass the network layer and directly access the locking
data-structures.
This incidentally fixes #4451.
Currently redirection doesn't work in the following scenario:
- the server is started on port ":80" with TLS configured;
a client making an insecure request on port "80"
gets redirected to port 443 and fails.
The fix in commit f44f2e341c
was incomplete and we still had presigned URLs printing
query strings in the wrong fashion.
This PR fixes this properly. Avoid double-encoding
percent-encoded strings, which printed as
`s3%!!(MISSING)A(MISSING)`
Print them properly JSON-encoded:
`s3%3AObjectCreated%3A%2A`
Currently, even when the bucket doesn't exist, we wrongly
return success when an object is a directory prefix with
'/' as suffix and of size 0.
This PR fixes this behavior.
Sending envVars along with the access and secret keys
exposes all of the minio server's sensitive
information. This will be an unexpected
situation for all users.
If at all we need to look for things like whether
credentials are set through env, we should
have access to only this information,
not the entire set of system envs.
This is an enhancement to the XL/distributed-XL mode. FS mode is
unaffected.
The ReadFileWithVerify storage-layer call is similar to ReadFile with
the additional functionality of performing bit-rot checking. It
accepts additional parameters for a hashing algorithm to use and the
expected hex-encoded hash string.
This patch provides significant performance improvement because:
1. combines the step of reading the file (during
erasure-decoding/reconstruction) with bit-rot verification;
2. limits the number of file-reads; and
3. avoids transferring the file over the network for bit-rot
verification.
ReadFile API is implemented as ReadFileWithVerify with empty hashing
arguments.
Credits to AB and Harsha for the algorithmic improvement.
Fixes #4236.
This PR also does backend format change to 1.0.1
from 1.0.0. Backward compatible changes are still
kept to read the 'md5Sum' key. But all new objects
will be stored with the same details under 'etag'.
Fixes #4312
Previous message
```
Migration from version ‘17’ to ‘18’ completed successfully.
```
This, for example, didn't provide any meaningful insight.
This PR improves the message as below:
```
Configuration file '/home/harsha/.minio/config.json' migrated from version '17' to '18' successfully.
```
Fixes #4199
This change adopts the upstream fix in this regard at
https://go-review.googlesource.com/#/c/41834/ for Minio's
purposes.
Go's current os.Stat() lacks support for a lot of strange
Windows files such as
- share symlinks on SMB2
- symlinks on docker nanoserver
- de-duplicated files on an NTFS de-duplicated volume.
This PR attempts to incorporate the change mentioned here
https://blogs.msdn.microsoft.com/oldnewthing/20100212-00/?p=14963/
The article suggests to use Windows I/O manager to
dereference the symbolic link.
Fixes #4122
We need to have the local peer initialized properly
for listen bucket to work; the current code did initialize
it, but the resulting code was initializing the
peer on a wrong target v/s what listen bucket expected
it to be.
This regression came in de204a0a52. Fixes #4158
Avoid using `time.Now()`; instead rely on UTC time
for the final deadline, to be consistent with
all our internal functions.
Reduce the default read timeout to 15 seconds
in light of a newly discovered issue
- https://github.com/minio/minio/issues/4139
Additionally also change the Read() conn wrapper
to set deadline only upon successful Reads().
Current log prints in this form
```
ERRO[8150] Lock maintenance failed to remove entry for write
lock (should never happen)%!!(MISSING)(EXTRA ....
```
Fix this by using a proper formatting directive.
Duration for which a lock was held can be computed from the `Since`
field of `OpsLockState`. It is the difference between current time and
time at which the namespace lock was held. This change avoids
superfluous instrumentation.
The previous value was set to avoid large cache value build
up, but we can clearly see this can cause lots of GC
pauses which can lead to a significant drop in performance.
Change this value to 50% and decrease it to 25%
once 75% of the cache size is used, to have a larger
window for GC pauses.
Another change is to only allow caching if a server has
more than 24GB of RAM instead of 8GB.
In a situation where all errors were
ignored, we need to reduce the errors using
readQuorum to get a consistent error value.
Without this change the errors generated will
never be consistent for an expected scenario.
For example, in a 6-disk setup where 1 disk is missing
and 5 do not have the volume (testbucket),
without this change Stat() would result in different
errors depending on which disk died. This can cause
confusion for S3 client applications.
This change addresses the need to track the types of
errors we ignored and brings in readQuorum to
choose the maximally occurring error as the value
of truth.
getBucketInfo() should keep track of ignored errors,
such that in a situation where all errors were
ignored we can reduce the errors using readQuorum
to get a consistent error value.
This is the problem we see with the DiskNotFound test, where
disks are randomly removed.
Fixes #4095
- Due to its usage of the Amazon SDK, Spark expects the md5sum of the empty string to be
returned when it does a PUT on a directory.
- The fix returns the md5sum of the empty string for the above-mentioned case.
- This fixes the issue of Apache Spark not being able to write into Minio.
Ignore any network errors when registering a webhook
notifier during Minio startup sequence. This way server
can be started even if the webhook endpoint is not available
and unreachable.
This is to comply with S3 behavior. We previously removed
reading `fs.json` for optimization reasons, but we have
reason to believe that providing the ETag, and using gjson
to avoid the unmarshalling overhead of the golang stdlib,
provides the needed benefit.
Fixes #4028
Values of canonicalized query resources should be unescaped before calculating
the signature. This bug was not noticed before because partNumber and uploadID
values in Minio don't have characters that need to be escaped.
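A small demonstration of the unescaping step, using net/url (illustrative, not the signature code itself):
```
package main

import (
	"fmt"
	"net/url"
)

func main() {
	escaped := "uploadId=abc%2Fdef"
	unescaped, err := url.QueryUnescape(escaped)
	if err != nil {
		panic(err)
	}
	fmt.Println(unescaped) // "uploadId=abc/def" is what enters the string-to-sign
}
```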
Separate out validating v/s parsing logic in
isValidLocationConstraint() into parseLocationConstraint()
and isValidLocation()
Additionally also set `X-Amz-Bucket-Region` as part of the
common headers for the clients to fallback on in-case of any
region related errors.
Healing of buckets, objects and incomplete uploads are implemented and
available via admin REST APIs. Additionally, it is available via mc admin
sub-command. The warning is no longer relevant.
Fixes #4030
`disksUnavailable` healStatus constant indicates that a given object
needs healing but one or more of disks requiring heal are offline. This
can be used by admin heal API consumers to distinguish between a
successful heal and a no-op since the outdated disks were offline.
This change adds `access` format support for notifications to an
Elasticsearch server, and it refactors `namespace` format support.
In the case of `access` format, for each event in Minio, a JSON
document is inserted into Elasticsearch with its timestamp set to the
event's timestamp, and with the ID generated automatically by
elasticsearch. No events are modified or deleted in this mode.
In the case of `namespace` format, for each event in Minio, a JSON
document keyed by the bucket and object name is updated in
Elasticsearch. When an object is created or over-written
in Minio, a new or existing document is inserted into the
Elasticsearch index. If an object is deleted in Minio, the
corresponding document is deleted from the Elasticsearch index.
Additionally, this change upgrades Elasticsearch support to the 5.x
series. This is a breaking change, and users of previous elasticsearch
versions should upgrade.
Also updates documentation on Elasticsearch notification target usage
and has a link to an elasticsearch upgrade guide.
This is the last patch that finally resolves #3928.
Do not rely on a specific cipher suite; instead let
Go choose the type of cipher needed. If the connection
is coming from clients which do not support forward
secrecy, let the Go TLS stack handle this automatically based
on TLS 1.2 specifications.
Fixes #4017
url.Parse() wrongly parses an address of the format "address:port",
which is fixed in go1.8. This introduces a breaking change
on our end. We should fix this wrong usage everywhere so that
migrating to go1.8 eventually becomes smoother.
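The ambiguity is easy to demonstrate: url.Parse treats "address:port" as a scheme plus opaque part, while net.SplitHostPort takes such addresses apart unambiguously:
```
package main

import (
	"fmt"
	"net"
	"net/url"
)

func main() {
	u, _ := url.Parse("localhost:9000")
	fmt.Println(u.Scheme, u.Opaque) // "localhost" "9000" - not a host at all
	host, port, _ := net.SplitHostPort("localhost:9000")
	fmt.Println(host, port) // "localhost" "9000"
}
```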
Previously serverConfigV17 used a global lock that made every instance of
serverConfigV17 depend on the single global serverConfigMu.
This patch fixes that by having an individual lock per instance.
This is an enhancement to support all
the data fields present on the object. Currently
we only send a subset of the data which object info
provides us.
It also helps us keep a full namespace mirror on
notification targets for efficient querying.
CopyObjectHandler() was incorrectly performing a comparison
between destination and source object paths, which sometimes
leads to a lock race. This PR simplifies the comparison and adds
one test case.
This change adds `access` format support for notifications to a Redis
server, and it refactors `namespace` format support.
In the case of `access` format, a list is used to store Minio
operations in Redis. Each entry in the list is a JSON encoded list of
two items - the first is the Minio server timestamp of the event, and
the second is an object describing the operation that created/replaced
the object in the server.
In the case of `namespace` format, a hash is used. Entries in the hash
may be updated or removed if objects in Minio are updated or deleted
respectively. The field values in the Redis hash are JSON encoded.
Also updates documentation on Redis notification target usage.
Towards resolving #3928
The following form of arguments such as
```
minio.exe -C some_dir server dir
```
has stopped working because of the lack of handling of
absolute paths for the config directory. Always calculate
the absolute path for any relative path on any operating
system.
The following fix converts all config-directory relative
paths into absolute paths.
Fixes #3991
We can't use Content-Encoding to verify if `aws-chunked` is set
or not. Just use the 'streaming' signature header instead.
While the spec considers this mandatory, aws-sdk-java on the
contrary doesn't set this value:
http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html
```
Set the value to aws-chunked.
```
We will relax it and behave appropriately. Also this PR supports
saving custom encoding after trimming off the `aws-chunked`
parameter.
Fixes #3983
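A minimal sketch of trimming the S3-internal token out of a Content-Encoding value while saving the rest (hypothetical helper):
```
package main

import (
	"fmt"
	"strings"
)

// trimAwsChunked drops the S3-internal token and keeps any user-supplied
// encodings so they can be saved with the object.
func trimAwsChunked(contentEncoding string) string {
	var kept []string
	for _, enc := range strings.Split(contentEncoding, ",") {
		if e := strings.TrimSpace(enc); e != "" && e != "aws-chunked" {
			kept = append(kept, e)
		}
	}
	return strings.Join(kept, ",")
}

func main() {
	fmt.Println(trimAwsChunked("aws-chunked,gzip")) // "gzip"
}
```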
* Add configuration parameter "format" for db targets and perform
configuration migration.
* Add PostgreSQL `access` format: This causes Minio to append all events
to the configured table. Prefix, suffix and event filters continue
to be supported for this mode too.
* Update documentation for PostgreSQL notification target.
* Add MySQL `access` format: It is very similar to the same format for
PostgreSQL.
* Update MySQL notification documentation.
The statically typed BrowserFlag prevents arbitrary string values.
The wrapped bool marshals/unmarshals JSON according to the
typed value, i.e. the string value "on" represents boolean true and "off"
boolean false.
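A minimal sketch of such a typed flag, assuming "on"/"off" are the only legal JSON encodings (illustrative, mirroring the description above):
```
package main

import (
	"encoding/json"
	"fmt"
)

type BrowserFlag bool

func (bf BrowserFlag) MarshalJSON() ([]byte, error) {
	if bf {
		return json.Marshal("on")
	}
	return json.Marshal("off")
}

func (bf *BrowserFlag) UnmarshalJSON(data []byte) error {
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return err
	}
	switch s {
	case "on":
		*bf = true
	case "off":
		*bf = false
	default:
		return fmt.Errorf("invalid browser flag %q", s) // arbitrary strings rejected
	}
	return nil
}
```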
This keeps portability and also avoids errors that
might occur when using the functions written for URL resource names,
since query param values have different escaping requirements.
In the algorithm to check if an object requires healing, in addition to
checking if all disks have xl.json present we should check if all parts
of the object are present and have valid blake2b checksums.
Also fixed a minor compilation error in heal-objects-list.go.
This patch fixes the following:
* Previously fatalIf() never wrote logs to any but the first logging target.
* The quiet flag was not honored to suppress progress messages other than startup messages.
* Removes console package usage for progress messages.
For listing of objects needing heal, we list all objects present on all
the disks and return the set union. We were incorrectly dropping objects
that hadn't already been seen on the disks walked so far.
Sample directory layout of disks in a 4-disk setup:
`/tmp/1`, `/tmp/2`, `/tmp/3`, `/tmp/4` are directories used as disks here.
`test` is the bucket, `obj1` and obj2` are the objects.
```
/tmp/1/test
└── obj2
├── part.1
├── part.2
└── xl.json
/tmp/2/test
└── obj1
├── part.1
├── part.2
└── xl.json
/tmp/3/test
├── obj1
│ ├── part.1
│ ├── part.2
│ └── xl.json
└── obj2
├── part.1
├── part.2
└── xl.json
/tmp/4/test
[This is empty]
```
This change adds information like the host, port and user-agent of the
client whose request triggered an event notification.
E.g., if someone uploads an object to a bucket using mc and notifications
were configured on that bucket, the host, port and user-agent of mc
would be sent as part of the event notification data.
Sample output:
```
"source": {
"host": "127.0.0.1",
"port": "55808",
"userAgent": "Minio (linux; amd64) minio-go/2.0.4 mc ..."
}
```
* Add a new function Save() which saves given configuration into given file.
* Simplify Load() function.
* Remove unused CheckVersion().
* CheckData() is a private function now.
* quick_test.go is part of quick package now.
* minio server uses top level quick.Load() and quick.Save() functions.
Previously, erasure backend's `listDirFactory` may return errors which
were explicitly ignored. With this change, it returns nil. Superfluous
checks at higher-layers for ignored errors are removed as well.
As a new configuration parameter is added, configuration version is
bumped up from 14 to 15.
The MySQL target's behaviour is identical to the PostgreSQL: rows are
deleted from the MySQL table on delete-object events, and are
created/updated on create/over-write events.
This API is meant for administrative tools like mc-admin to heal an
ongoing multipart upload on a Minio server. N.B. this set of admin
APIs applies only to Minio servers.
`github.com/minio/minio/pkg/madmin` provides a go SDK for this (and
other admin) operations. Specifically,
func HealUpload(bucket, object, uploadID string, dryRun bool) error
Sample admin API request:
POST
/?heal&bucket=mybucket&object=myobject&upload-id=myuploadID&dry-run
- Header(s): ["x-minio-operation"] = "upload"
Notes:
- bucket, object and upload-id are mandatory query parameters
- if dry-run is set, API returns success if all parameters passed are
valid.
checkURL() is a generic function to check if a passed address
is valid. This commit adds support for addresses like `m1`
and `172.16.3.1` which is needed in MySQL and NATS. This commit
also adds tests.
HEAD Object for FS and XL was returning invalid object name when
an object name has a trailing slash separator, this PR changes the
behavior and will always return 404 object not found, this guarantees
a better compatibility with S3 spec.
This change is cleanup of the postPolicyHandler code
primarily to address the flow and also converting
certain critical parts into self contained functions.
It was possible to upload a big file that overcomes the minimal
disk space limit in XL. PrepareFile was actually checking for disk
space, but we weren't checking its returned error. This patch fixes
that behavior.
* fs: Rename tempObjPath variable in fsCreateFile()
* fs/posix: Factor checkDiskFree() function
* fs: Add disk free check in fsCreateFile()
* posix: Move free disk check to createFile()
* xl: Relax free disk check in POSIX initialization
* fs: checkDiskFree checks for space to store data
This improves the startup time significantly
for clusters which have lot of buckets.
Also fixes a bug where `.minio.sys` is created
on disks which do not have `format.json`
startOffset was re-assigned to '0', so it would end up
copying the wrong content, ignoring the requested startOffset.
This also fixes the corruption issue we observed while
using docker registry.
Fixes https://github.com/docker/distribution/issues/2205
Also fixes#3842 - incorrect routing.
The globalMaxObjectSize limit is enshrined in the S3 spec, perhaps
due to certain limitations of the S3 infrastructure. For minio we
don't have such limitations and we can stream a larger file
instead.
So we are going to bump this limit to 16GiB.
Fixes #3825
This function was returning BucketNotFound for all errors,
which at least hides the fact that disks could be corrupted.
This commit fixes the behavior by returning all errors, which
are, in any case, Object API errors.
Add missing protection from deleting multiple objects
in parallel. Currently we are deleting objects without
proper locking through this API.
This can cause significant amount of races.
Ignore a disk which wasn't able to successfully perform an action, to
avoid eventual perturbations when the disk comes back in the middle
of an ongoing write.
This removal comes to avoid some redundant requirements
which are adding more problems on a production setup.
Here are the list of checks for time as they happen
- Fresh connect (during server startup) - CORRECT
- A reconnect after network disconnect - CORRECT
- For each RPC call - INCORRECT.
Verifying the time for each RPC aggravates a situation
where an RPC call is rejected in a sequence of events
due to high load on a production setup. A 3-second
window might not be enough for the call to be
initiated and received by the server.
Currently we document as IP:PORT which doesn't provide
if someone can use HOSTNAME:PORT. This is a change
to clarify this by calling it as ADDRESS:PORT which
encompasses both a HOSTNAME and an IP.
Fixes #3799
This PR is a readability cleanup:
- rename getOrderedDisks to shuffleDisks
- rename getOrderedPartsMetadata to shufflePartsMetadata
The distribution is now a second argument instead of being the
primary input argument, for brevity.
Also change the usage of the type-cast int64(0); instead
rely on a direct type declaration, `var variable int64`, everywhere.
Existing objects are renamed to a temp location in
completeMultipart before being overwritten. We make sure
that we delete the temp object even if subsequent calls fail.
Additionally, move verifying whether the parent dir is a
file earlier, to fail the entire operation sooner.
Ref #3784
Content-Encoding set to "aws-chunked" is an S3-specific
API value which has no meaning for an object. This is how S3
behaves as well for a streaming-signature uploaded object.
Make sure to skip reserved bucket names in `ListBuckets()`
current code didn't skip this properly and also generalize
this behavior for both XL and FS.
This is an attempt to clean up the code and keep the top-level config
functions simpler and easier to understand, while moving the
notifier-related code and the logger setter/getter methods into
their own structs.
Locks are now held properly, not globally by configMutex, but
instead as private variables.
Final fix for #3700
Also changes the behavior of `secretKeyHash`, which does
not need to be sent over the network; each node
has its own secretKeyHash to validate.
Fixes #3696
Partial fix for #3700 (more changes needed with some code cleanup)
Currently the auth RPC client defaults to a maximum
cap of a 30-second timeout. Make this configurable
by the caller of authRPCClient during initialization; if no
such config is provided then default to 30 seconds.
Ideally, if the interface is not found it should
fail the server, as it should be, because without these
we can't even have a working server in the first place.
Just like how it fails in master, invariably inside the Go
net/http code path.
Fixes #3708
Network: total bytes of the server's incoming and outgoing data,
by taking advantage of our ConnMux Read/Write wrapping.
HTTP: total number of the different HTTP verbs passed in HTTP
requests and the different status codes passed in HTTP responses.
This is counted in a new HTTP handler.
Resource strings and paths are case-insensitive on windows
deployments, but if a user happens to use upper case instead of
lower case for certain configuration params, like bucket
policies and bucket notification config, we might not honor
them, which leads to wrong behavior on windows.
This is windows-only behavior; for all other platforms case
is still kept sensitive.
Avoid passing size = -1 to PutObject API by requiring content-length
header in POST request (as AWS S3 does) and in Upload web handler.
Post handler is modified to completely store multipart file to know
its size before sending it to PutObject().
Following is a sample list lock API request schematic,
/?lock&bucket=mybucket&prefix=myprefix&duration=holdDuration
x-minio-operation: list
The response would contain the list of locks held on mybucket matching
myprefix for a duration longer than holdDuration.
The current implementation didn't honor quorum properly and didn't
handle the generated errors properly. This patch addresses that
and also moves the common code `cleanupMultipartUploads` into an
xl-specific private function.
Fixes #3665
On macOS, if a process already listens on 127.0.0.1:PORT, net.Listen() falls back
to the IPv6 address, i.e. minio will start listening on the IPv6 address whereas
another (non-)minio process is listening on the IPv4 address of the given port.
To avoid this error situation we check for port availability, but only on macOS.
Note: checkPortAvailability() tries to listen on the given port and closes it.
It is possible to have a disconnected client in this tiny window of time.
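A minimal sketch of the bind-then-close check described in the note (hypothetical helper of the same name):
```
package main

import "net"

// checkPortAvailability binds the port and frees it again immediately;
// a racing client may connect in the tiny window before Close.
func checkPortAvailability(port string) error {
	l, err := net.Listen("tcp", net.JoinHostPort("", port))
	if err != nil {
		return err // port already taken on some interface
	}
	return l.Close()
}
```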
Creds don't require secretKeyHash to be calculated
every time; cache it instead and re-use it.
This is an optimization for bcrypt.
Relevant results from the benchmark done locally, negative
value means improvement in this scenario.
```
benchmark old ns/op new ns/op delta
BenchmarkAuthenticateNode-4 160590992 80125647 -50.11%
BenchmarkAuthenticateWeb-4 160556692 80432144 -49.90%
benchmark old allocs new allocs delta
BenchmarkAuthenticateNode-4 87 75 -13.79%
BenchmarkAuthenticateWeb-4 87 75 -13.79%
benchmark old bytes new bytes delta
BenchmarkAuthenticateNode-4 15222 9785 -35.72%
BenchmarkAuthenticateWeb-4 15222 9785 -35.72%
```
An external test that runs cmd.Main() has difficulty setting cmd arguments
and MINIO_{ACCESS,SECRET}_KEY values; this commit changes the current
behavior a little in a way that helps external tests.
Encode the path of the passed presigned url before calculating the signature. This fixes
presigning objects whose names contain characters that are found encoded in urls.
* Implement heal format REST API handler
* Implement admin peer rpc handler to re-initialize storage
* Implement HealFormat API in pkg/madmin
* Update pkg/madmin API.md to incl. HealFormat
* Added unit tests for ReInitDisks rpc handler and HealFormatHandler
For TLS peekProtocol, do not assume the incoming request to be a TLS
connection; perform a handshake() instead and validate.
Also add some security-related defaults to `tls.Config`.
This restriction has lots of side effects; since
we do not have a mechanism to clear states like
this, it is better not to keep them.
Network errors are common and can occur with
simple cable removal etc. Since we already have
a retry mechanism, this error count and stateful
nature can bring problems on a long-running
cluster.
This is a consolidation effort, avoiding usage
of naked strings in codebase. Whenever possible
use constants which can be repurposed elsewhere.
This also fixes `goconst ./...` reported issues.
`principalId`, i.e. the user identity, is kept as the AccessKey in
accordance with the S3 spec.
Additionally responseElements{} are added:
`x-amz-request-id` is the hexadecimal representation of the event time itself in nanoseconds.
`x-minio-origin-server` - points to the server generating the event.
Fixes #3556
URL paths can be empty and not have preceding separator,
we do not yet know the conditions this can happen inside
Go http server.
This patch is to ensure that we do not crash ourselves
under conditions where r.URL.Path may be empty.
Fixes #3553
A client sends escaped characters in the values of some query parameters in a presigned url.
This commit properly unescapes queries to fix signature calculation.
Golang HTTP client automatically detects content-type but
for S3 clients this content-type might be incorrect or
might misbehave.
For example:
```
Content-Type: text/xml; charset=utf-8
```
Should be
```
Content-Type: application/xml
```
Allow this to be set properly.
* Filter lock info based on bucket, prefix and time since lock was held
* Implement list and clear locks REST API
* madmin: Add list and clear locks API
* locks: Clear locks matching bucket, prefix, relTime.
* Gather lock information across nodes for both list and clear locks admin REST API.
* docs: Add lock API to management APIs
* Rename GenericArgs to AuthRPCArgs
* Rename GenericReply to AuthRPCReply
* Remove authConfig.loginMethod and add authConfig.ServiceName
* Rename loginServer to AuthRPCServer
* Rename RPCLoginArgs to LoginRPCArgs
* Rename RPCLoginReply to LoginRPCReply
* Version and RequestTime are added to LoginRPCArgs and verified by
server side, not client side.
* Fix data race in lockMaintainence loop.
This patch uses a technique where retryable storage,
before object layer initialization, has a higher delay
and waits for a longer period, up to 4 times, with a time
unit of seconds.
It uses another set of configuration after the disks
have been formatted, i.e. a lower retry backoff rate,
retrying only once per 5 milliseconds.
The network IO error count is reduced to a lower value, i.e. 256,
before we reject the disk completely. This is done so that the
combination of retry logic and total error count roughly
comes to around 2.5 secs, which is when we basically take the
disk offline completely.
NOTE: This patch doesn't fix the issue of what if the disk
is completely dead and comes back again after the initialization.
Such a mutating state requires a change in our startup sequence
which will be done subsequently. This is an interim fix to alleviate
users from these issues.
Implement a storage rpc specific rpc client,
which does not reconnect unnecessarily.
Instead reconnect is handled at a different
layer for storage alone.
Rest of the calls using AuthRPC automatically
reconnect, i.e upon an error equal to `rpc.ErrShutdown`
they dial again and call the requested method again.
Attempt a reconnect also if the disk is not found.
This is needed since any network operation error
is converted to disk not found, but we also need
to make sure the disk is really unavailable.
Additionally we also need to retry more than
once because the server might be in its startup
sequence, which would render other servers to
wrongly think that the server is offline.
This is written to simplify our handler code
and provide a way to update only the metadata, instead of
the data, when the source and destination in a CopyObject
request are the same.
Fixes #3316
- Add a lockStat type to group counters
- Remove unnecessary helper functions
- Fix stats computation on force unlock
- Removed unnecessary checks and cleaned up comments
This is to utilize an optimized version of the
sha256 checksum which @fwessels implemented.
blake2b lacks such optimizations on the ARM platform;
this can give us a significant boost in performance.
blake2b on ARM64, as expected, is slower.
```
BenchmarkSize1K-4 30000 44015 ns/op 23.26 MB/s
BenchmarkSize8K-4 5000 335448 ns/op 24.42 MB/s
BenchmarkSize32K-4 1000 1333960 ns/op 24.56 MB/s
BenchmarkSize128K-4 300 5328286 ns/op 24.60 MB/s
```
sha256 on ARM64 is faster by orders of magnitude giving close to
AVX performance of blake2b.
```
BenchmarkHash8Bytes-4 1000000 1446 ns/op 5.53 MB/s
BenchmarkHash1K-4 500000 3229 ns/op 317.12 MB/s
BenchmarkHash8K-4 100000 14430 ns/op 567.69 MB/s
BenchmarkHash1M-4 1000 1640126 ns/op 639.33 MB/s
```
ObjectLayer GetObject() now returns the entire object
if the starting offset is 0 and the length is negative. This
also allows us to simplify handler layer code where
we always had to use GetObjectInfo() before proceeding
to read bucket metadata files, for example `policy.json`.
This also removes one additional call's overhead.
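As a hedged illustration of this convention (the signature below is simplified, not the actual ObjectLayer definition):
```go
package objlayer

import (
	"bytes"
	"io"
)

// GetObjectFn mirrors the convention described above: startOffset 0 with
// a negative length means "read the whole object". Simplified signature.
type GetObjectFn func(bucket, object string, startOffset, length int64, w io.Writer) error

// readConfig reads an entire metadata file such as policy.json without
// a prior GetObjectInfo call to learn its size.
func readConfig(getObject GetObjectFn, bucket, object string) ([]byte, error) {
	var buf bytes.Buffer
	// offset 0, length -1 => entire object
	if err := getObject(bucket, object, 0, -1, &buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```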
success_action_redirect in the submitted form means that the server needs to return 303 along with a well-formed redirection URL; this commit adds this feature.
This is important in a distributed setup, where the server hosting the
first disk formats a fresh setup. Sorting ensures that all servers
arrive at the same 'first' server.
Note: This change doesn't protect against different disk arguments
with some disks being same across servers.
Previously, when more than one goroutine called RPCClient.dial(),
each goroutine got a new rpc.Client but only one such client was
stored in the RPCClient object. This led to leaky connections on
the server side. This is fixed by taking a lock at the top of
dial() and releasing it on return.
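A minimal sketch of that fix, assuming a simplified RPCClient shape:
```go
package rpcclient

import (
	"net/rpc"
	"sync"
)

// RPCClient sketch: dial() holds a mutex for its whole duration, so
// concurrent callers cannot each create (and leak) an rpc.Client.
type RPCClient struct {
	mu   sync.Mutex
	addr string
	rpc  *rpc.Client
}

func (c *RPCClient) dial() (*rpc.Client, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// a concurrent caller may have connected while we waited on the lock
	if c.rpc != nil {
		return c.rpc, nil
	}
	client, err := rpc.DialHTTP("tcp", c.addr)
	if err != nil {
		return nil, err
	}
	c.rpc = client
	return c.rpc, nil
}
```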
There was an error in how we validated disk formats:
if one of the disks was formatted with FS, it
would cause confusion and the object layer would never
initialize, essentially going into an infinite loop.
Validate pre-emptively and also check for the FS format
properly.
This is implemented so that the issues like in the
following flow don't affect the behavior of operation.
```
GetObjectInfo()
.... --> Time window for mutation (no lock held)
.... --> Time window for mutation (no lock held)
GetObject()
```
This happens when two simultaneous uploads are made
to the same object: the object may return wrong
info to the client.
Another classic example is "CopyObject" API itself
which reads from a source object and copies to
destination object.
Fixes #3370 Fixes #2912
FS/Multipart: Fix race between PutObjectPart and Complete/Abort multipart. close(timeoutCh) on complete/abort so that a racing PutObjectPart does not leave a dangling go-routine.
Fixes #3351
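The signaling pattern, roughly (names here are hypothetical):
```go
package multipart

import "time"

// startBackgroundAppend sketches the signaling pattern: complete/abort
// calls close(doneCh), so the background goroutine exits instead of
// dangling past cleanup of the part files.
func startBackgroundAppend(doneCh <-chan struct{}, appendPart func()) {
	go func() {
		select {
		case <-time.After(time.Second):
			appendPart()
		case <-doneCh:
			// complete/abort closed doneCh; stop without appending
			return
		}
	}()
}
```
The caller owns `done := make(chan struct{})` and calls `close(done)` exactly once from complete/abort; a receive on a closed channel returns immediately, which is what unblocks any racing goroutine.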
This change brings in changes at multiple places
- Reuse buffers at almost all locations ranging
from rpc, fs, xl, checksum etc.
- Change caching behavior to disable itself
under low memory conditions i.e < 8GB of RAM.
- Only objects cached are of size 1/10th the size
of the cache for example if 4GB is the cache size
the maximum object size which will be cached
is going to be 400MB. This change is an
optimization to cache more objects rather
than few larger objects.
- If object cache is enabled the default GC
percent has been reduced to 20% in light
of newly observed GC behavior. If cache
utilization reaches 75% of the maximum value
the GC percent is reduced to 10% to make GC
more aggressive.
- Do not use *bytes.Buffer* due to its growth
requirements. For every allocation *bytes.Buffer*
allocates an additional buffer for its internal
purposes. This is undesirable for us, so we
implemented a new cappedWriter which is capped to a
desired size; beyond this all writes are rejected.
Possible fix for #3403.
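A sketch of what such a cappedWriter might look like; the sentinel error name is an assumption:
```go
package objcache

import "errors"

// ErrCapExceeded is a hypothetical sentinel for this sketch.
var ErrCapExceeded = errors.New("write exceeds cap")

// cappedWriter: a fixed-capacity writer that, unlike bytes.Buffer,
// never grows (and thus never over-allocates); writes beyond the
// cap are rejected.
type cappedWriter struct {
	buf []byte // allocated once; len tracks bytes written
	cap int
}

func newCappedWriter(capacity int) *cappedWriter {
	return &cappedWriter{buf: make([]byte, 0, capacity), cap: capacity}
}

func (w *cappedWriter) Write(p []byte) (int, error) {
	if len(w.buf)+len(p) > w.cap {
		return 0, ErrCapExceeded
	}
	w.buf = append(w.buf, p...)
	return len(p), nil
}
```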
- This is to ensure that any new config references made to the
serverConfig are also backed by a mutex lock.
- Otherwise any new config assignment would also replace the member
mutex which is currently used for safe access.
The EOF error message in the Peek Protocol is shown when a client
closes the connection in the middle of the peek protocol; this
commit hides it since it doesn't make sense to show it.
Current code appends to a file only if 1 byte or
more was sent on the wire; this was affecting both PutObject
and PutObjectPart uploads.
This patch fixes such a situation and resolves #3385
backgroundAppend type's abort method should wait for appendParts to finish
writing ongoing appending of parts in the background before cleaning up
the part files.
Previously minio server expected content-length-range values
to be integers in JSON. However Amazon S3 handles
content-length-range values as both integers and strings.
This patch adds support for string values.
Since we moved the reconnection logic out of net-rpc-client.go
we should do it properly from the top layer, and bring back
the code to reconnect in case the connection is lost.
setGlobalsFromContext() is added to set global variables after parsing
command line arguments. Thus, global flags will be honored wherever
they are placed in minio command.
Make sure all S3 signature requests are not re-directed
to `/minio`. This should be only done for JWT and some
Anonymous requests.
This also fixes a bug found from https://github.com/bji/libs3
```
$ s3 -u list
ERROR: XmlParseFailure
```
Now after this fix shows proper output
```
$ s3 -u list
Bucket Created
-------------------------------------------------------- --------------------
andoria 2016-11-27T08:19:06Z
```
It would make sense to enable the logger just after config
initialisation. That way, errorIf() and fatalIf() will be usable
and can catch errors like invalid access and secret key errors.
This patch brings in changes from miniobrowser repo.
- Bucket policy UI and functionality fixes by @krishnasrinivas
- Bucket policy implementation by @balamurugana
- UI changes and new functionality changing password etc. @rushenn
- UI and new functionality for sharing URLs, deleting files
@rushenn and @krishnasrinivas.
- Other misc fixes by @vadmeste @brendanashworth
This is needed to validate if the `format.json` indeed exists
when a fresh node is brought online.
This wrapped implementation also connects to the remote node
by attempting a re-login. Subsequently after a successful
connect `format.json` is validated as well.
Fixes #3207
logrus is not helping us to set different log formats (json/text)
for different loggers. Now, we create different logrus instances
and call them in errorIf and fatalIf.
This change adds more richer error response
for JSON-RPC by interpreting object layer
errors to corresponding meaningful errors
for the web browser.
```go
&json2.Error{
	Message: "Bucket Name Invalid, Only lowercase letters, full stops, and numbers are allowed.",
}
```
Additionally this patch also allows PresignedGetObject()
to take expiry parameter to have variable expiry.
XL multipart fails to remove tmp files when an error occurs during upload; this case covers the scenario where an upload is canceled manually by the client in the middle of the job.
The content-length-range policy in the postPolicy API was
not working properly; this handles it. The reflection
strategy used has changed in recent versions of Go:
any free-form interface{} holding an integer is treated
as `float64`. This caused a bug where content-length-range
parsing failed to provide any value.
Fixes #3295
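A sketch of the conversion this implies; the helper name is an assumption:
```go
package postpolicy

import (
	"fmt"
	"strconv"
)

// toIntegerSketch converts a decoded JSON value: json.Unmarshal stores
// any JSON number in an interface{} as float64, and S3 also permits
// strings here, so both forms must be handled explicitly rather than
// asserting directly to an integer type.
func toIntegerSketch(v interface{}) (int64, error) {
	switch n := v.(type) {
	case float64:
		return int64(n), nil
	case string:
		return strconv.ParseInt(n, 10, 64)
	default:
		return 0, fmt.Errorf("content-length-range: unsupported type %T", v)
	}
}
```
With this, both `["content-length-range", 1024, 10485760]` and the string form `["content-length-range", "1024", "10485760"]` yield usable integers.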
This is needed as explained by @krisis
Lets say we have following errors.
```
[]error{nil, errFileNotFound, errDiskAccessDenied, errDiskAccessDenied}
```
Since the last two errors are filtered, the maximum is nil,
depending on map order.
Let's say we get nil from reduceErrs. Clearly at this point
we don't have quorum nodes agreeing about the data, and since
GetObject only requires N/2 (read quorum), isDiskQuorum
would have returned true. This is problematic and can lead to
undesirable consequences.
Fixes #3298
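A sketch of quorum-aware reduction along these lines; the helper and sentinel names are assumptions:
```go
package erasure

import "errors"

// errQuorumUnreached is a hypothetical sentinel for this sketch.
var errQuorumUnreached = errors.New("quorum not reached")

func containsErr(list []error, err error) bool {
	for _, e := range list {
		if e == err {
			return true
		}
	}
	return false
}

// reduceErrsSketch counts each distinct error (nil included) and returns
// the most frequent one only if it reaches quorum, instead of letting
// map iteration order pick a winner among the survivors.
func reduceErrsSketch(errs []error, ignored []error, quorum int) error {
	counts := make(map[error]int)
	for _, err := range errs {
		if containsErr(ignored, err) {
			continue // e.g. errDiskAccessDenied is filtered out
		}
		counts[err]++
	}
	var maxErr error
	maxCount := 0
	for err, n := range counts {
		if n > maxCount {
			maxCount, maxErr = n, err
		}
	}
	if maxCount >= quorum {
		return maxErr
	}
	return errQuorumUnreached
}
```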
When disks are offline for a long period of time, we should
ignore a disk after trying Login up to 5 times.
This is to reduce network chattiness; it also reduces
the overall time spent on `net.Dial`.
Fixes #3286
For binary releases on all operating systems it would be:
```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Minio is 25 days 12 hours 30 minutes old ┃
┃ Update: https://dl.minio.io/server/minio/release/linux-amd64/minio ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
```
On Docker:
```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Minio is 25 days 12 hours 32 minutes old ┃
┃ Update: docker pull minio/minio ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
```
In a situation where we have lots of buckets the bootup time
might slow down a bit, and during this time servers quickly
going up and down would be in an in-transit state.
Certain calls which do not use quorum, like `readXLMetaStat`,
might return an error saying `errDiskNotFound` in place of the
expected `errFileNotFound`, which leads to an issue
where the server doesn't start.
To avoid this situation we need to treat these errors as safe
to ignore; for the most part they are network related errors.
Fixes #3275
Also fix test to not use a bucket name with a leading slash - this
causes the bucket name to become empty and go to an unintended API
call (listbuckets).
This patch fixes a possible bug, reproduced rarely and
seen only once.
```
panic: runtime error: index out of range
goroutine 136 [running]:
panic(0xac1a40, 0xc4200120b0)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/minio/minio/vendor/github.com/minio/dsync.lock.func1(0xc4203d2240, 0x4, 0xc420474080, 0x4, 0x4, 0xc4202abb60, 0x0, 0xa86d01, 0xefcfc0, 0xc420417a80)
/go/src/github.com/minio/minio/vendor/github.com/minio/dsync/drwmutex.go:170 +0x69b
created by github.com/minio/minio/vendor/github.com/minio/dsync.lock
/go/src/github.com/minio/minio/vendor/github.com/minio/dsync/drwmutex.go:191 +0xf4
```
Ref #3229
After review with @abperiasamy we decided to remove all the unnecessary options
- MINIO_BROWSER (implemented as a security feature but now deemed obsolete,
since even with browser access blocked, the S3 API port is open)
- MINIO_CACHE_EXPIRY (defaults to 72h)
- MINIO_MAXCONN (no one used this option and we don't test it)
- MINIO_ENABLE_FSMETA (enable FSMETA all the time)
Remove the --ignore-disks option - it was introduced when the XL layer
would initialize the backend disks and heal them automatically, in order
to disallow XL from accidentally using the root partition itself.
This behavior has changed: XL no longer automatically initializes
`format.json`, and HEAL is a controlled activity, so ignore-disks is not
useful anymore. This change also addresses the problems of our
documentation going forward and keeps things simple. This patch brings a
reduction of options, defaulting them to valid known inputs, and also
serves as a guideline for limiting the many ways to do the same thing.
rpcClient should attempt a reconnect if a call fails
with 'rpc.ErrShutdown'; this is needed since at times
servers are taken down and brought back up, and
the hijacked connection from net.Dial is usually closed.
So upon a first attempt rpcClient might falsely indicate that
the disk is down; to avoid this state make another dial attempt
to confirm a real failure.
Fixes #3206 Fixes #3205
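Roughly, the retry-on-shutdown flow; the dialer interface below stands in for the mutex-guarded dial() sketched earlier and is an assumption:
```go
package rpcclient

import "net/rpc"

// dialer abstracts connection management; reset discards a dead client
// so the next dial reconnects.
type dialer interface {
	dial() (*rpc.Client, error)
	reset()
}

// call retries exactly once on rpc.ErrShutdown, since a server restart
// closes the hijacked connection and the first call afterwards always
// fails even though the peer may be healthy again.
func call(d dialer, method string, args, reply interface{}) error {
	client, err := d.dial()
	if err != nil {
		return err
	}
	if err = client.Call(method, args, reply); err == rpc.ErrShutdown {
		d.reset()
		if client, err = d.dial(); err != nil {
			return err
		}
		err = client.Call(method, args, reply)
	}
	return err
}
```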
- abstract out instrumentation information.
- use separate lockInstance type that encapsulates the nsMutex, volume,
path and opsID as the frontend or top-level lock object.
This is done by not making the methods of the BucketMetaState interface
as methods (via type nesting) on the type implementing
RPCs (s3PeerAPIHandlers).
- Adds an interface to update in-memory bucket metadata state called
BucketMetaState - this interface has functions to:
- update bucket notification configuration,
- bucket listener configuration,
- bucket policy configuration, and
- send bucket event
- This interface is implemented by `localBMS` a type for manipulating
local node in-memory bucket metadata, and by `remoteBMS` a type for
manipulating remote node in-memory bucket metadata.
- The remote node interface, makes an RPC call, but the local node
interface does not - it updates in-memory bucket state directly.
- Rename mkPeersFromEndpoints to makeS3Peers and refactored it.
- Use a slice instead of a map in the s3Peers struct
- `s3Peers.SendUpdate` now receives a slice of peer indexes to
send the request to, with a special nil value indicating that
all peers should be sent the update.
- `s3Peers.SendUpdate` now returns a slice of errors, representing
errors from peers when sending an update. The positions
correspond to the peer array s3Peers.peers
Improve globalS3Peers:
- Make isDistXL a global `globalIsDistXL` and remove from s3Peers
- Make globalS3Peers an array of (address, bucket-meta-state) pairs.
- Fix code and tests.
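A minimal sketch of the BucketMetaState interface described above; method names and signatures are illustrative, not the actual definitions:
```go
package peers

// BucketMetaState groups the in-memory bucket metadata updates a node
// must support. Argument types are simplified for this sketch.
type BucketMetaState interface {
	UpdateBucketNotification(bucket string, cfg []byte) error
	UpdateBucketListener(bucket string, cfg []byte) error
	UpdateBucketPolicy(bucket string, cfg []byte) error
	SendEvent(bucket string, event []byte) error
}

// localBMS would update in-memory state directly, while remoteBMS would
// make an RPC call to its peer; both satisfy the same interface, so
// callers need not care which node holds the bucket metadata.
type localBMS struct{}
type remoteBMS struct{ addr string }
```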
Default golang net.Listen only listens on the first IP when
the host resolves to multiple IPs.
This change addresses the problem when, for example, your
``/etc/hosts`` has entries as follows
```
127.0.1.1 minio1
192.168.1.10 minio1
```
Trying to start minio as
```
minio server --address "minio1:9001" ~/Photos
```
Causes the minio server to be bound only to "127.0.1.1", which
is incorrect behavior since we are generally interested in
`192.168.1.10` as well.
This patch addresses the issue: if the hostname is resolvable,
and resolution gives back a list of addresses associated with
it, we bind to all of them, as that is the expected behavior.
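A sketch of the resolve-then-bind-all approach; the function name is hypothetical:
```go
package server

import "net"

// listenOnAllIPs resolves the host once and binds one listener per
// returned address, instead of letting net.Listen pick only the first
// resolved IP.
func listenOnAllIPs(host, port string) ([]net.Listener, error) {
	addrs, err := net.LookupHost(host)
	if err != nil {
		return nil, err
	}
	var listeners []net.Listener
	for _, addr := range addrs {
		l, err := net.Listen("tcp", net.JoinHostPort(addr, port))
		if err != nil {
			// close whatever we already opened before bailing out
			for _, prev := range listeners {
				prev.Close()
			}
			return nil, err
		}
		listeners = append(listeners, l)
	}
	return listeners, nil
}
```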
- The benchmark initialization function was not taking into account the
instance type (FS/XL), was using XL ObjectLayer even for FS
benchmarks.
- This was leading to incorrect benchmark results for FS related
benchmarks.
- The fix takes into account the instance type (FS/XL) and correctly
returns FS backend for FS benchmarks.
Do not attempt to fetch volume/drive information on
every I/O operation. In our case we were doing this on all
calls in `posix.go`, which in turn created a terrible situation
on Windows. This issue does not affect the I/O path on Unix
platforms since statvfs calls are in the range of
microseconds there.
This verification is only needed during startup, and we
let things fail at a later stage on Windows.
- Reads and writes of uploads.json in XL now uses quorum for
newMultipart, completeMultipart and abortMultipart operations.
- Each disk's `uploads.json` file is read and updated independently for
adding or removing an upload id from the file. Quorum is used to
decide if the high-level operation actually succeeded.
- Refactor FS code to simplify the flow, and fix a bug while reading
uploads.json.
For command line arguments we are currently following
- <node-1>:/path ... <node-n>:/path
This patch changes this to
- http://<node-1>/path ... http://<node-n>/path
In FS or single-node XL mode, there is no need to save listener
configuration to persistent storage. As there is only one server, if it
is restarted, any connected listenBucketAPI clients were disconnected
and will have to reconnect - so there is nothing to actually store.
This incidentally solves #3052 by avoiding the problem.
- When modifying notification configuration
- When modifying listener configuration
- When modifying policy configuration
With this change we also stop early checking if the bucket exists, since
that uses a Read-lock and causes a deadlock due to the outer Write-lock.
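The deadlock in miniature, using a plain sync.RWMutex as a stand-in for the namespace lock:
```go
package bucketmeta

import "sync"

// Go's sync.RWMutex is not reentrant: taking a read lock while the same
// goroutine holds the write lock blocks forever. The namespace lock in
// question behaves analogously, hence dropping the nested exists-check.
func deadlockExample() {
	var mu sync.RWMutex
	mu.Lock() // outer write lock, taken to modify the bucket config
	defer mu.Unlock()

	mu.RLock() // nested "does the bucket exist?" check: blocks forever
	mu.RUnlock()
}
```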
opsID, a variable on the stack, changes over the course of
the CompleteMultipartUpload function in xl-v1-multipart.go. It was
being used in a function closure passed to a defer
statement. A closure evaluates its captured variables when it
runs, not when it is declared, so it is incorrect to depend
on the values of stack variables as they stand at the end of the
function, when deferred functions are executed.
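The gotcha in miniature:
```go
package multipart

import "fmt"

// A deferred closure reads opsID when it finally runs (at function
// return), not when the defer statement is declared.
func closureCapture() {
	opsID := "op-1"

	// buggy: closes over the variable; sees whatever opsID holds at return
	defer func() {
		fmt.Println("unlock", opsID) // prints "unlock op-2"
	}()

	// correct: the argument is evaluated at defer time, binding "op-1"
	defer func(id string) {
		fmt.Println("unlock", id) // prints "unlock op-1"
	}(opsID)

	opsID = "op-2" // mutated later in the function body
}
```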