fix: allow certain mutation on objects during decommission
currently by mistake deletion of objects was skipped,
if the object resided on the pool being decommissioned.
delete's are okay to be allowed since decommission is
designed to run on a cluster with active I/O.
Small uploads spend a significant amount of time (~5%) fetching disk info metrics. Also maps are allocated for each call.
Add a 100ms cache to disk metrics.
versioned buckets were not creating the delete markers
present in the versioned stack of an object, this essentially
would stop decommission to succeed.
This PR fixes creating such delete markers properly during
a decommissioning process, adds tests as well.
Current code incorrectly passed the
config asset object name while decommissioning,
make sure that we pass the right object name
to be hashed on the newer set of pools.
This PR fixes situations after a successful
decommission, the users and policies might go
missing due to wrong hashed set.
also use designated names for internal
calls
- storageREST calls are storageR
- lockREST calls are lockR
- peerREST calls are just peer
Named in this fashion to facilitate wildcard matches
by having prefixes of the same name.
Additionally, also enable funcNames for generic handlers
that return errors, currently we disable '<unknown>'
In a replicated setup, when an object is updated in one cluster but
still waiting to be replicated to the other cluster, GET requests with
if-match, and range headers will likely fail. It is better to proxy
requests instead.
Also, this commit avoids printing verbose logs about precondition &
range errors.
fix: change timedvalue to return previous cached value
caller can interpret the underlying error and decide
accordingly, places where we do not interpret the
errors upon timedValue.Get() - we should simply use
the previously cached value instead of returning "empty".
Bonus: remove some unused code
Add a generic handler that adds a new tracing context to the request if
tracing is enabled. Other handlers are free to modify the tracing
context to update information on the fly, such as, func name, enable
body logging etc..
With this commit, requests like this
```
curl -H "Host: ::1:3000" http://localhost:9000/
```
will be traced as well.
Directories markers are not healed when healing a new fresh disk. A
a proper fix would be moving object names encoding/decoding to erasure
object level but it is too late now since the object to set distribution is
calculated at a higher level.
It is observed in a local 8 drive system the CPU seems to be
bottlenecked at
```
(pprof) top
Showing nodes accounting for 1385.31s, 88.47% of 1565.88s total
Dropped 1304 nodes (cum <= 7.83s)
Showing top 10 nodes out of 159
flat flat% sum% cum cum%
724s 46.24% 46.24% 724s 46.24% crypto/sha256.block
219.04s 13.99% 60.22% 226.63s 14.47% syscall.Syscall
158.04s 10.09% 70.32% 158.04s 10.09% runtime.memmove
127.58s 8.15% 78.46% 127.58s 8.15% crypto/md5.block
58.67s 3.75% 82.21% 58.67s 3.75% github.com/minio/highwayhash.updateAVX2
40.07s 2.56% 84.77% 40.07s 2.56% runtime.epollwait
33.76s 2.16% 86.93% 33.76s 2.16% github.com/klauspost/reedsolomon._galMulAVX512Parallel84
8.88s 0.57% 87.49% 11.56s 0.74% runtime.step
7.84s 0.5% 87.99% 7.84s 0.5% runtime.memclrNoHeapPointers
7.43s 0.47% 88.47% 22.18s 1.42% runtime.pcvalue
```
Bonus changes:
- re-use transport for bucket replication clients, also site replication clients.
- use 32KiB buffer for all read and writes at transport layer seems to help
TLS read connections.
- Do not have 'MaxConnsPerHost' this is problematic to be used with net/http
connection pooling 'MaxIdleConnsPerHost' is enough.
- Always reformat all disks when a new disk is detected, this will
ensure new uploads to be written in new fresh disks
- Always heal all buckets first when an erasure set started to be healed
- Use a lock to prevent two disks belonging to different nodes but in
the same erasure set to be healed in parallel
- Heal different sets in parallel
Bonus:
- Avoid logging errUnformattedDisk when a new fresh disk is inserted but
not detected by healing mechanism yet (10 seconds lag)
site replication errors were printed at
various random locations, repeatedly - this
PR attempts to remove double logging and
capture all of them at a common place.
This PR also enhances the code to show
partial success and errors as well.
`mc admin heal -r <alias>` in a multi setup pools returns incorrectly
grey objects. The reason is that erasure-server-pools.HealObject() runs
HealObject in all pools and returns the result of the first nil
error. However, in the lower erasureObject level, HealObject() returns
nil if an object does not exist + missing error in each disk of the object
in that pool, therefore confusing mc.
Make erasureObject.HealObject() to return not found error in the lower
level, so at least erasureServerPools will know what pools to ignore.
`config.ResolveConfigParam` returns the value of a configuration for any
subsystem based on checking env, config store, and default value. Also returns info
about which config source returned the value.
This is useful to return info about config params overridden via env in the user
APIs. Currently implemented only for OpenID subsystem, but will be extended for
others subsequently.
If sending a white space during a long S3 handler call fails,
the whitespace goroutine forgets to return a result to the caller.
Therefore, the complete multipart handler will be blocked.
Remember to send the header written result to the caller
or/and close the channel.
- currently subnet health check was freezing and calling
locks at multiple locations, avoid them.
- throw errors if first attempt itself fails with no results
Erasure SD DeleteObjects() is only inheriting bucket versioning status
from the handler layer.
Add the missing versioning prefix evaluation for each object that will
deleted.
PR #15052 caused a regression, add the missing metrics back.
Bonus:
- internode information should be only for distributed setups
- update the dashboard to include 4xx and 5xx error panels.
this allows for customers to use `mc admin service restart`
directly even when performing RPM, DEB upgrades. Upon such 'restart'
after upgrade MinIO will re-read the /etc/default/minio for any
newer environment variables.
As long as `MINIO_CONFIG_ENV_FILE=/etc/default/minio` is set, this
is honored.
Currently minio_s3_requests_errors_total covers 4xx and
5xx S3 responses which can be confusing when s3 applications
sent a lot of HEAD requests with obvious 404 responses or
when the replication is enabled.
Add
- minio_s3_requests_4xx_errors_total
- minio_s3_requests_5xx_errors_total
to help users monitor 4xx and 5xx HTTP status codes separately.
peerOnlineCounter was making NxN calls to many peers, this
can be really long and tedious if there are random servers
that are going down.
Instead we should calculate online peers from the point of
view of "self" and return those online and offline appropriately
by performing a healthcheck.
* Add periodic callhome functionality
Periodically (every 24hrs by default), fetch callhome information and
upload it to SUBNET.
New config keys under the `callhome` subsystem:
enable - Set to `on` for enabling callhome. Default `off`
frequency - Interval between callhome cycles. Default `24h`
* Improvements based on review comments
- Update `enableCallhome` safely
- Rename pctx to ctx
- Block during execution of callhome
- Store parsed proxy URL in global subnet config
- Store callhome URL(s) in constants
- Use existing global transport
- Pass auth token to subnetPostReq
- Use `config.EnableOn` instead of `"on"`
* Use atomic package instead of lock
* Use uber atomic package
* Use `Cancel` instead of `cancel`
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
PR #15041 fixed replicating 'null' version however
due to a regression from #14994 caused the target
versions for these 'null' versioned objects to have
different 'versions', this may cause confusion with
bi-directional replication and cause double replication.
This PR fixes this properly by making sure we replicate
the correct versions on the objects.
mergeEntryChannels has the potential to perpetually
wait on the results channel, context might be closed
and we did not honor the caller context canceling.
The S3 service can be frozen indefinitely if a client or mc asks for object
perf API but quits early or has some networking issues. The reason is
that partialWrite() can block indefinitely.
This commit makes partialWrite() listens to context cancellation as well. It
also renames deadlinedCtx to healthCtx since it covers handler context
cancellation and not only not only the speedtest deadline.
In a streaming response, the client knows the size of a streamed
message but never checks the message size. Add the check to error
out if the response message is truncated.
Indexed streams would be decoded by the legacy loader if there
was an error loading it. Return an error when the stream is indexed
and it cannot be loaded.
Fixes "unknown minor metadata version" on corrupted xl.meta files and
returns an actual error.
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.
In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.
Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
readAllXL would return inlined data for outdated disks
causing "read" to return incorrect content to the client,
this PR fixes this behavior by making sure we skip such
outdated disks appropriately based on the latest ModTime
on the disk.
Main motivation is move towards a common backend format
for all different types of modes in MinIO, allowing for
a simpler code and predictable behavior across all features.
This PR also brings features such as versioning, replication,
transitioning to single drive setups.
Following code can reproduce an unending go-routine buildup,
while keeping connections established due to lack of client
not closing the connections.
https://gist.github.com/harshavardhana/2d00e6f909054d2d2524c71485ad02e1
Without this PR all MinIO deployments can be put into
denial of service attacks, causing entire service to be
unavailable.
We bring in two timeouts at this stage to control such
go-routine build ups, new change
- IdleTimeout (to kill off idle connections)
- ReadHeaderTimeout (to kill off connections that are too slow)
This new change also brings two hidden options to make any
additional relevant changes if desired in some setups.
It would seem like the PR #11623 had chewed more
than it wanted to, non-fips build shouldn't really
be forced to use slower crypto/sha256 even for
presumed "non-performance" codepaths. In MinIO
there are really no "non-performance" codepaths.
This assumption seems to have had an adverse
effect in certain areas of CPU usage.
This PR ensures that we stick to sha256-simd
on all non-FIPS builds, our most common build
to ensure we get the best out of the CPU at
any given point in time.
- Adds an STS API `AssumeRoleWithCustomToken` that can be used to
authenticate via the Id. Mgmt. Plugin.
- Adds a sample identity manager plugin implementation
- Add doc for plugin and STS API
- Add an example program using go SDK for AssumeRoleWithCustomToken
this PR also fixes a situation where incorrect
partsMetadata slice was used where fi.Data was
re-used from a single drive causing duplication
of the shards across all drives.
This happens for situations where shouldHeal()
returns true for all drives > parityBlocks.
To avoid this we should never attempt to heal on all
drives > parityBlocks, unless we are doing metadata
migration from xl.json -> xl.meta
If one or more pools reach 85% usage in a set, we will only
use pools that have more free space.
In case all pools are above 85% we allow all of them to be used
with the regular distribution.
When a server pool with a different number of sets is added they are
not compensated when choosing a destination pool for new objects.
This leads to the unbalanced placement of objects with smaller pools
getting a bigger number of objects since we only compare the destination
sets directly.
This change will compensate for differences in set sizes when choosing
the destination pool.
Different set sizes are already compensated by fewer disks.
updating metadata with CopyObject on a versioned bucket
causes the latest version to be not readable, this PR fixes
this properly by handling the inline data bug fix introduced
in PR #14780.
This bug affects only inlined data.
* Do not use inline data size in xl.meta quorum calculation
Data shards of one object can different inline/not-inline decision
in multiple disks. This happens with outdated disks when inline
decision changes. For example, enabling bucket versioning configuration
will change the small file threshold.
When the parity of an object becomes low, GET object can return 503
because it is not unable to calculate the xl.meta quorum, just because
some xl.meta has inline data and other are not.
So this commit will be disable taking the size of the inline data into
consideration when calculating the xl.meta quorum.
* Add tests for simulatenous inline/notinline object
Co-authored-by: Anis Elleuch <anis@min.io>
current implementation relied on recursively calling one bucket
at a time across all peers, this would be very slow and chatty
when there are 100's of buckets which would mean 100*peerCount
amount of network operations.
This PR attempts to reduce this entire call into `peerCount`
amount of network calls only. This functionality addresses also a
concern where the Prometheus metrics would significantly slow
down when one of the peers is offline.