Add a new endpoint for "resource" metrics `/v2/metrics/resource`
This should return system metrics related to drives, network, CPU and
memory. Except for drives, each metric should also carry corresponding
"avg" and "max" values.
Reuse the real-time metrics feature to capture the required data,
introducing CPU and memory metrics into it.
Collect the data every minute and keep updating the average and max values
accordingly, returning the latest values when the API is called.
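For illustration, a minimal sketch of how a metric could maintain current, average, and max values from a once-a-minute sample (the type and field names are hypothetical, not the actual response schema):

```go
// Hypothetical per-metric tracking; the real /v2/metrics/resource
// schema and types may differ.
type ResourceMetric struct {
	Current float64 `json:"current"`
	Avg     float64 `json:"avg"`
	Max     float64 `json:"max"`

	sum   float64 // running sum used to derive the average
	count uint64  // number of samples collected so far
}

// update is called once a minute with a fresh sample.
func (m *ResourceMetric) update(v float64) {
	m.Current = v
	m.count++
	m.sum += v
	m.Avg = m.sum / float64(m.count)
	if v > m.Max {
		m.Max = v
	}
}
```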
This PR changes the handling of bucket deletes for site
replicated setups to hold on to deleted bucket state until
it syncs to all the clusters participating in site replication.
A huge number of goroutines would build up from various monitors.
When creating test filesystems, provide a context so they can shut down when no longer needed.
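A minimal sketch of the pattern (the one-minute interval and monitorOnce are illustrative stand-ins): each monitor goroutine is bound to the filesystem's context and exits when it is cancelled.

```go
// Monitor goroutine bound to the test filesystem's lifetime; it exits
// instead of leaking once the context is cancelled.
go func() {
	t := time.NewTicker(time.Minute)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return // filesystem torn down, stop monitoring
		case <-t.C:
			monitorOnce() // hypothetical periodic work
		}
	}
}()
```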
It is observed on a local 8-drive system that the CPU is
bottlenecked at:
```
(pprof) top
Showing nodes accounting for 1385.31s, 88.47% of 1565.88s total
Dropped 1304 nodes (cum <= 7.83s)
Showing top 10 nodes out of 159
      flat  flat%   sum%        cum   cum%
      724s 46.24% 46.24%       724s 46.24%  crypto/sha256.block
   219.04s 13.99% 60.22%    226.63s 14.47%  syscall.Syscall
   158.04s 10.09% 70.32%    158.04s 10.09%  runtime.memmove
   127.58s  8.15% 78.46%    127.58s  8.15%  crypto/md5.block
    58.67s  3.75% 82.21%     58.67s  3.75%  github.com/minio/highwayhash.updateAVX2
    40.07s  2.56% 84.77%     40.07s  2.56%  runtime.epollwait
    33.76s  2.16% 86.93%     33.76s  2.16%  github.com/klauspost/reedsolomon._galMulAVX512Parallel84
     8.88s  0.57% 87.49%     11.56s  0.74%  runtime.step
     7.84s   0.5% 87.99%      7.84s   0.5%  runtime.memclrNoHeapPointers
     7.43s  0.47% 88.47%     22.18s  1.42%  runtime.pcvalue
```
Bonus changes:
- re-use the transport for bucket replication clients as well as site
replication clients (see the sketch after this list).
- use a 32KiB buffer for all reads and writes at the transport layer; this
seems to help TLS read performance.
- Do not set 'MaxConnsPerHost'; it is problematic when combined with net/http
connection pooling. 'MaxIdleConnsPerHost' is enough.
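A sketch of such a shared transport (the idle-connection count is an assumed value; the buffer sizes follow the 32KiB note above):

```go
// Shared transport sketch; values are illustrative, not MinIO's exact settings.
tr := &http.Transport{
	// Connection pooling: MaxIdleConnsPerHost is sufficient on its own;
	// MaxConnsPerHost is deliberately left unset.
	MaxIdleConnsPerHost: 16,
	// 32KiB buffers for all reads and writes at the transport layer.
	WriteBufferSize: 32 << 10,
	ReadBufferSize:  32 << 10,
}
// Re-used by both bucket replication and site replication clients.
client := &http.Client{Transport: tr}
```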
This allows customers to use `mc admin service restart`
directly even when performing RPM or DEB upgrades. Upon such a 'restart'
after an upgrade, MinIO will re-read /etc/default/minio for any
newer environment variables.
As long as `MINIO_CONFIG_ENV_FILE=/etc/default/minio` is set, this
is honored.
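For example, a minimal /etc/default/minio might look like the following (values are placeholders; `MINIO_CONFIG_ENV_FILE` itself is typically set in the service unit's environment, pointing at this file):

```
MINIO_VOLUMES="/mnt/data{1...4}"
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=change-me
```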
We need to make sure that if we cannot read bucket metadata
for some reason, and the bucket metadata is not missing but
returns corrupted information, we panic in such
handlers to disallow I/O and protect the overall state
of the system.
In case of such corruption we now have a mechanism
to force-recreate the metadata on the bucket, using the
`x-minio-force-create` header with the `PUT /bucket` API
call.
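In sketch form, the header rides on a standard bucket-creation request (request signing is omitted; the endpoint, bucket name, and header value are illustrative assumptions):

```go
// Sketch only: real requests need AWS Signature V4 authentication.
req, err := http.NewRequest(http.MethodPut, "https://minio.example.com/mybucket", nil)
if err != nil {
	return err
}
req.Header.Set("x-minio-force-create", "true") // value is an assumption
resp, err := http.DefaultClient.Do(req)
if err != nil {
	return err
}
defer resp.Body.Close()
```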
Additionally, fix the versioning-config updated state
to be set properly, so that site replication healing
triggers correctly.
The main motivation is to move towards a common backend format
for all the different modes in MinIO, allowing for
simpler code and predictable behavior across all features.
This PR also brings features such as versioning, replication,
and transitioning to single-drive setups.
The following code can reproduce an unending goroutine buildup,
while keeping connections established, because the client
never closes them.
https://gist.github.com/harshavardhana/2d00e6f909054d2d2524c71485ad02e1
Without this PR, all MinIO deployments can be subjected to
denial-of-service attacks, causing the entire service to
become unavailable.
We bring in two timeouts at this stage to control such
goroutine build-ups (sketched below):
- IdleTimeout (to kill off idle connections)
- ReadHeaderTimeout (to kill off connections that are too slow)
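A minimal sketch of the two timeouts on an http.Server (the address, handler, and durations are placeholders, not MinIO's defaults):

```go
// Placeholder values throughout; only the two timeout fields are the point.
srv := &http.Server{
	Addr:              ":9000",
	Handler:           handler,
	IdleTimeout:       30 * time.Second, // kill off idle connections
	ReadHeaderTimeout: 30 * time.Second, // kill off connections that are too slow
}
log.Fatal(srv.ListenAndServe())
```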
This new change also brings two hidden options to make any
additional relevant changes if desired in some setups.
This PR simply adds a warning message when MinIO detects older kernel
versions and warns users about potential performance issues on such
kernels.
The issue can be seen only with parallel I/O across all drives
on denser setups such as 90 drives or 45 drives per server configurations.
startup speed-up: currently getFormatErasureInQuorum()
would spend up to 2-3 seconds when there are 3000+ drives
in a setup, for example; simplify this implementation
to use drive counts.
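A simplified sketch of the counting approach (the Format type and quorum rule are illustrative, not the actual implementation):

```go
// Pick the format a majority of drives agree on by counting,
// instead of deep-comparing every format struct pairwise.
type Format struct{ ID string } // stand-in for MinIO's format.json contents

func formatInQuorum(formats []*Format) (*Format, bool) {
	counts := make(map[string]int)
	for _, f := range formats {
		if f != nil {
			counts[f.ID]++
		}
	}
	for _, f := range formats {
		if f != nil && counts[f.ID] > len(formats)/2 {
			return f, true
		}
	}
	return nil, false
}
```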
this helps in caching the resolved values early on, and avoids
further resolution for individual nodes when the
object layer comes online.
This can speed up our startup time during upgrades etc. by
an order of magnitude.
additional changes in connectLoadInitFormats() parallelize
all calls that might potentially block.
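The shape of that parallelization, sketched with golang.org/x/sync/errgroup (connectAndLoadFormat is a hypothetical stand-in for the blocking per-endpoint calls):

```go
// Run potentially blocking per-endpoint calls in parallel.
var g errgroup.Group
for _, ep := range endpoints {
	ep := ep // capture loop variable (pre-Go 1.22)
	g.Go(func() error {
		return connectAndLoadFormat(ep) // hypothetical stand-in
	})
}
if err := g.Wait(); err != nil {
	return err
}
```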
Large clusters with multiple sets, or multi-pool setups, at times might
fail and report unexpected "file not found" errors. This can become
a problem during the startup sequence when some files need to be created
at multiple locations.
- This PR ensures that we nil the erasure writers so that they
are skipped in the RenameData() call.
- RenameData() doesn't need "Access()" calls for `.minio.sys`
folders; they always exist.
- Make sure PutObject() never returns ObjectNotFound{} for any
errors; it should always return "WriteQuorum" when renameData()
fails with ObjectNotFound{}. Return appropriate errors for all
other cases (see the sketch below).
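In sketch form (the function signature and error type names are illustrative):

```go
// Never let renameData()'s ObjectNotFound{} escape PutObject();
// surface it as a write-quorum error instead.
if err := renameData(ctx, bucket, object); err != nil { // hypothetical signature
	if _, ok := err.(ObjectNotFound); ok {
		return InsufficientWriteQuorum{} // illustrative name for the "WriteQuorum" error
	}
	return err // all other cases map to their appropriate errors
}
```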
It is possible that GetLock() remembers a previously
failed releaseAll() when there are networking issues; this
state can have potential side effects.
This PR tries to avoid this side effect by making sure
to initialize NewNSLock() for each GetLock() attempt
made, so that no prior state in memory can
interfere with the new lock grants.
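In outline (the locker interface here is simplified):

```go
// A fresh NSLock per attempt means a previously failed releaseAll()
// cannot leak state into this grant. Simplified signatures throughout.
lk := ns.NewNSLock(bucket, object)
lkCtx, err := lk.GetLock(ctx, timeout)
if err != nil {
	return err // nothing from this failed attempt survives for the next one
}
defer lk.Unlock(lkCtx)
```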