Commit Graph

12038 Commits

Author SHA1 Message Date
Klaus Post
3415c4dd1e
Fix reconnected deadlock with full queue (#19964)
When a reconnection happens, `handleMessages` must be able to complete and exit. 

This can be prevented in a full queue.

Deadlock chain (May 10th release)

```
1 @ 0x44110e 0x453125 0x109f88c 0x109f7d5 0x10a472c 0x10a3f72 0x10a34ed 0x4795e1
#	0x109f88b	github.com/minio/minio/internal/grid.(*Connection).send+0x3eb			github.com/minio/minio/internal/grid/connection.go:548
#	0x109f7d4	github.com/minio/minio/internal/grid.(*Connection).queueMsg+0x334		github.com/minio/minio/internal/grid/connection.go:586
#	0x10a472b	github.com/minio/minio/internal/grid.(*Connection).handleAckMux+0xab		github.com/minio/minio/internal/grid/connection.go:1284
#	0x10a3f71	github.com/minio/minio/internal/grid.(*Connection).handleMsg+0x231		github.com/minio/minio/internal/grid/connection.go:1211
#	0x10a34ec	github.com/minio/minio/internal/grid.(*Connection).handleMessages.func1+0x6cc	github.com/minio/minio/internal/grid/connection.go:1019

---> blocks ---> via (Connection).handleMsgWg

1 @ 0x44110e 0x454165 0x454134 0x475325 0x486b08 0x10a161a 0x10a1465 0x2470e67 0x7395a9 0x20e61af 0x20e5f1f 0x7395a9 0x22f781c 0x7395a9 0x22f89a5 0x7395a9 0x22f6e82 0x7395a9 0x22f49a2 0x7395a9 0x2206e45 0x7395a9 0x22f4d9c 0x7395a9 0x210ba06 0x7395a9 0x23089c2 0x7395a9 0x22f86e9 0x7395a9 0xd42582 0x2106c04
#	0x475324	sync.runtime_Semacquire+0x24								runtime/sema.go:62
#	0x486b07	sync.(*WaitGroup).Wait+0x47								sync/waitgroup.go:116
#	0x10a1619	github.com/minio/minio/internal/grid.(*Connection).reconnected+0xb9			github.com/minio/minio/internal/grid/connection.go:857
#	0x10a1464	github.com/minio/minio/internal/grid.(*Connection).handleIncoming+0x384			github.com/minio/minio/internal/grid/connection.go:825
```

Add a queue cleaner in reconnected that will pop old messages so `handleMessages` can 
send messages without blocking and exit appropriately for the connection to be re-established.

Messages are likely dropped by the remote, but we may have some that can succeed, 
so we only drop when running out of space.
2024-06-20 16:11:40 -07:00
Shireesh Anjal
e200808ab7
fix errors in metrics code on macos (#19965)
- do not load proc fs metrics in case of macos
- null-check TimeStat before accessing
2024-06-20 10:55:03 -07:00
Klaus Post
fae563b85d
Add fixed timed restarts to updates (#19960) 2024-06-20 07:49:22 -07:00
Klaus Post
3e6dc02f8f
Add actual inline data to JSON output in xl-meta (#19958)
Add the inlined data as base64 encoded field and try to add a string version if feasible.

Example:

```
λ xl-meta -data xl.meta
{
  "8e03504e-1123-4957-b272-7bc53eda0d55": {
    "bitrot_valid": true,
    "bytes": 58,
    "data_base64": "Z29sYW5nLm9yZy94L3N5cyB2MC4xNS4wIC8=",
    "data_string": "golang.org/x/sys v0.15.0 /"
}
```

The string will have quotes, newlines escaped to produce valid JSON.

If content isn't valid utf8 or the encoding otherwise fails, only the base64 data will be added.

`-export` can still be used separately to extract the data as files (including bitrot).
2024-06-20 07:46:44 -07:00
Anis Eleuch
95e4cbbfde
Do not ping event targets during cluster initialization (#19959)
S3 operations are frozen during startup, therefore we should avoid pinging
event targets during the initialization since it can stall.
2024-06-20 07:46:02 -07:00
Harshavardhana
2825294b7b
allow server startup to come online with READ success (#19957) 2024-06-19 22:21:31 -07:00
Sveinn
bce93b5cfa
Removing timeout on shutdown (#19956) 2024-06-19 11:42:47 -07:00
Harshavardhana
7a4b250c8b
avoid waiting for quorum health while debugging (#19955) 2024-06-19 10:12:20 -07:00
Anis Eleuch
e5335450a4
test: Healing test to avoid infinite waiting for servers to be up (#19954)
tests: Healing test to avoid infinite waiting for servers to be up

Quit after 15 minutes and print server logs instead
2024-06-19 09:00:38 -07:00
Klaus Post
a6ffdf1dd4
Do not block on distributed unlocks (#19952)
* Prevents blocking when losing quorum (standard on cluster restarts).
* Time out to prevent endless buildup. Timed-out remote locks will be canceled because they miss the refresh anyway.
* Reduces latency for all calls since the wall time for the roundtrip to remotes no longer adds to the requests.
2024-06-19 07:35:19 -07:00
Harshavardhana
69e41f87ef
compute localIPs only once per server startup() (#19951)
repeatedly calling this function is not necessary,
on systems with lots of interfaces, including virtual
ones can make this reasonably delayed.
2024-06-19 07:34:00 -07:00
Harshavardhana
ee48f9f206
perform healthchecks before initializing everything fully (#19953)
adds more informative logs that provide details on which
erasure set is losing quorum etc.
2024-06-19 07:33:40 -07:00
Sveinn
9ba39d7fad
Removing a channel that was not being used (#19948) 2024-06-19 01:59:39 -07:00
Harshavardhana
d2fb371f80
do not need response record body (#19949)
since the connection is active, the
response recorder body can grow endlessly
causing leak, as this bytes buffer is
never given back to GC due to an goroutine.
2024-06-19 01:59:21 -07:00
Klaus Post
2f9018f03b
Do regular checks for healing status while scanning (#19946) 2024-06-18 09:11:04 -07:00
Harshavardhana
eb990f64a9 update pkger to use v2.3.1 2024-06-17 22:48:41 -07:00
Harshavardhana
bbb64eaade
skip healing properly in the scanner when a drive is hotplugged (#19939)
skip healing properly in scanner when drive is hotplugged

due to how the state is passed around the SkipHealing
might not be the true state() of the system always, causing
a situation where we might healing from the scanner on the
same drive which is being. Due to this competing heals get
triggered that slow each other down.
2024-06-17 16:39:11 -07:00
Harshavardhana
7bd1d899bc
remove overzealous check during HEAD() (#19940)
due to a historic bug in CopyObject() where
an inlined object loses its metadata, the
check causes an incorrect fallback verifying
data-dir.

CopyObject() bug was fixed in ffa91f9794 however
the occurrence of this problem is historic, so
the aforementioned check is stretching too much.

Bonus: simplify fileInfoRaw() to read xl.json as well,
also recreate buckets properly.
2024-06-17 07:29:18 -07:00
Harshavardhana
c91d1ec2e3
fix: avoid metadata cache without data for all callers (#19935) 2024-06-14 06:28:35 -07:00
Minio Trusted
c50b64027d Update yaml files to latest version RELEASE.2024-06-13T22-53-53Z 2024-06-14 05:40:03 +00:00
Cesar N
20960b6a2d
Update console to v1.6.0 (#19933) 2024-06-13 15:53:53 -07:00
Shubhendu
3bd3470d0b
Corrected names of node replication metrics (#19932)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-06-13 15:26:54 -07:00
Harshavardhana
ba39ed9af7
loadUser() if not able to load() credential return error (#19931) 2024-06-13 15:26:38 -07:00
jiuker
62e6dc950d
fix: do not update metadata cache upon headObject() (#19929) 2024-06-13 08:42:02 -07:00
Harshavardhana
5a5046ce45
upgrade all deps and credits (#19930)
Signed-off-by: Harshavardhana <harsha@minio.io>
2024-06-13 08:34:20 -07:00
Klaus Post
ad04afe381
Fix SSEC multipart checksum replication (#19915)
* Multipart SSEC checksums were not transferred.
* Remove key mismatch logging. This key is user-controlled with SSEC.
* If the source is SSEC and the destination reports ErrSSEEncryptedObject, 
  assume replication is good.
2024-06-12 23:56:12 -07:00
Harshavardhana
ba9f0f2480
fix: attempt to fix CI/CD upgrade tests with docker-compose (#19926) 2024-06-12 22:08:11 -07:00
Harshavardhana
d06b63d056
load credential for in-flights requests as singleflight (#19920)
avoid concurrent callers for LoadUser() to even initiate
object read() requests, if an on-going operation is in progress.

this avoids many callers hitting the drives causing I/O
spikes, also allows for loading credentials faster.
2024-06-12 13:47:56 -07:00
Andreas Auernhammer
7ce28c3b1d
kms: use GetClientCertificate callback for KES API keys (#19921)
This commit fixes an issue in the KES client configuration
that can cause the following error when connecting to KES:
```
ERROR Failed to connect to KMS: failed to generate data key with KMS key: tls: client certificate is required
```

The Go TLS stack seems to not send a client certificate if it
thinks the client certificate cannot be validated by the peer.
In case of an API key, we don't care about this since we use
public key pinning and the X.509 certificate is just a transport
encoding.

The `GetClientCertificate` seems to be honored always such that
this error does not occur.

Signed-off-by: Andreas Auernhammer <github@aead.dev>
2024-06-12 07:31:26 -07:00
Harshavardhana
e3ac4035b9
decrement requests inqueue correctly after the request is processed (#19918) 2024-06-12 01:13:12 -07:00
Harshavardhana
d21b6daa49
fix: avoid crash when delete() returns an error in batch expiration (#19909) 2024-06-11 06:50:53 -07:00
Minio Trusted
76ebb16688 Update yaml files to latest version RELEASE.2024-06-11T03-13-30Z 2024-06-11 06:11:10 +00:00
Harshavardhana
55aa431578
fix: on windows avoid ':' as part of the object name (#19907)
fixes #18865
avoid-colon
2024-06-10 20:13:30 -07:00
Harshavardhana
614981e566
allow purge expired STS while loading credentials (#19905)
the reason for this is to avoid STS mappings to be
purged without a successful load of other policies,
and all the credentials only loaded successfully
are properly handled.

This also avoids unnecessary cache store which was
implemented earlier for optimization.
2024-06-10 11:45:50 -07:00
Harshavardhana
b8b956a05d add changes to Makefile to support dev build 2024-06-10 10:41:02 -07:00
Klaus Post
d2eed44c78
Fix replication checksum transfer (#19906)
Compression will be disabled by default if SSE-C is specified. So we can still honor SSE-C.
2024-06-10 10:40:33 -07:00
Anis Eleuch
789cbc6fb2
heal: Dangling check to evaluate object parts separately (#19797) 2024-06-10 08:51:27 -07:00
jiuker
0662c90b5c
fix: copyObject restore with a specific version, update test cases (#19895) 2024-06-10 08:50:49 -07:00
Klaus Post
a2cab02554
Fix SSE-C checksums (#19896)
Compression will be disabled by default if SSE-C is specified. So we can still honor SSE-C.
2024-06-10 08:31:51 -07:00
Harshavardhana
6c7a21df6b
turn-off unexpected debug logging in List() calls (#19903) 2024-06-09 21:34:26 -07:00
Ali Afsharzadeh
f933b0b708
Upgrade setup-helm action from v3 to v4 (#19897) 2024-06-09 02:13:09 -07:00
Shubhendu
9f305273a7
Added tests for replication of checksum headers (#19879)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-06-08 09:24:15 -07:00
Cesar N
cbd9efcb43
Update console to v1.5.0 (#19899) 2024-06-08 01:12:00 -07:00
Harshavardhana
29a25a538f
fix: make sure we list freeVersions like DEL marker with --versions (#19878)
freeVersions() was being incorrectly skipped; list it as
valid objects properly.

Co-authored-by: Krishnan Parthasarathi <Krishnan Parthasarathi>
2024-06-07 15:18:44 -07:00
Harshavardhana
2dd8faaedc remove unnecessary log in Listing() 2024-06-07 14:52:55 -07:00
Klaus Post
f00187033d
Two way streams for upcoming locking enhancements (#19796) 2024-06-07 08:51:52 -07:00
Aditya Manthramurthy
c5141d65ac
Update docker build script to pull all changes (#19892) 2024-06-07 08:43:38 -07:00
Krishnan Parthasarathi
069c4015cd
Don't tier directory objects (#19891)
Directory objects are used by applications that simulate the folder
structure of an on-disk filesystem. These are zero-byte objects with names
ending with '/'. They are only used to check whether a 'folder' exists in
the namespace.
2024-06-07 08:43:17 -07:00
Shubhendu
2f6e03fb60
Calculate correct object size while replication (#19888)
It was missing in case of `replicateObject` but was present for
`replicateAll` already

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-06-06 12:31:01 -07:00
Klaus Post
0fbb945e13
Disable caching of encrypted objects (#19890)
Don't write encrypted objects to cache, if configured.
2024-06-06 11:39:18 -07:00