Commit Graph

6363 Commits

Author SHA1 Message Date
Harshavardhana
bc527eceda
handle the actualSize() properly for PostUpload() (#20422)
postUpload() incorrectly saves the actual size as '-1';
we should save the correct size when it's possible.

Bonus: fix the PutObjectPart() write locker, instead
of holding a lock before we read the client stream.

We should hold it only when we need to commit the parts.
2024-09-11 11:35:37 -07:00
Anis Eleuch
b963f36e1e
fix: Add missing grid handler of clearing upload-id from the cache (#20420) 2024-09-11 09:09:13 -07:00
Poorna
cdd7512a2e
use rename() safety for in-place 'xl.meta' updates (#20414) 2024-09-11 09:08:51 -07:00
Shubhendu
0b7aa6af87
Skip non existent ldap entities while import (#20352)
Don't hard error for nonexistent LDAP entries; instead of logging them,
report them via `mc`

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-09-09 09:59:28 -07:00
Harshavardhana
8c9ab85cfa
Add multipart uploads cache for ListMultipartUploads() (#20407)
this cache will be honored only when `prefix=""` while
performing ListMultipartUploads() operation.

This is mainly to satisfy applications like alluxio
for their underfs implementation and tests.

replaces https://github.com/minio/minio/pull/20181
2024-09-09 09:58:30 -07:00
Klaus Post
b1c849bedc
Don't send a canceled context to Unlock (#20409)
AFAICT we send a canceled context to unlock (and thereby releaseAll). This will cause network calls to fail.

Instead use background and add 30s timeout.
2024-09-09 08:49:49 -07:00
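The idea in b1c849bedc (unlock with a background context plus a 30s timeout instead of the caller's canceled context) can be sketched roughly as below. This is only an illustration, not MinIO's code: `releaseLock`, `unlockDetached`, and `canceledContext` are hypothetical names, and `releaseLock` merely simulates a network call that fails when its context is already done.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// unlockTimeout mirrors the 30s value mentioned in the commit message.
const unlockTimeout = 30 * time.Second

// releaseLock is a hypothetical stand-in for a network unlock call.
// It fails immediately if the context it receives is already done.
func releaseLock(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("unlock aborted: %w", err)
	}
	return nil
}

// unlockDetached deliberately ignores the caller's (possibly canceled)
// context and unlocks with a background context bounded by unlockTimeout.
func unlockDetached(_ context.Context) error {
	ctx, cancel := context.WithTimeout(context.Background(), unlockTimeout)
	defer cancel()
	return releaseLock(ctx)
}

// canceledContext returns a context that has already been canceled,
// simulating a request whose client went away.
func canceledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

func main() {
	reqCtx := canceledContext()
	fmt.Println("request ctx: ", releaseLock(reqCtx))
	fmt.Println("detached ctx:", unlockDetached(reqCtx))
}
```

The timeout keeps a detached unlock from hanging forever if a peer is unreachable.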
Klaus Post
9d5cdaa2e3
Limit Response Recorder memory (#20399)
Disable body recording for...

* admin inspect
* admin metrics
* profiling download

Also, if the recorded body is > 10MB, drop it.
2024-09-07 12:16:04 -07:00
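A minimal sketch of the "drop the recorded body once it exceeds the limit" rule from 9d5cdaa2e3 follows. The `bodyRecorder` type and its fields are hypothetical and are not MinIO's response recorder; the limit is taken as 10 MiB here, which is an assumption about how "10MB" is measured.

```go
package main

import (
	"bytes"
	"fmt"
)

// maxRecordedBody approximates the 10MB limit from the commit message
// (assumed here to mean 10 MiB).
const maxRecordedBody = 10 << 20

// bodyRecorder is a hypothetical recorder that keeps a copy of the bytes
// written to it until the limit is exceeded, then discards everything.
type bodyRecorder struct {
	limit   int
	buf     bytes.Buffer
	dropped bool
}

// Write records p unless the body was already dropped or p would push the
// recorded size past the limit; it always reports p as fully consumed.
func (r *bodyRecorder) Write(p []byte) (int, error) {
	if r.dropped {
		return len(p), nil
	}
	if r.buf.Len()+len(p) > r.limit {
		r.buf = bytes.Buffer{} // release the memory held so far
		r.dropped = true
		return len(p), nil
	}
	return r.buf.Write(p)
}

// Body returns the recorded bytes, or nil if the body was dropped.
func (r *bodyRecorder) Body() []byte {
	if r.dropped {
		return nil
	}
	return r.buf.Bytes()
}

func main() {
	r := &bodyRecorder{limit: maxRecordedBody}
	r.Write([]byte("small response"))
	fmt.Printf("recorded %d bytes\n", len(r.Body()))
	r.Write(make([]byte, maxRecordedBody))
	fmt.Println("dropped:", r.Body() == nil)
}
```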
Taran Pelkey
84e122c5c3
Fix duplicate groups in ListGroups API (#20396) 2024-09-06 17:28:47 -07:00
Harshavardhana
64e803b136
fix: avoid waiting on rebalance metadata (#20392)
rebalance metadata is only good to have;
if it cannot be loaded when starting MinIO
for some reason, we can ignore it,
move on, and let the user start rebalance
again if needed.
2024-09-06 06:20:19 -07:00
Krishnan Parthasarathi
a0f9e9f661
readParts: Return error when quorum unavailable (#20389)
readParts requires that both part.N and part.N.meta files be present.
This change addresses an issue with how the error to return to the upper
layers was picked from most drives where an UploadPart operation
had failed.
2024-09-06 03:51:23 -07:00
Harshavardhana
b6b7cddc9c
make sure listParts returns parts that are valid (#20390) 2024-09-06 02:42:21 -07:00
Harshavardhana
85f08d7752
verify part.N exists before reading part.N.meta (#20383)
if part.N doesn't exist we do not have to complete
the multipart transaction, it simply means that we
have some partial upload situation at hand.
2024-09-05 13:37:19 -07:00
Poorna
060276932d
batch:repl fix copy from source -> remote (#20382)
completes fix started by #20365
2024-09-05 04:57:23 -07:00
Shubhendu
6224849fd1
Dont start console service if MINIO_BROWSER=off (#20374)
By default, even if MINIO_BROWSER=off is set, the code tries to get a free
port for the console.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-09-04 10:02:39 -07:00
Harshavardhana
c2e318dd40
remove mincache EOS related feature from upstream (#20375) 2024-09-03 11:23:41 -07:00
T-TRz879
69258d5945
ignore if-unmodified-since header if if-match is set (#20326) 2024-09-02 23:33:53 -07:00
Mark Theunissen
d7ef6315ae
Store the checksum in PostPolicyHandler so that we can return it on Get/Head (#20364) 2024-09-02 06:13:54 -07:00
Anis Eleuch
aaf4fb1184
batch: repl: A missing prefix in the remote source will fail replication (#20365)
When the prefix field is not provided in the remote source of a yaml
replication job, the code fails to do listing and makes replication
successful. This commit fixes it.
2024-09-02 05:36:43 -07:00
Harshavardhana
7a34c88d73
add consistent nonce to make multipart deterministic per part (#20359)
This change adds a consistent nonce to ensure
that multipart uploads are deterministic on a 
per-part basis.

Thanks to @klauspost for the work here minio/sio@3cd3734
2024-08-31 11:25:48 -07:00
Harshavardhana
6c746843ac
fix: keep locks on same pool for simplicity (#20356)
locks handed out by different pools would not compete with each other for
a multi-object delete request; this is wrong for obvious
reasons.

New locking implementation and revamp will rewrite multi-object
lock anyway, this is a workaround for now.
2024-08-30 19:26:49 -07:00
Harshavardhana
bb07df7e7b
do not list dangling objects with unmatched ECs (#20351)
This mostly applies to all new objects, this simply
ignores these objects and no application would have
to deal with getting 503s on them.
2024-08-30 09:02:26 -07:00
Harshavardhana
504e52b45e
protect bpool from buffer pollution by invalid buffers (#20342) 2024-08-28 18:40:52 -07:00
Anis Eleuch
38c0840834
bucket-metadata: Reload events/repl-targets for all buckets (#20334)
Currently, the bucket events and replication targets are only reloaded
for buckets that failed to load during the first cluster startup,
which is wrong: if a bucket change was made on one node but
that node was not able to notify the other nodes, the other nodes will
reload the bucket metadata config but fail to set the events and bucket
targets in memory.
2024-08-28 08:32:18 -07:00
Harshavardhana
fb2360ff88
when a drive is closed cancel the cleanupTrash goroutine (#20337)
when a hung drive is hot-unplugged, the server might go
into a loop where the previous `format.json` is somehow
still accessible to the process, we try to re-init() drives,
but that seems to cause a previous goroutine to hang around
since it is not canceled away when the drive is closed.

Bonus: add deadline for immediate purge routine, to unblock
it if the drive is blocking mutations.
2024-08-28 08:31:42 -07:00
jiuker
1a2de1bdde
fix: string format when log IAM refresh take over 5s (#20331) 2024-08-26 23:40:33 -07:00
Harshavardhana
af55f37b27
do not fallback on the drives to load groups for LDAP (#20320)
if a user policy is found, avoid reading from the drives
for missing group mappings, group mappings are not mandatory
and conditional.

This PR restores the older behavior while making sure that
if a direct user policy is not found, we would still attempt
to load the group from the drives.
2024-08-25 17:22:45 -07:00
Andreas Auernhammer
2d67c26794
improve multipart decryption (#20324)
This commit simplifies and optimizes the decryption of large (multipart)
objects. This PR does two things:
 
- Re-write the init logic for the decryption reader
- Reduce the number of OEK decryptions

Before, the init logic copied some SSE HTTP request headers to 
parse them later. This is simplified to parsing them right away. This
removes some fields from the decryption reader struct.

Further, the decryption reader decrypted the OEK using the client-provided 
key (SSE-C) or the KMS (SSE-S3 / SSE-KMS) for each part. This is redundant 
since the OEK is the same for all parts. In particular, a KMS call might be a 
network request. Now, the OEK is decrypted once for the entire multipart object.

This should improve latency when reading encrypted multipart objects 
and reduce requests to the KMS.

Signed-off-by: Andreas Auernhammer <github@aead.dev>
2024-08-25 11:07:13 -07:00
Harshavardhana
006cacfefb
to turn-off healing drop legacy ENV (#20315) 2024-08-23 15:43:31 -07:00
bestgopher
c28f09d4a7
refactor: displays the OS-specific doc url (#20313) 2024-08-23 07:11:35 -07:00
Anis Eleuch
73992d2b9f
s3: DeleteBucket to use listing before returning bucket not empty error (#20301)
Use Walk(), which is a recursive listing with versioning, to check if 
the bucket has some objects before being removed. This is beneficial
because the bucket can contain multiple dangling objects in multiple
drives.

Also, this will prevent a bug where a bucket is deleted in a deployment
that has many erasure sets but the bucket contains one or a few objects
not spread to enough erasure sets.
2024-08-22 14:57:20 -07:00
Anis Eleuch
a8f143298f
heal: Reset healing params when a retry is decided (#20285)
Currently, retrying the healing of a new drive does not reset
HealedBuckets, which means that the next healing retry will skip those
buckets. This commit fixes this behavior.

Also, the skipped objects counter will include objects
that are uploaded after the healing is started.
2024-08-22 05:35:43 -07:00
jiuker
2d44c161c7
fix: support export bucket policy with ExportBucketMetadata (#20308) 2024-08-22 03:44:35 -07:00
Mark Theunissen
fb4ad000b6
support parseObjectAttributes to handle multiple header values (#20295) 2024-08-21 14:13:59 -07:00
shandongzhejiang
a8ff12bc72
chore: fix some comments (#20294)
Signed-off-by: shandongzhejiang <shandongzhejiang@icloud.com>
2024-08-21 13:14:24 -07:00
jiuker
1e1bd3afd9
use io.NopCloser replace closeWrapper (#20287) 2024-08-21 05:20:54 -07:00
Anis Eleuch
7b239ae154
sftp: Fix operations with a internal service account (#20293)
sftp sends local requests to the S3 port while passing the session token
header when the account corresponds to a service account. However, this
is not permitted and will throw an error: "The security token included in the
request is invalid"

This commit will avoid passing the session token to the upper layer that
initializes MinIO client to avoid this error.
2024-08-20 13:00:29 -07:00
Anis Eleuch
85c3db3a93
heal: Add finished flag to .healing.bin to avoid removing this latter (#20250)
Sometimes, we need historical information in .healing.bin, such as the
number of expired objects that the healing skips, which can
create a drive usage disparity in the same erasure set. For that reason,
this commit will not remove .healing.bin anymore and it will have a new
field called Finished so we know healing is finished on that drive.
2024-08-20 08:42:49 -07:00
Mark Theunissen
6378ca10a4
kms.ListKeys returns CreatedBy/CreatedAt when information is available (#20223) 2024-08-17 23:43:03 -07:00
Harshavardhana
72cff79c8a
add missing STS accounts loading (#20279)
PR #20268 missed loading STS accounts map properly
2024-08-16 18:24:54 -07:00
Harshavardhana
a5702f978e
remove requests deadline, instead just reject the requests (#20272)
Additionally set

 - x-ratelimit-limit
 - x-ratelimit-remaining

To indicate the request rates.
2024-08-16 01:43:49 -07:00
Poorna
4687c4616f
try loading temp account if not in cache (#20266) 2024-08-15 23:12:42 -07:00
Harshavardhana
cc0c41d216
remove region locks and make them simpler (#20268)
- single flight approach is now optional, instead of default.
- parallelize the loaders up to 32 items per asset (more room for improvement possible)
2024-08-15 08:41:03 -07:00
Klaus Post
f1302c40fe
Fix uninitialized replication stats (#20260)
Services are unfrozen before `initBackgroundReplication` is finished. This means that 
the globalReplicationStats write is racy. Switch to an atomic pointer.

Provide the `ReplicationPool` with the stats, so it doesn't have to be grabbed 
from the atomic pointer on every use.

All other loads are nil-checked, and calls return empty values when stats 
still haven't been initialized.
2024-08-15 05:04:40 -07:00
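The pattern described in f1302c40fe (publish the stats through an atomic pointer and make readers tolerate a nil value) might look roughly like this. `replStats` and `statsHolder` are hypothetical names used only for illustration; the real types in MinIO differ.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// replStats is a hypothetical, simplified stats object.
type replStats struct {
	queued int64
}

// Queued is safe to call on a nil receiver and returns an empty value
// while the stats have not been initialized yet.
func (s *replStats) Queued() int64 {
	if s == nil {
		return 0
	}
	return s.queued
}

// statsHolder publishes the stats through an atomic pointer so that
// readers never race with the goroutine that initializes them.
type statsHolder struct {
	p atomic.Pointer[replStats]
}

// Get returns the current stats, or nil before initialization.
func (h *statsHolder) Get() *replStats { return h.p.Load() }

// Set publishes a fully constructed stats object.
func (h *statsHolder) Set(s *replStats) { h.p.Store(s) }

func main() {
	var h statsHolder
	fmt.Println("before init:", h.Get().Queued())
	h.Set(&replStats{queued: 5})
	fmt.Println("after init: ", h.Get().Queued())
}
```

A component that needs the stats on every operation can be handed the pointer once at construction time, which avoids repeated atomic loads.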
Klaus Post
d96798ae7b
Add support profile deadlines and concurrent operations (#20244)
* Allow a maximum of 10 seconds to start profiling operations.
* Download up to 16 profiles concurrently, but only allow 10 seconds for
  each (does not include write time).
* Add cluster info as the first operation.
* Ignore remote download errors.
* Stop remote profiles if the request is terminated.
2024-08-15 03:36:00 -07:00
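The download policy listed in d96798ae7b (bounded concurrency, a per-download deadline, remote errors ignored) can be sketched as follows. This is an assumption-laden illustration: `fetchAll`, `demoFetch`, and `runDemo` are hypothetical helpers, and the peer names are made up.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	maxConcurrent  = 16               // "up to 16 profiles concurrently"
	perPeerTimeout = 10 * time.Second // "only allow 10 seconds for each"
)

// fetchAll calls fetch for every peer with at most limit calls in flight,
// each bounded by perCall. Failed downloads are skipped, not reported.
func fetchAll(ctx context.Context, peers []string, limit int, perCall time.Duration,
	fetch func(context.Context, string) ([]byte, error)) map[string][]byte {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out = make(map[string][]byte)
		sem = make(chan struct{}, limit) // counting semaphore
	)
	for _, peer := range peers {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit downloads are running
		go func(peer string) {
			defer wg.Done()
			defer func() { <-sem }()
			cctx, cancel := context.WithTimeout(ctx, perCall)
			defer cancel()
			data, err := fetch(cctx, peer)
			if err != nil {
				return // ignore remote download errors
			}
			mu.Lock()
			out[peer] = data
			mu.Unlock()
		}(peer)
	}
	wg.Wait()
	return out
}

// demoFetch is a hypothetical downloader: the peer named "bad" always fails.
func demoFetch(ctx context.Context, peer string) ([]byte, error) {
	if peer == "bad" {
		return nil, errors.New("peer unreachable")
	}
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return []byte("profile from " + peer), nil
}

// runDemo downloads from three made-up peers, one of which fails.
func runDemo() map[string][]byte {
	peers := []string{"node1", "bad", "node2"}
	return fetchAll(context.Background(), peers, maxConcurrent, perPeerTimeout, demoFetch)
}

func main() {
	for peer, data := range runDemo() {
		fmt.Printf("%s: %s\n", peer, data)
	}
}
```

Because every per-download context derives from the request context, canceling the request also stops the outstanding downloads.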
Anis Eleuch
b508264ac4
sr: Avoid recursion when loading site replicator credentials (#20262)
If site replication is enabled and the code tries to extract jwt
claims while the site replication service account credentials are
not loaded yet, the code will enter an infinite loop, resulting in
high CPU usage.

Another possibility of the infinite loop is having some service accounts
created by an old deployment version where the service account JWT was
signed by the root credentials, but not anymore.

This commit will remove the possibility of the infinite loop in the code
and add root credential fallback to extract claims from old service
accounts.
2024-08-14 18:29:20 -07:00
Harshavardhana
db78431b1d
avoid crash when initializing bucket quota cache (#20258) 2024-08-14 17:34:56 -07:00
Klaus Post
3ffeabdfcb
Fix govet+staticcheck issues (#20263)
This is better: https://github.com/golang/go/issues/60529
2024-08-14 10:11:51 -07:00
Anis Eleuch
51b1f41518
heal: Persist MRF queue in the disk during shutdown (#19410) 2024-08-13 15:26:05 -07:00
Harshavardhana
e7a56f35b9
flatten out audit tags, do not send as free-form (#20256)
move away from map[string]interface{} to map[string]string
to simplify the audit, and also provide concise information.

avoids large allocations under load(), reduces the amount
of audit information generated, as the current implementation
was a bit free-form. instead all datastructures must be
flattened.
2024-08-13 15:22:04 -07:00
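To illustrate the move from `map[string]interface{}` to `map[string]string` described in e7a56f35b9, here is one possible flattening scheme. The dotted-key convention and the `flattenTags` helper are assumptions made for this sketch; the commit message does not say how nested values are encoded.

```go
package main

import "fmt"

// flattenTags converts free-form tags into a flat string map.
// Nested maps are folded into dotted keys (an assumed convention).
func flattenTags(in map[string]interface{}) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		flattenInto(out, k, v)
	}
	return out
}

// flattenInto stores v under key, recursing into nested maps.
func flattenInto(out map[string]string, key string, v interface{}) {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, inner := range t {
			flattenInto(out, key+"."+k, inner)
		}
	case string:
		out[key] = t
	default:
		out[key] = fmt.Sprint(t) // numbers, bools, etc. become text
	}
}

func main() {
	tags := map[string]interface{}{
		"objectSize": 1024,
		"replication": map[string]interface{}{
			"status": "PENDING",
		},
	}
	fmt.Println(flattenTags(tags))
}
```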
rubyisrust
516af01a12
chore: fix some function names (#20243)
Signed-off-by: rubyisrust <rustrover@icloud.com>
2024-08-13 11:23:33 -07:00
Harshavardhana
acdb355070
update deps and update azure WARM tier implementation (#20247) 2024-08-13 11:21:34 -07:00
Mark Theunissen
37c02a5f7b
Add dummy DeleteBucketCors for safety (#20253) 2024-08-13 08:25:16 -07:00
Krishnan Parthasarathi
04be352ae9
Relax quorum agreement on DataDir values (#20232)
Previously, we checked if we had a quorum on the DataDir value. 
We are removing this check, which allows reading objects with different 
DataDir values in a few drives (due to a rebalance-stop race bug) 
provided their eTags or ModTimes match.
2024-08-12 12:02:21 -07:00
Klaus Post
53eb7656de
Add admin info timeouts (#20249)
Since a lot of operations load from storage or make remote calls, add a 10-second timeout to each operation.

This should make `mc admin info` return values even under extreme conditions.
2024-08-12 10:24:29 -07:00
Harshavardhana
2e0fd2cba9
implement a safer completeMultipart implementation (#20227)
- optimize writing part.N.meta by writing both part.N
  and its meta in sequence without network component.

- remove part.N.meta, part.N which were partially successful,
  in quorum loss situations during renamePart()

- allow for strict read quorum check arbitrated via ETag
  for the given part number; this makes it doubly safe
  upon final commit.

- return an appropriate error when read quorum is missing,
  instead of returning InvalidPart{}, which is a non-retryable
  error. This kind of situation can happen when many
  nodes are going offline in rotation; an example of such
  restart() behavior is statefulset updates in k8s.

fixes #20091
2024-08-12 01:38:15 -07:00
Harshavardhana
909b169593
avoid source index to be same as destination index (#20238)
during rebalance stop, it can possibly happen that
Put() would race by overwriting the same object again.

If this is done "successfully", it can
potentially proceed to delete the object from the pool,
causing data loss.

This PR enhances #20233 to handle more scenarios such
as these.
2024-08-09 19:30:44 -07:00
Krishnan Parthasarathi
4e67a4027e
Prevent overwrites due to rebalance-stop race (#20233)
Rebalance-stop can race with ongoing rebalance operations. This change
prevents these operations from overwriting objects by checking that the source
and destination pool indices are different.
2024-08-08 19:05:14 -07:00
Klaus Post
49055658a9
Fix missing hash in GetObjectAttributes (#20231)
SHA256/SHA1 were mixed up.

Simplify code as well.
2024-08-08 13:19:41 -07:00
Harshavardhana
89c58ce87d
enhance getActualSize() to return valid values for most situations (#20228) 2024-08-08 08:29:58 -07:00
Mark Theunissen
2681219039
Add dummy PutBucketCors for functional test compatibility (#20220) 2024-08-06 08:41:38 -07:00
Harshavardhana
dea9abed29
use singleflight when bucket metadata is reloaded() (#20216)
this allows for de-duplicating the callers when called
concurrently, allowing bucket metadata reads to be
a single call. All concurrent callers will get the same data
as the first one.
2024-08-05 09:50:11 -07:00
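The de-duplication described in dea9abed29 is the single-flight pattern. The sketch below is a simplified, standard-library-only illustration of that pattern, not the `golang.org/x/sync/singleflight` package and not MinIO's code; `group`, `call`, and `countLoads` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight load and its result.
type call struct {
	wg  sync.WaitGroup
	val string
	err error
}

// group de-duplicates concurrent loads that share the same key.
type group struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do runs fn once per key at a time; callers that arrive while a load is
// in flight wait for it and receive the same result.
func (g *group) Do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // wait for the first caller's result
		return c.val, c.err
	}
	c := &call{}
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	return c.val, c.err
}

// countLoads starts the given number of concurrent callers and reports how
// many real loads ran and whether every caller saw the same data.
func countLoads(callers int) (int64, bool) {
	var (
		g       group
		loads   atomic.Int64
		wg      sync.WaitGroup
		results = make([]string, callers)
	)
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i], _ = g.Do("bucket-a", func() (string, error) {
				loads.Add(1)
				time.Sleep(10 * time.Millisecond) // simulate a slow read
				return "metadata", nil
			})
		}(i)
	}
	wg.Wait()
	same := true
	for _, r := range results {
		if r != "metadata" {
			same = false
		}
	}
	return loads.Load(), same
}

func main() {
	loads, same := countLoads(8)
	fmt.Printf("callers=8 loads=%d same=%v\n", loads, same)
}
```

How many real loads run depends on goroutine scheduling; the point is that callers overlapping an in-flight load do not trigger another one.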
Harshavardhana
e3eb5c1328
batch-exp: Remove 1000 maximum objects per call (#20212)
It seems ObjectAPI.DeleteObjects() is clogging up when it is removing
10k versions of a single object.

Authored-by: Anis Eleuch <anis@min.io>
2024-08-04 21:55:25 -07:00
Poorna
74c047cb03
fix replication last hour metric (#20199)
also adding missing recent_backlog_count metric to v3 metrics
2024-08-01 17:55:27 -07:00
jiuker
50a5ad48fc
feat: support batch replication prefix slice (#20033) 2024-08-01 05:53:30 -07:00
Harshavardhana
a9dc061d84
count metrics properly for any failures during drive heal (#20193)
or via `mc admin heal --set 1 --pool 1`
2024-07-30 22:46:26 -07:00
Krishnan Parthasarathi
01a8c09920
Add fmt-gen subcommand (#20192)
fmt-gen subcommand is only available when built with build tag `fmtgen`.
2024-07-30 15:59:48 -07:00
Aditya Manthramurthy
4c8562bcec
Fix v2 metrics: Send all ttfb api labels (#20191)
Fix a regression in #19733 where TTFB metrics for all APIs except
GetObject were removed in v2 and v3 metrics. This causes breakage for
existing v2 metrics users. Instead we continue to send TTFB for all APIs
in V2 but only send for GetObject in V3.
2024-07-30 15:28:46 -07:00
Harshavardhana
f13c04629b
allow multipart uploads expiration to be dynamic (#20190)
allow multipart uploads expiration to be dynamic

It would seem like the new values will take effect
only after a restart for changes in multipart_expiration.
This PR fixes this by making it dynamic as it should have
been.
2024-07-30 12:01:06 -07:00
Harshavardhana
80ff907d08
add DeleteBulk support, add sufficient deadlines per rename() (#20185)
deadlines per moveToTrash() allows for a more granular timeout
approach for syscalls, instead of an aggregate timeout.

This PR also enhances multipart state cleanup to be optimal by
collapsing hundreds of multipart network rename() calls into a single
network call.
2024-07-29 18:56:40 -07:00
Poorna
2d40433bc1
remove replication throttle deadline for objects > 128MiB (#20184)
context deadline was introduced to avoid a slow transfer from blocking
replication queue(s) shared by other buckets that may not be under throttling.

This PR removes this context deadline for larger objects since they are 
anyway restricted to a limited set of workers. Otherwise, objects would 
get dequeued when the throttle limit is exceeded and cannot proceed 
within the deadline.
2024-07-29 15:14:52 -07:00
Harshavardhana
a17f14f73a
separate lock from common grid to avoid epoll contention (#20180)
epoll contention on TCP causes latency build-up when
we have high volume ingress. This PR is an attempt to
relieve this pressure.

upstream issue https://github.com/golang/go/issues/65064
It seems to be a deeper problem; we haven't yet tried the fix
provided in that issue. However, this change helps even
without changing the compiler.

Of course, this is a workaround for now, hoping for a
more comprehensive fix from Go runtime.
2024-07-29 11:10:04 -07:00
Poorna
6651c655cb
fix replication of checksum when encryption is enabled (#20161)
- Adding functional tests
- Return checksum header on GET/HEAD, previously this was returning
  InvalidPartNumber error
2024-07-29 01:02:16 -07:00
Harshavardhana
3ae104edae
change Read* calls over net/http to move to http.MethodGet (#20173)
- ReadVersion
- ReadFile
- ReadXL

Further changes include:

- Compact internode resource RPC paths
- Compact internode query params

To optimize on parsing by gorilla/mux as the
length of this string increases latency in
gorilla/mux - reduce to a meaningful string.
2024-07-29 01:00:12 -07:00
jiuker
c87a489514
fix: support prefix when batchJob replicate enable the snowball (#20178) 2024-07-29 00:59:50 -07:00
Poorna
641a56da0d
fix panic in replication queuing (#20169)
Regression from #20077

```
Jul 26 19:08:29 minio-dr-0101a minio[275423]: Error: grid handler (NSScanner) panic: runtime error: index out of range [4] with length 1 (*errors.errorString)
Jul 26 19:08:29 minio-dr-0101a minio[275423]:       33: internal/logger/logger.go:268:logger.LogIf()
Jul 26 19:08:29 minio-dr-0101a minio[275423]:       32: internal/grid/connection.go:50:grid.gridLogIf()
Jul 26 19:08:29 minio-dr-0101a minio[275423]:       31: internal/grid/muxserver.go:234:grid.(*muxServer).handleRequests.func1()
Jul 26 19:08:29 minio-dr-0101a minio[275423]:       30: cmd/bucket-replication.go:2165:cmd.(*ReplicationPool).queueReplicaTask()
Jul 26 19:08:29 minio-dr-0101a minio[275423]:       29: cmd/bucket-replication.go:3440:cmd.queueReplicationHeal()
Jul 26 19:08:29 minio-dr-0101a minio[275423]:       28: cmd/data-scanner.go:1396:cmd.(*scannerItem).healReplication()
Jul 26 19:08:29 minio-dr-0101a minio[275423]:       27: cmd/data-scanner.go:1220:cmd.(*scannerItem).applyActions()
Jul 26 19:08:29 minio-dr-0101a minio[275423]:       26: cmd/xl-storage.go:627:cmd.(*xlStorage).NSScanner.func2()
```
2024-07-26 13:48:21 -07:00
Harshavardhana
a16193bb50
remove fdatasync() discard, we write with O_SYNC (#20168)
fdatasync() discard for page-cached READs is not
needed; it seems this can cause latencies
when the system is under load.
2024-07-26 10:27:56 -07:00
jiuker
132e7413ba
fix: check once ready for site-replication (#20149) 2024-07-26 10:27:42 -07:00
Klaus Post
1966668066
Avoid Batch Replication Job log spam (#20158)
Only print once per job and error location.

Set the default retry wait to 1 second, and use it as the minimum.
2024-07-26 05:55:50 -07:00
Harshavardhana
064f36ca5a
move to GET for internal stream READs instead of POST (#20160)
the main reason is to let Go net/http perform necessary
bookkeeping properly, and essentially, from a consistency
point of view, it's GETs all the way.

Deprecate sendFile() as it's buggy inside the Go runtime.
2024-07-26 05:55:01 -07:00
Krishnan Parthasarathi
4a1edfd9aa
Different read quorum for tiered objects (#20115)
For a non-tiered object, MinIO requires that EcM (# of data blocks) of
xl.meta agree, corresponding to the number of data blocks needed to 
read this object.

OTOH, tiered objects have metadata in the hot tier and data in the 
warm tier. The data and its integrity are offloaded to the warm tier. This
allows us to reduce the read quorum from EcM (typically > N/2, where N -
erasure stripe width) to N/2 + 1. The simple majority of metadata
ensures consensus on what the object is and where it is
located.
2024-07-25 14:02:50 -07:00
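The quorum rule stated in 4a1edfd9aa can be written down as simple arithmetic. The `readQuorum` function below is a hypothetical helper that follows the commit message literally; the 16-drive stripe with 12 data blocks is an example chosen for illustration.

```go
package main

import "fmt"

// readQuorum returns how many xl.meta copies must agree before an object
// is read. n is the erasure stripe width and dataBlocks is EcM.
// Non-tiered objects need EcM copies; tiered objects need only a simple
// majority (N/2 + 1), since their data lives in the warm tier.
func readQuorum(n, dataBlocks int, tiered bool) int {
	if tiered {
		return n/2 + 1
	}
	return dataBlocks
}

func main() {
	const n, ecM = 16, 12 // example: 12 data + 4 parity blocks
	fmt.Println("non-tiered:", readQuorum(n, ecM, false))
	fmt.Println("tiered:    ", readQuorum(n, ecM, true))
}
```

With this example layout, a tiered object can be read with 9 agreeing copies of the metadata instead of 12.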
Anis Eleuch
b7f319b62a
properly reload a fresh drive when found in a failed state during startup (#20145)
When a drive is in a failed state when a single node multiple drives
deployment is started, a fresh replacement disk will not be
properly healed unless the user restarts the node.

Fix this by always adding the new fresh disk to globalLocalDrivesMap. Also
remove globalLocalDrives for simplification, a map to store local node
drives can still be used since the order of local drives of a node is
not defined.
2024-07-24 16:30:33 -07:00
Anis Eleuch
33c101544d
kms: Expose API when bucket federation is enabled (#20143)
kms: Expose API available when bucket federation is enabled

When the bucket federation feature is enabled, KMS API calls such
as `mc admin kms key list` will not work.

The commit will fix the issue by disabling bucket forwarding when this
is a KMS request.
2024-07-24 15:44:29 -07:00
Harshavardhana
3b21bb5be8
use unixNanoTime instead of time.Time in lockRequestorInfo (#20140)
Bonus: Skip Source, Quorum fields in lockArgs that are never
sent during Unlock() phase.
2024-07-24 03:24:01 -07:00
Harshavardhana
6fe2b3f901
avoid sendFile() for ranges or object lengths < 4MiB (#20141) 2024-07-24 03:22:50 -07:00
Taran Pelkey
b368d4cc13
Fix updateGroupMembershipsForLDAP behavior with unicode (#20137) 2024-07-23 19:10:03 -07:00
Klaus Post
0680af7414
Set O_NONBLOCK for reads and writes on unix (#20133)
Tracing syscalls, opening and reading an `xl.meta` looks like this:

```
openat(AT_FDCWD, "/mnt/drive1/ss8-old/testbucket/ObjSize4MiBThreads72/(554O51H/peTb(0iztdbTKw59.csv/xl.meta", O_RDONLY|O_NOATIME|O_CLOEXEC) = 34 <0.000>
fcntl(34, F_GETFL)                      = 0x48000 (flags O_RDONLY|O_LARGEFILE|O_NOATIME) <0.000>
fcntl(34, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_NOATIME) = 0 <0.000>
epoll_ctl(4, EPOLL_CTL_ADD, 34, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3172471557, u64=8145488475984499461}}) = -1 EPERM (Operation not permitted) <0.000>
fcntl(34, F_GETFL)                      = 0x48800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_NOATIME) <0.000>
fcntl(34, F_SETFL, O_RDONLY|O_LARGEFILE|O_NOATIME) = 0 <0.000>
fstat(34, {st_mode=S_IFREG|0644, st_size=354, ...}) = 0 <0.000>
read(34, "XL2 \1\0\3\0\306\0\0\1P\2\2\1\304$\225\304\20\0\0\0\0\0\0\0\0\0\0\0"..., 354) = 354 <0.000>
close(34)                               = 0 <0.000>
```

Everything until `fstat` is the `os.Open` call.

Looking at the code: https://github.com/golang/go/blob/master/src/os/file_unix.go#L212-L243

It seems for every file it "tries" to see if it is pollable. This causes `syscall.SetNonblock(fd, true)` to be called. This is the first `F_SETFL`.

It then calls `f.pfd.Init("file", true)`. This will attempt to set it as pollable using `epoll_ctl`. This will always fail for files. It therefore calls `syscall.SetNonblock(fd, false)` resulting in the second `F_SETFL`.

If we set the `O_NONBLOCK` call on the initial open, we should avoid the 4 `fcntl` syscalls per file.

I don't see any way to avoid the `epoll_ctl` call, since kind is either `kindOpenFile` or `kindNonBlock`, so "pollable" will always be true. However avoiding 4 of 6 syscalls still seems worth it.

This should not have any effect, since files will end up with "nonblock" anyway.
2024-07-23 09:36:24 -07:00
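The approach discussed in 0680af7414 (pass `O_NONBLOCK` on the initial open) can be tried out with a few lines of Go. This is a Unix-only sketch under that assumption; `openNonblock` and `roundTrip` are hypothetical helpers and not the functions MinIO uses.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

// openNonblock opens a file read-only with O_NONBLOCK already set, so the
// runtime does not need to toggle the flag with fcntl afterwards.
func openNonblock(name string) (*os.File, error) {
	return os.OpenFile(name, os.O_RDONLY|syscall.O_NONBLOCK, 0)
}

// roundTrip writes content to a temporary file and reads it back through
// openNonblock, showing that regular-file reads behave as usual.
func roundTrip(content string) (string, error) {
	tmp, err := os.CreateTemp("", "xlmeta-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.WriteString(content); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}

	f, err := openNonblock(tmp.Name())
	if err != nil {
		return "", err
	}
	defer f.Close()
	data, err := io.ReadAll(f)
	return string(data), err
}

func main() {
	got, err := roundTrip("XL2 example payload")
	fmt.Println(got, err)
}
```

Running the program under `strace -e trace=openat,fcntl,epoll_ctl` is one way to compare the syscall sequence with the trace quoted in the commit message.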
Harshavardhana
91805bcab6
add optimizations to bring performance on unversioned READS (#20128)
allow non-inlined data on disk to be inlined via
an unversioned ReadVersion() call; we need
ReadXL() only to resolve objects with multiple
versions.

The size of this block is dynamic
and can be chosen by the user via `mc admin config set`

Other bonus things

- Start measuring internode TTFB performance.
- Set TCP_NODELAY, TCP_CORK for low latency
2024-07-23 03:53:03 -07:00
jiuker
b3a94c4e85
fix: Use xtime duration to parse batch job (#20117) 2024-07-23 00:05:53 -07:00
Harshavardhana
8e618d45fc
remove unnecessary LRU for internode auth token (#20119)
removes contentious usage of mutexes in LRU, which
were never really reused in any manner; we do not
need it.

To trust hosts, the correct way is TLS certs; this PR completely
removes this dependency, which has never been useful.

```
0  0%  100%  25.83s 26.76%  github.com/hashicorp/golang-lru/v2/expirable.(*LRU[...])
0  0%  100%  28.03s 29.04%  github.com/hashicorp/golang-lru/v2/expirable.(*LRU[...])
```

Bonus: use `x-minio-time` as a nanosecond timestamp, a more
straightforward mechanism that avoids the unnecessary
parsing logic of time strings.
2024-07-22 00:04:48 -07:00
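The bonus item in 8e618d45fc (carry the time as nanoseconds rather than as a formatted time string) could look like the sketch below. Encoding the value as a base-10 integer string is an assumption made here, and `encodeTime`, `decodeTime`, and `roundTripNanos` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// encodeTime renders t as Unix nanoseconds in base 10 (assumed encoding
// for a header value such as x-minio-time).
func encodeTime(t time.Time) string {
	return strconv.FormatInt(t.UnixNano(), 10)
}

// decodeTime parses a value produced by encodeTime. No layout string or
// time-zone handling is needed, only an integer parse.
func decodeTime(s string) (time.Time, error) {
	ns, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Unix(0, ns), nil
}

// roundTripNanos encodes and decodes a nanosecond timestamp.
func roundTripNanos(ns int64) (int64, error) {
	t, err := decodeTime(encodeTime(time.Unix(0, ns)))
	if err != nil {
		return 0, err
	}
	return t.UnixNano(), nil
}

func main() {
	now := time.Now()
	header := encodeTime(now)
	parsed, err := decodeTime(header)
	fmt.Println(header, parsed.Equal(now), err)
}
```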
Harshavardhana
3ef59d2821
do not set KMSSecretKey env from KMSSecretKeyFile (#20122)
fixes #20121
2024-07-21 14:39:15 -07:00
Anis Eleuch
d9ee668b6d
s3: Fix wrong continuation token during listing with ILM enabled bucket (#20113) 2024-07-18 13:37:34 -07:00
Anis Eleuch
2e5d792f0c
batch-expiry: Save progress regularly in the drives and at the end (#20098)
- Also, fix failure reporting at the end.
- Also, avoid parsing report objects when listing or resuming jobs; this
does not cause any bugs, it only prints errors that are not useful.
2024-07-17 09:42:32 -07:00
Poorna
3535197f99
replication: proxy only on missing object or read quorum err (#20101) 2024-07-16 16:46:41 -07:00
Mark Theunissen
698bb93a46
Allow a KMS Action to specify keys in the Resources of a policy (#20079) 2024-07-16 07:03:03 -07:00
Harshavardhana
e8c54c3d6c
add validation test for v3 metrics for all its endpoints (#20094)
add unit test for v3 metrics for all its exposed endpoints

Bonus:
  - support OpenMetrics encoding
  - adds boot time for prometheus
  - continueOnError is better, to serve as
    many metrics as possible.
2024-07-15 09:28:02 -07:00
Shubhendu
f944a42886
Removed user and group details from logs (#20072)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-07-14 11:12:07 -07:00
Harshavardhana
eff0ea43aa
fix: typo in BucketUsageMetrics group registration in v3 metrics (#20090)
```
curl http://localhost:9000/minio/metrics/v3/cluster/usage/buckets
```

Did not work as documented, due to the fact that there was a typo
in the bucket usage metrics registration group. This endpoint is
a cluster endpoint and does not require any `buckets` argument.
2024-07-14 11:11:42 -07:00
Harshavardhana
7fcb428622
do not print unexpected logs (#20083) 2024-07-12 13:51:54 -07:00
Klaus Post
83adc2eebf
Fix ListObjects aborting after 3 minute on async request (#20074)
When creating the async listing, if the first request does not return within 3 
minutes, it is stopped, since it isn't being kept alive.

Keep updating `lastHandout` while we are waiting for the initial request to be fulfilled.
2024-07-12 09:23:16 -07:00
Poorna
989c318a28
replication: make large workers configurable (#20077)
This PR also improves throttling by reducing tokens requested
from rate limiter based on available tokens to avoid exceeding
throttle wait deadlines
2024-07-12 07:57:31 -07:00