Commit Graph

324 Commits

Author SHA1 Message Date
Harshavardhana
74851834c0
further bootstrap/startup optimization for reading 'format.json' (#18868)
- Move RenameFile to websockets
- Move ReadAll that is primarily is used
  for reading 'format.json' to to websockets
- Optimize DiskInfo calls, and provide a way
  to make a NoOp DiskInfo call.
2024-01-25 12:45:46 -08:00
Harshavardhana
e377bb949a
migrate bootstrap logic directly to websockets (#18855)
improve performance for startup sequences by 2x for 300+ nodes.
2024-01-24 13:36:44 -08:00
Klaus Post
4a6c97463f
Fix all racy use of NewDeadlineWorker (#18861)
AlmosAll uses of NewDeadlineWorker, which relied on secondary values, were used in a racy fashion,
which could lead to inconsistent errors/data being returned. It also propagates the deadline downstream.

Rewrite all these to use a generic WithDeadline caller that can return an error alongside a value.

Remove the stateful aspect of DeadlineWorker - it was racy if used - but it wasn't AFAICT.

Fixes races like:

```
WARNING: DATA RACE
Read at 0x00c130b29d10 by goroutine 470237:
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion()
      github.com/minio/minio/cmd/xl-storage-disk-id-check.go:702 +0x611
  github.com/minio/minio/cmd.readFileInfo()
      github.com/minio/minio/cmd/erasure-metadata-utils.go:160 +0x122
  github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.1()
      github.com/minio/minio/cmd/erasure-object.go:809 +0x27a
  github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.2()
      github.com/minio/minio/cmd/erasure-object.go:828 +0x61

Previous write at 0x00c130b29d10 by goroutine 470298:
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion.func1()
      github.com/minio/minio/cmd/xl-storage-disk-id-check.go:698 +0x244
  github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
      github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33

WARNING: DATA RACE
Write at 0x00c0ba6e6c00 by goroutine 94507:
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol.func1()
      github.com/minio/minio/cmd/xl-storage-disk-id-check.go:419 +0x104
  github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
      github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33

Previous read at 0x00c0ba6e6c00 by goroutine 94463:
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol()
      github.com/minio/minio/cmd/xl-storage-disk-id-check.go:422 +0x47e
  github.com/minio/minio/cmd.getBucketInfoLocal.func1()
      github.com/minio/minio/cmd/peer-s3-server.go:275 +0x122
  github.com/minio/pkg/v2/sync/errgroup.(*Group).Go.func1()
```

Probably back from #17701
2024-01-24 10:08:31 -08:00
Harshavardhana
52229a21cb
avoid reload of 'format.json' over the network under normal conditions (#18842) 2024-01-23 14:11:46 -08:00
Harshavardhana
dd2542e96c
add codespell action (#18818)
Original work here, #18474,  refixed and updated.
2024-01-17 23:03:17 -08:00
Harshavardhana
21d60eab7c
remove all older unused APIs (#18769) 2024-01-17 20:41:23 -08:00
Anis Eleuch
a47fc75c26
xl: Remove wrong wording for errCorruptedFormat (#18775)
Also add errCorruptedBackend to make it easier to differentiate between
corrupted content or something else wrong in the backend drive
2024-01-12 14:48:44 -08:00
Harshavardhana
e5c8794b8b
avoid disk monitoring leaks under various conditions (#18777)
- HealFormat() was leaking healthcheck goroutines for
  disks, we are only interested in enabling healthcheck
  for the newly formatted disk, not for existing disks.

- When disk is a root-disk a random disk monitor was
  leaking while we ignored the drive.

- When loading the disk for each erasure set, we were
  leaking goroutines for the prepare-storage.go disks
  which were replaced via the globalLocalDrives slice

- avoid disk monitoring utilizing health tokens that
  would cause exhaustion in the tokens, prematurely
  which were meant for incoming I/O. This is ensured
  by avoiding writing O_DIRECT aligned buffer instead
  write 2048 worth of content only as O_DSYNC, which is
  sufficient.
2024-01-12 01:48:36 -08:00
Anis Eleuch
3f4488c589
scanner: Allow full throttle if there is no parallel disk ops (#18109) 2024-01-02 13:51:24 -08:00
Harshavardhana
a50ea92c64
feat: introduce list_quorum="auto" to prefer quorum drives (#18084)
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups. 

To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap. 

Once this has been enabled, you need to also set `list_quorum`. 

```
~ mc admin config set alias/ api list_quorum=auto` 
```

A new debugging tool is available to check for any missing counters.
2023-12-29 15:52:41 -08:00
Poorna
e79b289325
fix datadir missing check on HeadObject (#18646)
versions pending purge in replication were seeing a errFileCorrupt
that prevents permanent deletion after replication.

Regression from PR#18477
2023-12-13 14:54:01 -08:00
Harshavardhana
d521c84d55
reduce logging during permission denied errors (#18641)
log them if any only once
2023-12-12 16:11:17 -08:00
Harshavardhana
65f34cd823
fix: remove ODirectReader entirely since we do not need it anymore (#18619) 2023-12-09 10:17:51 -08:00
Anis Eleuch
b7d11141e1
rename Force to Immediate for clarity (#18540) 2023-11-28 22:35:16 -08:00
Krishnan Parthasarathi
9fbd931058
Skip versions expired by DeleteAllVersionsAction (#18537)
Object versions expired by DeleteAllVersionsAction must not be included
toward data-usage accounting.
2023-11-28 08:39:21 -08:00
Anis Eleuch
9cb94eb4a9
cleaning up will delete instead of rename to trash with full disk err (#18534)
moveToTrash() function moves a folder to .trash, for example, when 
doing some object deletions: a data dir that has many parts will be 
renamed to the trash folder; However, ENOSPC is a valid error from 
rename(), and it can cripple a user trying to free some space in an 
entire disk situation.

Therefore, this commit will try to do a recursive delete in that case.
2023-11-27 17:36:02 -08:00
jiuker
be02333529
feat: drive sub-sys to max timeout reload (#18501) 2023-11-27 09:15:06 -08:00
Harshavardhana
fba883839d
feat: bring new HDD related performance enhancements (#18239)
Optionally allows customers to enable 

- Enable an external cache to catch GET/HEAD responses 
- Enable skipping disks that are slow to respond in GET/HEAD 
  when we have already achieved a quorum
2023-11-22 13:46:17 -08:00
Harshavardhana
a4cfb5e1ed
return errors if dataDir is missing during HeadObject() (#18477)
Bonus: allow replication to attempt Deletes/Puts when
the remote returns quorum errors of some kind, this is
to ensure that MinIO can rewrite the namespace with the
latest version that exists on the source.
2023-11-20 21:33:47 -08:00
Klaus Post
51aa59a737
perf: websocket grid connectivity for all internode communication (#18461)
This PR adds a WebSocket grid feature that allows servers to communicate via 
a single two-way connection.

There are two request types:

* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
  roundtrips with small payloads.

* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
  which allows for different combinations of full two-way streams with an initial payload.

Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically on the server names.

Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.

If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte 
slices or use a higher-level generics abstraction.

There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.

The request path can be changed to a new one for any protocol changes.

First, all servers create a "Manager." The manager must know its address 
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it given
the remote address using.

```
func (m *Manager) Connection(host string) *Connection
```

All serverside handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight 
requests and responses must also be given for streaming requests.

The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.

* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
   performs a single request and returns the result. Any deadline provided on the request is
   forwarded to the server, and canceling the context will make the function return at once.

* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
   will initiate a remote call and send the initial payload.

```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
//The appropriate error will be returned.
type Stream struct {
	// Responses from the remote server.
	// Channel will be closed after an error or when the remote closes.
	// All responses *must* be read by the caller until either an error is returned or the channel is closed.
	// Canceling the context will cause the context cancellation error to be returned.
	Responses <-chan Response

	// Requests sent to the server.
	// If the handler is defined with 0 incoming capacity this will be nil.
	// Channel *must* be closed to signal the end of the stream.
	// If the request context is canceled, the stream will no longer process requests.
	Requests chan<- []byte
}

type Response struct {
	Msg []byte
	Err error
}
```

There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
2023-11-20 17:09:35 -08:00
Harshavardhana
91d8bddbd1
use sendfile/splice implementation to perform DMA (#18411)
sendfile implementation to perform DMA on all platforms

Go stdlib already supports sendfile/splice implementations
for

- Linux
- Windows
- *BSD
- Solaris

Along with this change however O_DIRECT for reads() must be
removed as well since we need to use sendfile() implementation

The main reason to add O_DIRECT for reads was to reduce the
chances of page-cache causing OOMs for MinIO, however it would
seem that avoiding buffer copies from user-space to kernel space
this issue is not a problem anymore.

There is no Go based memory allocation required, and neither
the page-cache is referenced back to MinIO. This page-
cache reference is fully owned by kernel at this point, this
essentially should solve the problem of page-cache build up.

With this now we also support SG - when NIC supports Scatter/Gather
https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)
2023-11-10 10:10:14 -08:00
Harshavardhana
ac8c43fe9c
fix: allow missing hot-tier accounting (#18345) 2023-10-30 14:42:11 -07:00
Harshavardhana
877e0cac03
fix: tiering statistics handling a bug in clone() implementation (#18342)
Tiering statistics have been broken for some time now, a regression
was introduced in 6f2406b0b6

Bonus fixes an issue where the objects are not assumed to be
of the 'STANDARD' storage-class for the objects that have
not yet tiered, this should be conditional based on the object's
metadata not a default assumption.

This PR also does some cleanup in terms of implementation,

fixes #18070
2023-10-30 09:59:51 -07:00
Harshavardhana
e1e33077e8
fix: tests and resync replication status (#18244) 2023-10-13 17:03:34 -07:00
Harshavardhana
409c391850
implement helpers to get relevant info instead of FileInfo() (#18228) 2023-10-12 15:29:59 -07:00
Aditya Manthramurthy
2b4531f069
fix: O_DIRECT is on only for multi-disk setups (#18194)
Disable it for single disk/unsupported platforms
2023-10-09 17:08:40 -07:00
Aditya Manthramurthy
4bda4e4e2b
fix: check for disk-level O_DIRECT support (#18173)
Disk level O_DIRECT support checking at xl storage initialization was
conditional on a config setting being enabled. (This never took effect
because config initialization happens after ObjectLayer is ready.) This
is not necessary as the config setting is dynamic - O_DIRECT should be
enabled via runtime config. So we need to do the disk level support
check regardless of the config setting.
2023-10-05 20:54:49 -06:00
Harshavardhana
cdeab19673
fix: always check error upon w.Close() in Write() (#18111)
not checking w.Close() can prematurely make us
think that the w.Write() actually succeeded, apparently
Write() may or may not return an error but sometimes
only during a Close() call to the fd we may see the
error from Write() propagate.

Fdatasync(w) on the FD would return an error requiring
Close() error handling is less of a concern, however it may
happen such that fdatasync() did not return an error, where
as Close() would.
2023-09-26 11:04:00 -07:00
Harshavardhana
ac3a19138a
fix: set scanning details locally to avoid cached values (#18092)
atomic variable results such as scanning must not use
cached values, instead rely on real-time information.
2023-09-25 08:26:29 -07:00
Harshavardhana
8b8be2695f
optimize mkdir calls to avoid base-dir Mkdir attempts (#18021)
Currently we have IOPs of these patterns

```
[OS] os.Mkdir play.min.io:9000 /disk1 2.718µs
[OS] os.Mkdir play.min.io:9000 /disk1/data 2.406µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys 4.068µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp 2.843µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp/d89c8ceb-f8d1-4cc6-b483-280f87c4719f 20.152µs
```

It can be seen that we can save quite Nx levels such as
if your drive is mounted at `/disk1/minio` you can simply
skip sending an `Mkdir /disk1/` and `Mkdir /disk1/minio`.

Since they are expected to exist already, this PR adds a way
for us to ignore all paths upto the mount or a directory which
ever has been provided to MinIO setup.
2023-09-13 08:14:36 -07:00
Harshavardhana
1df5e31706
optimize MRF replication queue to avoid memory leaks (#18007) 2023-09-11 20:59:11 -07:00
Anis Eleuch
41de53996b
heal: calculate the number of workers based on NRRequests (#17945) 2023-09-11 14:48:54 -07:00
Harshavardhana
3995355150
avoid repeated large allocations for large parts (#17968)
objects with 10,000 parts and many of them can
cause a large memory spike which can potentially
lead to OOM due to lack of GC.

with previous PR reducing the memory usage significantly
in #17963, this PR reduces this further by 80% under
repeated calls.

Scanner sub-system has no use for the slice of Parts(),
it is better left empty.

```
benchmark                            old ns/op     new ns/op     delta
BenchmarkToFileInfo/ToFileInfo-8     295658        188143        -36.36%

benchmark                            old allocs     new allocs     delta
BenchmarkToFileInfo/ToFileInfo-8     61             60             -1.64%

benchmark                            old bytes     new bytes     delta
BenchmarkToFileInfo/ToFileInfo-8     1097210       227255        -79.29%
```
2023-09-02 07:49:24 -07:00
Harshavardhana
8208bcb896
remove all unnecessary logging, logOnce when absolutely needed (#17965) 2023-09-01 16:19:18 -07:00
Poorna
b48bbe08b2
Add additional info for replication metrics API (#17293)
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.

This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.

Add prometheus metric to track credential errors since uptime
2023-08-30 01:00:59 -07:00
Harshavardhana
7cafdc0512
fix: skip access checks further for known buckets (#17934) 2023-08-28 15:16:41 -07:00
Harshavardhana
8a57b6bced
use renameat2 Linux extension syscall (#17757)
this is a faster and safer alternative
on newer kernel versions.
2023-08-27 09:57:11 -07:00
Harshavardhana
124e28578c
remove strict persistence requirements for List() .metacache objects (#17917)
.metacache objects are transient in nature, and are better left to
use page-cache effectively to avoid using more IOPs on the disks.

this allows for incoming calls to be not taxed heavily due to
multiple large batch listings.
2023-08-25 07:58:11 -07:00
Harshavardhana
dde1a12819
fix: validate incoming uploadID to be base64 encoded (#17865)
Bonus fixes include

- do not have to write final xl.meta (renameData) does this
  already, saves some IOPs.

- make sure to purge the multipart directory properly using
  a recursive delete, otherwise this can easily pile up and
  rely on the stale uploads cleanup.

fixes #17863
2023-08-17 09:37:55 -07:00
Harshavardhana
6e860b6dc5
count all versions as part of DeleteAllVersionsAction (#17821) 2023-08-09 08:55:19 -07:00
Harshavardhana
cb089dcb52
error out by default beyond 10000 versions per object (#17803)
```
You've exceeded the limit on the number of versions you can create on this object
```
2023-08-04 10:40:21 -07:00
Harshavardhana
0153f96a20
add deadlines for readMetadata() in listing (#17776)
Bonus: also skip spending time looking for xl.json

- Listing()
- Delete()
2023-08-01 21:52:31 -07:00
Harshavardhana
81be718674
fix: optimize DiskInfo() call avoid metrics when not needed (#17763) 2023-07-31 15:20:48 -07:00
Harshavardhana
73edd5b8fd
introduce 'mc admin config set alias/ api odirect=on' (#17753)
change disable_odirect=off -> odirect=on to make it
easier to understand, instead of making it double
negative.
2023-07-31 00:12:53 -07:00
Harshavardhana
f13cfcb83e
allow disabling O_DIRECT for write ops (#17751)
on really slow systems, O_DIRECT simply kills the drives
allow for a way to disable them.
2023-07-29 15:17:56 -07:00
Harshavardhana
731e03fe5a
add ReadFileStream deadline for disk call (#17745)
timeout the reader side if hung via disk max timeout
2023-07-28 15:37:53 -07:00
Harshavardhana
47dcfcbdd4
introduce deadlines on READ operations (#17724) 2023-07-27 07:33:05 -07:00
Harshavardhana
b28bcad11b
avoid Access() calls on known bucket paths (#17719) 2023-07-26 11:31:40 -07:00
Poorna
f95129894d
Use decrypted object size while computing object size summary (#17717)
Corrects an issue with encrypted versioned objects being reported under
`unversioned` bin in the object version histogram
2023-07-24 17:13:25 -07:00
Harshavardhana
14e1ace552
remove serializing WalkDir() across all buckets/prefixes on SSDs (#17707)
slower drives get knocked off because they are too slow via 
active monitoring, we do not need to block calls arbitrarily.

Serializing adds latencies for already slow calls, remove
it for SSDs/NVMEs

Also, add a selection with context when writing to `out <-`
channel, to avoid any potential blocks.
2023-07-24 09:30:19 -07:00