Commit Graph

56 Commits

Author SHA1 Message Date
Klaus Post
961b0b524e
Do not require restart when a disk is unreachable during node boot (#18576)
A disk that is not able to initialize when an instance is started
will never have a handler registered, which means a user will
need to restart the node after fixing the disk;

This will also prevent showing the wrong 'upgrade is needed.'
error message in that case.

When the disk is still failing, print an error every 30 minutes;
Disk reconnection will be retried every 30 seconds.

Co-authored-by: Anis Elleuch <anis@min.io>
2023-12-01 12:01:14 -08:00
Klaus Post
51aa59a737
perf: websocket grid connectivity for all internode communication (#18461)
This PR adds a WebSocket grid feature that allows servers to communicate via 
a single two-way connection.

There are two request types:

* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
  roundtrips with small payloads.

* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
  which allows for different combinations of full two-way streams with an initial payload.

Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically on the server names.

Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.

If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte 
slices or use a higher-level generics abstraction.

There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.

The request path can be changed to a new one for any protocol changes.

First, all servers create a "Manager." The manager must know its address 
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it given
the remote address using.

```
func (m *Manager) Connection(host string) *Connection
```

All serverside handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight 
requests and responses must also be given for streaming requests.

The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.

* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
   performs a single request and returns the result. Any deadline provided on the request is
   forwarded to the server, and canceling the context will make the function return at once.

* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
   will initiate a remote call and send the initial payload.

```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
//The appropriate error will be returned.
type Stream struct {
	// Responses from the remote server.
	// Channel will be closed after an error or when the remote closes.
	// All responses *must* be read by the caller until either an error is returned or the channel is closed.
	// Canceling the context will cause the context cancellation error to be returned.
	Responses <-chan Response

	// Requests sent to the server.
	// If the handler is defined with 0 incoming capacity this will be nil.
	// Channel *must* be closed to signal the end of the stream.
	// If the request context is canceled, the stream will no longer process requests.
	Requests chan<- []byte
}

type Response struct {
	Msg []byte
	Err error
}
```

There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
2023-11-20 17:09:35 -08:00
Maxim Tkachenko
ec30bb89a4
simplify channel send() in WalkDir() (#18186) 2023-10-09 17:27:55 -07:00
Harshavardhana
b1c1f02132
use buffers for pathJoin, to re-use buffers. (#17960)
```
benchmark                        old ns/op     new ns/op     delta
BenchmarkPathJoin/PathJoin-8     79.6          55.3          -30.53%

benchmark                        old allocs     new allocs     delta
BenchmarkPathJoin/PathJoin-8     2              1              -50.00%

benchmark                        old bytes     new bytes     delta
BenchmarkPathJoin/PathJoin-8     48            24            -50.00%
```
2023-08-31 17:58:48 -07:00
Harshavardhana
7cafdc0512
fix: skip access checks further for known buckets (#17934) 2023-08-28 15:16:41 -07:00
Harshavardhana
0153f96a20
add deadlines for readMetadata() in listing (#17776)
Bonus: also skip spending time looking for xl.json

- Listing()
- Delete()
2023-08-01 21:52:31 -07:00
Harshavardhana
14e1ace552
remove serializing WalkDir() across all buckets/prefixes on SSDs (#17707)
slower drives get knocked off because they are too slow via 
active monitoring, we do not need to block calls arbitrarily.

Serializing adds latencies for already slow calls, remove
it for SSDs/NVMEs

Also, add a selection with context when writing to `out <-`
channel, to avoid any potential blocks.
2023-07-24 09:30:19 -07:00
Kaan Kabalak
21fbe88e1f
Print certain log messages once per error (#17484) 2023-06-24 20:29:13 -07:00
Klaus Post
839b9c9271
Reduce allocations in Walkdir (#17036) 2023-04-15 10:25:25 -07:00
Harshavardhana
e0f4dd6027
remove unncessary logs from WalkDir(), PutObject() (#16818) 2023-03-15 11:52:23 -07:00
Harshavardhana
37134e42d4
ignore io.EOF, io.ErrUnexpectedEOF on xl.meta reads in WalkDir() (#16625) 2023-02-15 07:12:48 -08:00
Klaus Post
f713436dd0
Fix truncated list response on deleted replicated objects (#16504) 2023-01-30 09:13:53 -08:00
Harshavardhana
23b329b9df
remove gateway completely (#15929) 2022-10-24 17:44:15 -07:00
Klaus Post
eee1ce305c
When listing, do not count delete markers (#15689)
When limiting listing do not count delete, since they may be discarded.

Extend limit, since we may be discarding the forward-to marker.

Fix directories always being sent to resolve, since they didn't return as match.
2022-09-14 12:11:27 -07:00
Klaus Post
ff9a74b91f
Add fast max-keys=1 support for Listing (#15670)
Add a listing option to stop when the limit is reached.  
This can be used by stateless listings for fast results.
2022-09-09 08:13:06 -07:00
Klaus Post
037fe4afdc
Add listing block reuse (#15579)
When streaming results, pool metadata slices when sent.
2022-08-24 09:11:16 -07:00
Anis Elleuch
9201870f6c
Remove unnecessary code in WalkDir() (#15168)
Recalculating forward is useless. It is never used and it will be
computed again when calling scanDir() again.
2022-06-27 10:26:56 -07:00
Harshavardhana
5d23be6242
fix: ignore printing io.EOF during WalkDir() on concurrently modified objects (#15100)
fix: ignore print io.EOF during WalkDir() on concurrently modified objects
2022-06-17 08:23:47 -07:00
Klaus Post
b890bbfa63
Add local disk health checks (#14447)
The main goal of this PR is to solve the situation where disks stop 
responding to operations. This generally causes an FD build-up and 
eventually will crash the server.

This adds detection of hung disks, where calls on disk get stuck.

We add functionality to `xlStorageDiskIDCheck` where it keeps 
track of the number of concurrent requests on a given disk.

A total number of 100 operations are allowed. If this limit is reached 
we will block (but not reject) new requests, but we will monitor the 
state of the disk.

If no requests have been completed or updated within a 15-second 
window, we mark the disk as offline. Requests that are blocked will be 
unblocked and return an error as "faulty disk".

New requests will be rejected until the disk is marked OK again.

Once a disk has been marked faulty, a check will run every 5 seconds that 
will attempt to write and read back a file. As long as this fails the disk will 
remain faulty.

To prevent lots of long-running requests to mark the disk faulty we 
implement a callback feature that allows updating the status as parts 
of these operations are running.

We add a reader and writer wrapper that will update the status of each 
successful read/write operation. This should allow fine enough granularity 
that a slow, but still operational disk will not reach 15 seconds where 
50 operations have not progressed.

Note that errors themselves are not enough to mark a disk faulty. 
A nil (or io.EOF) error will mark a disk as "good".

* Make concurrent disk setting configurable via `_MINIO_DISK_MAX_CONCURRENT`.

* de-couple IsOnline() from disk health tracker

The purpose of IsOnline() is to ensure that we
reconnect the drive only when the "drive" was

- disconnected from network we need to validate
  if the drive is "correct" and is the same drive
  which belongs to this server.

- drive was replaced we have to format it - we
  support hot swapping of the drives.

IsOnline() is not meant for taking the drive offline
when it is hung, it is not useful we can let the
drive be online instead "return" errors for relevant
calls.

* return errFaultyDisk for DiskInfo() call

Co-authored-by: Harshavardhana <harsha@minio.io>

Possible future Improvements:

* Unify the REST server and local xlStorageDiskIDCheck. This would also improve stats significantly.
* Allow reads/writes to be aborted by the context.
* Add usage stats, concurrent count, blocked operations, etc.
2022-03-09 11:38:54 -08:00
Anis Elleuch
84c690cb07
storage: Use request.Form and avoid mux matching (#13858)
request.Form uses less memory allocation and avoids gorilla mux matching
with weird characters in parameters such as '\n'

- Remove Queries() to avoid matching
- Ensure r.ParseForm is called to populate fields
- Add a unit test for object names with '\n'
2021-12-09 08:38:46 -08:00
Klaus Post
c897b6a82d
fix: missing entries on first list resume (#13627)
On first list resume or when specifying a custom markers entries could be missed in rare cases.

Do conservative truncation of entries when forwarding.

Replaces #13619
2021-11-10 10:41:21 -08:00
Klaus Post
4f3317effe
Close stream on panic (#13605)
Always close streamHTTPResponse on panic on main thread to avoid 
write/flush after response handler has returned.
2021-11-08 08:41:27 -08:00
Harshavardhana
5ed781a330
check for context canceled after competing for locks (#13239)
once we have competed for locks, verify if the
context is still valid - this is to ensure that
we do not start readdir() or read() calls on the
drives on canceled connections.
2021-09-17 14:11:01 -07:00
Harshavardhana
66fcd02aa2
de-couple walkMu and walkReadMu for some granularity (#13231)
This commit brings two locks instead of single lock for
WalkDir() calls on top of c25816eabc.

The main reason is to avoid contention between readMetadata()
and ListDir() calls, ListDir() can take time on prefixes that
are huge for readdir() but this shouldn't end up blocking
all readMetadata() operations, this allows for more room for
I/O while not overly penalizing all listing operations.
2021-09-17 12:14:12 -07:00
Harshavardhana
0f01e7ef0f
fix: check for xl.meta as directory fallback (#13023)
Objects uploaded in this format for example

```
mc cp /etc/hosts alias/bucket/foo/bar/xl.meta
mc ls -r alias/bucket/foo/bar
```

Won't list the object, handle this scenario.
2021-08-21 00:12:29 -07:00
Klaus Post
c25816eabc
xl walk: Limit walk concurrent IO (#12885)
We are observing heavy system loads, potentially
locking the system up for periods when concurrent
listing operations are performed.

We place a per-disk lock on walk IO operations.
This will minimize the impact of concurrent listing
operations on the entire system and de-prioritize
them compared to other operations.

Single list operations should remain largely unaffected.
2021-08-18 18:10:36 -07:00
Harshavardhana
ee028a4693
listObjects optimized to handle max-keys=1 when prefix is object (#13000)
Some applications albeit poorly written rather than using headObject
rely on listObjects to check for existence of object, this unusual
request always has prefix=(to actual object) and max-keys=1

handle this situation specially such that we can avoid readdir()
on the top level parent to avoid sorting and skipping, ensuring
that such type of listObjects() always behaves similar to a
headObject() call.
2021-08-18 18:05:05 -07:00
Harshavardhana
9c65168312
fix: all levels deep flat key match (#12996)
this addresses a regression from #12984
which only addresses flat key from single
level deep at bucket level.

added extra tests as well to cover all
these scenarios.
2021-08-18 07:40:53 -07:00
Harshavardhana
654a6e9871
always set the filter to skip navigating baseDir (#12984)
baseDir is empty if the top level prefix does not
end with `/` this causes large recursive listings
without any filtering, to fix this filtering make
sure to set the filter prefix appropriately.

also do not navigate folders at top level that do
not match the filter prefix, entries don't need
to match prefix since they are never prefixed
with the prefix anyways.
2021-08-17 07:43:24 -07:00
Klaus Post
89febdb3d6
Reuse small buffers (#12948)
When reading metadata allow reuse of buffers 
in certain cases. Take the low-hanging fruit.

Reduce GC overhead when listing.
2021-08-12 14:27:22 -07:00
Harshavardhana
a2cd3c9a1d
use ParseForm() to allow query param lookups once (#12900)
```
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkURLQueryForm
BenchmarkURLQueryForm-4         247099363                4.809 ns/op           0 B/op          0 allocs/op
BenchmarkURLQuery
BenchmarkURLQuery-4              2517624               462.1 ns/op           432 B/op          4 allocs/op
PASS
ok      github.com/minio/minio/cmd      3.848s
```
2021-08-07 22:43:01 -07:00
Harshavardhana
e124d88788
optimize listing operation concurrency (#12728)
- remove use of getOnlineDisks() instead rely on fallbackDisks()
  when disk return errors like diskNotFound, unformattedDisk
  use other fallback disks to list from, instead of paying the
  price for checking getOnlineDisks()

- optimize getDiskID() further to avoid large write locks when
  looking formatLastCheck time window

This new change allows for a more relaxed fallback for listing
allowing for more tolerance and also eventually gain more
consistency in results even if using '3' disks by default.
2021-07-24 22:03:38 -07:00
Klaus Post
05aebc52c2
feat: Implement listing version 3.0 (#12605)
Co-authored-by: Harshavardhana <harsha@minio.io>
2021-07-05 15:34:41 -07:00
Anis Elleuch
f30c996d48
trace: Add bucket/prefix to WalkDir() tracing (#12510)
Bonus, replace os.* API with os-instrumented.go
2021-06-15 14:34:26 -07:00
Harshavardhana
1f262daf6f
rename all remaining packages to internal/ (#12418)
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
2021-06-01 14:59:40 -07:00
Harshavardhana
0287711dc9
fix: implement readMetadata common function for re-use (#12353)
Previous PR #12351 added functions to read from the reader
stream to reduce memory usage, use the same technique in
few other places where we are not interested in reading the
data part.
2021-05-21 11:41:25 -07:00
Klaus Post
9d1b6fb37d
Add XL reader without data (#12351)
Add XL metadata reader that reads metadata only on larger files.

Use for scanning and listing for now.
2021-05-21 09:10:54 -07:00
Harshavardhana
d501c5e38b
add missing responseBody drain (#12147)
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-26 08:59:54 -07:00
Harshavardhana
069432566f update license change for MinIO
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Klaus Post
2623338dc5
Inline small file data in xl.meta file (#11758) 2021-03-29 17:00:55 -07:00
Klaus Post
9efcb9e15c
Fix listPathRaw/WalkDir cancelation (#11905)
In #11888 we observe a lot of running, WalkDir calls.

There doesn't appear to be any listerners for these calls, so they should be aborted.

Ensure that WalkDir aborts when upstream cancels the request.

Fixes #11888
2021-03-26 11:18:30 -07:00
Anis Elleuch
0eb146e1b2
add additional metrics per disk API latency, API call counts #11250)
```
mc admin info --json
```

provides these details, for now, we shall eventually 
expose this at Prometheus level eventually. 

Co-authored-by: Harshavardhana <harsha@minio.io>
2021-03-16 20:06:57 -07:00
Harshavardhana
9171d6ef65
rename all references from crawl -> scanner (#11621) 2021-02-26 15:11:42 -08:00
Harshavardhana
b517c791e9
[feat]: use DSYNC for xl.meta writes and NOATIME for reads (#11615)
Instead of using O_SYNC, we are better off using O_DSYNC
instead since we are only ever interested in data to be
persisted to disk not the associated filesystem metadata.

For reads we ask customers to turn off noatime, but instead
we can proactively use O_NOATIME flag to avoid atime updates
upon reads.
2021-02-24 00:14:16 -08:00
Klaus Post
c5b2a8441b
fix: faster healing when disk is replaced. (#11520) 2021-02-18 11:06:54 -08:00
Harshavardhana
c9b0f595b9
support directory objects in listing in certain scenarios (#11452)
When a directory object is presented as a `prefix`
param our implementation tend to only list objects
present common to the `prefix` than the `prefix` itself,
to mimic AWS S3 like flat key behavior this PR ensures
that if `prefix` is directory object, it should be
automatically considered to be part of the eventual
listing result.

fixes #11370
2021-02-05 10:12:25 -08:00
Harshavardhana
f71e192343
avoid listing an empty dir without __XLDIR__ (#11427)
```
minio server /tmp/disk{1...4}
mc mb myminio/testbucket/
mkdir -p /tmp/disk{1..4}/testbucket/test-prefix/
```

This would end up being listed in the current
master, this PR fixes this situation.

If a directory is a leaf dir we should it
being listed, since it cannot be deleted anymore
with DeleteObject, DeleteObjects() API calls
because we natively support directories now.

Avoid listing it and let healing purge this folder
eventually in the background.
2021-02-03 14:06:54 -08:00
Harshavardhana
445a9bd827
fix: heal optimizations in crawler to avoid multiple healing attempts (#11173)
Fixes two problems

- Double healing when bitrot is enabled, instead heal attempt
  once in applyActions() before lifecycle is applied.

- If applyActions() is successful and getSize() returns proper
  value, then object is accounted for and should be removed
  from the oldCache namespace map to avoid double heal attempts.
2020-12-28 10:31:00 -08:00
Klaus Post
e6ea5c2703
crawler: Missing folder heal check per set (#10876) 2020-12-01 12:07:39 -08:00
Harshavardhana
df93102235
fix: unwrapping issues with os.Is* functions (#10949)
reduces  3 stat calls, reducing the
overall startup time significantly.
2020-11-23 08:36:49 -08:00