Commit Graph

38 Commits

Author SHA1 Message Date
Klaus Post b890bbfa63
Add local disk health checks (#14447)
The main goal of this PR is to solve the situation where disks stop 
responding to operations. This generally causes an FD build-up and 
eventually will crash the server.

This adds detection of hung disks, where calls on disk get stuck.

We add functionality to `xlStorageDiskIDCheck` where it keeps 
track of the number of concurrent requests on a given disk.

A total number of 100 operations are allowed. If this limit is reached 
we will block (but not reject) new requests, but we will monitor the 
state of the disk.

If no requests have been completed or updated within a 15-second 
window, we mark the disk as offline. Requests that are blocked will be 
unblocked and return an error as "faulty disk".

New requests will be rejected until the disk is marked OK again.

Once a disk has been marked faulty, a check will run every 5 seconds that 
will attempt to write and read back a file. As long as this fails the disk will 
remain faulty.

To prevent lots of long-running requests to mark the disk faulty we 
implement a callback feature that allows updating the status as parts 
of these operations are running.

We add a reader and writer wrapper that will update the status of each 
successful read/write operation. This should allow fine enough granularity 
that a slow, but still operational disk will not reach 15 seconds where 
50 operations have not progressed.

Note that errors themselves are not enough to mark a disk faulty. 
A nil (or io.EOF) error will mark a disk as "good".

* Make concurrent disk setting configurable via `_MINIO_DISK_MAX_CONCURRENT`.

* de-couple IsOnline() from disk health tracker

The purpose of IsOnline() is to ensure that we
reconnect the drive only when the "drive" was

- disconnected from network we need to validate
  if the drive is "correct" and is the same drive
  which belongs to this server.

- drive was replaced we have to format it - we
  support hot swapping of the drives.

IsOnline() is not meant for taking the drive offline
when it is hung, it is not useful we can let the
drive be online instead "return" errors for relevant
calls.

* return errFaultyDisk for DiskInfo() call

Co-authored-by: Harshavardhana <harsha@minio.io>

Possible future Improvements:

* Unify the REST server and local xlStorageDiskIDCheck. This would also improve stats significantly.
* Allow reads/writes to be aborted by the context.
* Add usage stats, concurrent count, blocked operations, etc.
2022-03-09 11:38:54 -08:00
Anis Elleuch 84c690cb07
storage: Use request.Form and avoid mux matching (#13858)
request.Form uses less memory allocation and avoids gorilla mux matching
with weird characters in parameters such as '\n'

- Remove Queries() to avoid matching
- Ensure r.ParseForm is called to populate fields
- Add a unit test for object names with '\n'
2021-12-09 08:38:46 -08:00
Klaus Post c897b6a82d
fix: missing entries on first list resume (#13627)
On first list resume or when specifying a custom markers entries could be missed in rare cases.

Do conservative truncation of entries when forwarding.

Replaces #13619
2021-11-10 10:41:21 -08:00
Klaus Post 4f3317effe
Close stream on panic (#13605)
Always close streamHTTPResponse on panic on main thread to avoid 
write/flush after response handler has returned.
2021-11-08 08:41:27 -08:00
Harshavardhana 5ed781a330
check for context canceled after competing for locks (#13239)
once we have competed for locks, verify if the
context is still valid - this is to ensure that
we do not start readdir() or read() calls on the
drives on canceled connections.
2021-09-17 14:11:01 -07:00
Harshavardhana 66fcd02aa2
de-couple walkMu and walkReadMu for some granularity (#13231)
This commit brings two locks instead of single lock for
WalkDir() calls on top of c25816eabc.

The main reason is to avoid contention between readMetadata()
and ListDir() calls, ListDir() can take time on prefixes that
are huge for readdir() but this shouldn't end up blocking
all readMetadata() operations, this allows for more room for
I/O while not overly penalizing all listing operations.
2021-09-17 12:14:12 -07:00
Harshavardhana 0f01e7ef0f
fix: check for xl.meta as directory fallback (#13023)
Objects uploaded in this format for example

```
mc cp /etc/hosts alias/bucket/foo/bar/xl.meta
mc ls -r alias/bucket/foo/bar
```

Won't list the object, handle this scenario.
2021-08-21 00:12:29 -07:00
Klaus Post c25816eabc
xl walk: Limit walk concurrent IO (#12885)
We are observing heavy system loads, potentially
locking the system up for periods when concurrent
listing operations are performed.

We place a per-disk lock on walk IO operations.
This will minimize the impact of concurrent listing
operations on the entire system and de-prioritize
them compared to other operations.

Single list operations should remain largely unaffected.
2021-08-18 18:10:36 -07:00
Harshavardhana ee028a4693
listObjects optimized to handle max-keys=1 when prefix is object (#13000)
Some applications albeit poorly written rather than using headObject
rely on listObjects to check for existence of object, this unusual
request always has prefix=(to actual object) and max-keys=1

handle this situation specially such that we can avoid readdir()
on the top level parent to avoid sorting and skipping, ensuring
that such type of listObjects() always behaves similar to a
headObject() call.
2021-08-18 18:05:05 -07:00
Harshavardhana 9c65168312
fix: all levels deep flat key match (#12996)
this addresses a regression from #12984
which only addresses flat key from single
level deep at bucket level.

added extra tests as well to cover all
these scenarios.
2021-08-18 07:40:53 -07:00
Harshavardhana 654a6e9871
always set the filter to skip navigating baseDir (#12984)
baseDir is empty if the top level prefix does not
end with `/` this causes large recursive listings
without any filtering, to fix this filtering make
sure to set the filter prefix appropriately.

also do not navigate folders at top level that do
not match the filter prefix, entries don't need
to match prefix since they are never prefixed
with the prefix anyways.
2021-08-17 07:43:24 -07:00
Klaus Post 89febdb3d6
Reuse small buffers (#12948)
When reading metadata allow reuse of buffers 
in certain cases. Take the low-hanging fruit.

Reduce GC overhead when listing.
2021-08-12 14:27:22 -07:00
Harshavardhana a2cd3c9a1d
use ParseForm() to allow query param lookups once (#12900)
```
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkURLQueryForm
BenchmarkURLQueryForm-4         247099363                4.809 ns/op           0 B/op          0 allocs/op
BenchmarkURLQuery
BenchmarkURLQuery-4              2517624               462.1 ns/op           432 B/op          4 allocs/op
PASS
ok      github.com/minio/minio/cmd      3.848s
```
2021-08-07 22:43:01 -07:00
Harshavardhana e124d88788
optimize listing operation concurrency (#12728)
- remove use of getOnlineDisks() instead rely on fallbackDisks()
  when disk return errors like diskNotFound, unformattedDisk
  use other fallback disks to list from, instead of paying the
  price for checking getOnlineDisks()

- optimize getDiskID() further to avoid large write locks when
  looking formatLastCheck time window

This new change allows for a more relaxed fallback for listing
allowing for more tolerance and also eventually gain more
consistency in results even if using '3' disks by default.
2021-07-24 22:03:38 -07:00
Klaus Post 05aebc52c2
feat: Implement listing version 3.0 (#12605)
Co-authored-by: Harshavardhana <harsha@minio.io>
2021-07-05 15:34:41 -07:00
Anis Elleuch f30c996d48
trace: Add bucket/prefix to WalkDir() tracing (#12510)
Bonus, replace os.* API with os-instrumented.go
2021-06-15 14:34:26 -07:00
Harshavardhana 1f262daf6f
rename all remaining packages to internal/ (#12418)
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
2021-06-01 14:59:40 -07:00
Harshavardhana 0287711dc9
fix: implement readMetadata common function for re-use (#12353)
Previous PR #12351 added functions to read from the reader
stream to reduce memory usage, use the same technique in
few other places where we are not interested in reading the
data part.
2021-05-21 11:41:25 -07:00
Klaus Post 9d1b6fb37d
Add XL reader without data (#12351)
Add XL metadata reader that reads metadata only on larger files.

Use for scanning and listing for now.
2021-05-21 09:10:54 -07:00
Harshavardhana d501c5e38b
add missing responseBody drain (#12147)
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-26 08:59:54 -07:00
Harshavardhana 069432566f update license change for MinIO
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Klaus Post 2623338dc5
Inline small file data in xl.meta file (#11758) 2021-03-29 17:00:55 -07:00
Klaus Post 9efcb9e15c
Fix listPathRaw/WalkDir cancelation (#11905)
In #11888 we observe a lot of running, WalkDir calls.

There doesn't appear to be any listerners for these calls, so they should be aborted.

Ensure that WalkDir aborts when upstream cancels the request.

Fixes #11888
2021-03-26 11:18:30 -07:00
Anis Elleuch 0eb146e1b2
add additional metrics per disk API latency, API call counts #11250)
```
mc admin info --json
```

provides these details, for now, we shall eventually 
expose this at Prometheus level eventually. 

Co-authored-by: Harshavardhana <harsha@minio.io>
2021-03-16 20:06:57 -07:00
Harshavardhana 9171d6ef65
rename all references from crawl -> scanner (#11621) 2021-02-26 15:11:42 -08:00
Harshavardhana b517c791e9
[feat]: use DSYNC for xl.meta writes and NOATIME for reads (#11615)
Instead of using O_SYNC, we are better off using O_DSYNC
instead since we are only ever interested in data to be
persisted to disk not the associated filesystem metadata.

For reads we ask customers to turn off noatime, but instead
we can proactively use O_NOATIME flag to avoid atime updates
upon reads.
2021-02-24 00:14:16 -08:00
Klaus Post c5b2a8441b
fix: faster healing when disk is replaced. (#11520) 2021-02-18 11:06:54 -08:00
Harshavardhana c9b0f595b9
support directory objects in listing in certain scenarios (#11452)
When a directory object is presented as a `prefix`
param our implementation tend to only list objects
present common to the `prefix` than the `prefix` itself,
to mimic AWS S3 like flat key behavior this PR ensures
that if `prefix` is directory object, it should be
automatically considered to be part of the eventual
listing result.

fixes #11370
2021-02-05 10:12:25 -08:00
Harshavardhana f71e192343
avoid listing an empty dir without __XLDIR__ (#11427)
```
minio server /tmp/disk{1...4}
mc mb myminio/testbucket/
mkdir -p /tmp/disk{1..4}/testbucket/test-prefix/
```

This would end up being listed in the current
master, this PR fixes this situation.

If a directory is a leaf dir we should it
being listed, since it cannot be deleted anymore
with DeleteObject, DeleteObjects() API calls
because we natively support directories now.

Avoid listing it and let healing purge this folder
eventually in the background.
2021-02-03 14:06:54 -08:00
Harshavardhana 445a9bd827
fix: heal optimizations in crawler to avoid multiple healing attempts (#11173)
Fixes two problems

- Double healing when bitrot is enabled, instead heal attempt
  once in applyActions() before lifecycle is applied.

- If applyActions() is successful and getSize() returns proper
  value, then object is accounted for and should be removed
  from the oldCache namespace map to avoid double heal attempts.
2020-12-28 10:31:00 -08:00
Klaus Post e6ea5c2703
crawler: Missing folder heal check per set (#10876) 2020-12-01 12:07:39 -08:00
Harshavardhana df93102235
fix: unwrapping issues with os.Is* functions (#10949)
reduces  3 stat calls, reducing the
overall startup time significantly.
2020-11-23 08:36:49 -08:00
Harshavardhana 70d2c2ccc9
skip files that are not erasure objects or directories (#10926)
without this change WalkDir reports errors while
trying to read `format.json/xl.meta` which is a
replicated file
2020-11-19 09:15:09 -08:00
Harshavardhana 9dea7020f0
allow prefix filtering for WalkDir to be optional (#10923) 2020-11-18 12:03:16 -08:00
Klaus Post 990d074f7d
metacache: Allow prefix filtering (#10920)
Do listings with prefix filter when bloom filter is dirty.

This will forward the prefix filter to the lister which will make it 
only scan the folders/objects with the specified prefix.

If we have a clean bloom filter we try to build a more generally 
useful cache so in that case, we will list all objects/folders.
2020-11-18 10:44:18 -08:00
Klaus Post b5a3d79bce
listobjectversions: Add shortcut for Veeam blocks (#10893)
Add shortcut for `APN/1.0 Veeam/1.0 Backup/10.0`

It requests unique blocks with a specific prefix. We skip 
scanning the parent directory for more objects matching the prefix.
2020-11-13 16:58:20 -08:00
Klaus Post a3017c724e
Sort directory objects correctly (#10886)
Decode dir objects when listing and sort them correctly.
2020-11-12 13:09:34 -08:00
Klaus Post a982baff27
ListObjects Metadata Caching (#10648)
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c

Gist of improvements:

* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no 
  longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
2020-10-28 09:18:35 -07:00