Commit Graph

81 Commits

Author SHA1 Message Date
Harshavardhana
43f973c4cf
fix: check for O_DIRECT support for reads and writes (#11331)
In-case user enables O_DIRECT for reads and backend does
not support it we shall proceed to turn it off instead
and print a warning. This validation avoids any unexpected
downtimes that users may incur.
2021-01-22 15:38:21 -08:00
Harshavardhana
d1a8f0b786
fix possible crashes on deleteMarker replication (#11308)
Delete marker can have `metaSys` set to nil, that
can lead to crashes after the delete marker has
been healed.

Additionally also fix isObjectDangling check
for transitioned objects, that do not have parts
should be treated similar to Delete marker.
2021-01-20 13:12:12 -08:00
Harshavardhana
b5049d541f
fix: reduce an extra readdir() attempted on non-legacy setups (#11301)
to verify moving content and preserving legacy content,
we have way to detect the objects through readdir()
this path is not necessary for most common cases on
newer setups, avoid readdir() to save multiple system
calls.

also fix the CheckFile behavior for most common
use case i.e without legacy format.
2021-01-19 10:01:06 -08:00
Harshavardhana
3ca6330661
fix: optimize parentDirIsObject by moving isObject to storage layer (#11291)
For objects with `N` prefix depth, this PR reduces `N` such network
operations by converting `CheckFile` into a single bulk operation.

Reduction in chattiness here would allow disks to be utilized more
cleanly, while maintaining the same functionality along with one
extra volume check stat() call is removed.

Update tests to test multiple sets scenario
2021-01-18 12:25:22 -08:00
Harshavardhana
f903cae6ff
Support variable server pools (#11256)
Current implementation requires server pools to have
same erasure stripe sizes, to facilitate same SLA
and expectations.

This PR allows server pools to be variadic, i.e they
do not have to be same erasure stripe sizes - instead
they should have SLA for parity ratio.

If the parity ratio cannot be guaranteed by the new
server pool, the deployment is rejected i.e server
pool expansion is not allowed.
2021-01-16 12:08:02 -08:00
Harshavardhana
1a5775e2e8
enable small and large file optimization (#11260)
- for large objects we found that 1MiB block for
  r/w respectively.
- for small objects we found that 128KiB block for
  r/w respectively.
2021-01-12 10:20:39 -08:00
Harshavardhana
e4e117faab
fix: enable xl.json to xl.meta only if legacy drive is found (#11255)
another optimization is renameLegacyMetadata() never needs
to validate bucket with os.Stat() again, leading to reduction
in one extra syscall.
2021-01-11 02:27:04 -08:00
Harshavardhana
4593b146be
fix: print errors only when metacache status has errors (#11248) 2021-01-08 16:52:19 +05:30
Harshavardhana
f21d650ed4
fix: readData in bulk call using messagepack byte wrappers (#11228)
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.

After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.

Benefits are seen from almost all objects from

1KiB - 32KiB

Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.

All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
2021-01-07 19:27:31 -08:00
Harshavardhana
76e2713ffe
fix: use buffers only when necessary for io.Copy() (#11229)
Use separate sync.Pool for writes/reads

Avoid passing buffers for io.CopyBuffer()
if the writer or reader implement io.WriteTo or io.ReadFrom
respectively then its useless for sync.Pool to allocate
buffers on its own since that will be completely ignored
by the io.CopyBuffer Go implementation.

Improve this wherever we see this to be optimal.

This allows us to be more efficient on memory usage.
```
   385  // copyBuffer is the actual implementation of Copy and CopyBuffer.
   386  // if buf is nil, one is allocated.
   387  func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
   388  	// If the reader has a WriteTo method, use it to do the copy.
   389  	// Avoids an allocation and a copy.
   390  	if wt, ok := src.(WriterTo); ok {
   391  		return wt.WriteTo(dst)
   392  	}
   393  	// Similarly, if the writer has a ReadFrom method, use it to do the copy.
   394  	if rt, ok := dst.(ReaderFrom); ok {
   395  		return rt.ReadFrom(src)
   396  	}
```

From readahead package
```
// WriteTo writes data to w until there's no more data to write or when an error occurs.
// The return value n is the number of bytes written.
// Any error encountered during the write is also returned.
func (a *reader) WriteTo(w io.Writer) (n int64, err error) {
	if a.err != nil {
		return 0, a.err
	}
	n = 0
	for {
		err = a.fill()
		if err != nil {
			return n, err
		}
		n2, err := w.Write(a.cur.buffer())
		a.cur.inc(n2)
		n += int64(n2)
		if err != nil {
			return n, err
		}
```
2021-01-06 09:36:55 -08:00
Harshavardhana
d0027c3c41
do not use large buffers if not necessary (#11220)
without this change, there is a performance
regression for small objects GETs, this makes
the overall speed to go back to pre '59d363'
commit days.
2021-01-04 18:51:52 -08:00
Harshavardhana
c4b1d394d6
erasure: avoid io.Copy in hotpaths to reduce allocation (#11213) 2021-01-03 16:27:34 -08:00
Harshavardhana
c4131c2798
feat: Small object optimization read data in single bulk call (#11207) 2021-01-03 11:27:57 -08:00
Anis Elleuch
c9d502e6fa
parentDirIsObject() to return quickly with inexistant parent (#11204)
Rewrite parentIsObject() function. Currently if a client uploads
a/b/c/d, we always check if c, b, a are actual objects or not.

The new code will check with the reverse order and quickly quit if 
the segment doesn't exist.

So if a, b, c in 'a/b/c' does not exist in the first place, then returns
false quickly.
2021-01-02 12:01:29 -08:00
Anis Elleuch
677e80c0f8
xl: Remove check-dir in ReadVersion (#11200)
The only purpose of check-dir flag in
ReadVersion is to return 404 when
an object has xl.meta but without data.

This is causing an extract call to the disk 
which can be penalizing in case of busy system
where disks receive many concurrent access.
2021-01-02 10:35:57 -08:00
Anis Elleuch
a317d220ed
xl-storage: Do not stat bucket assuming the object exists (#11201)
In HEAD/GET, only STAT the bucket if the 
object does not exist to return the correct
error response.
2021-01-01 09:44:36 -08:00
Harshavardhana
cc457f1798
fix: enhance logging in crawler use console.Debug instead of logger.Info (#11179) 2020-12-29 01:57:28 -08:00
Harshavardhana
445a9bd827
fix: heal optimizations in crawler to avoid multiple healing attempts (#11173)
Fixes two problems

- Double healing when bitrot is enabled, instead heal attempt
  once in applyActions() before lifecycle is applied.

- If applyActions() is successful and getSize() returns proper
  value, then object is accounted for and should be removed
  from the oldCache namespace map to avoid double heal attempts.
2020-12-28 10:31:00 -08:00
Harshavardhana
c19e6ce773
avoid a crash in crawler when lifecycle is not initialized (#11170)
Bonus for static buffers use bytes.NewReader instead of
bytes.NewBuffer, to use a more reader friendly implementation
2020-12-26 22:58:06 -08:00
Harshavardhana
a773cf48d8
fix: overlapping object and prefix rejected (#11130)
fixes #11129
2020-12-18 08:51:09 -08:00
Harshavardhana
3e83643320
lifecycle improvements and additional debug logging (#11096)
Bonus change fix browser assets
2020-12-13 12:05:54 -08:00
Anis Elleuch
f164085227
xl: Always set root disk to true in test environment (#11094)
Tests environments (go test or manual testing) should always consider
the passed disks are root disks and should not rely on disk.IsRootDisk()
function. The reason is that this latter can return a false negative
when called in a busy system. However, returning a false negative will
only occur in a testing environment and not in a production, so we can
accept this trade-off for now.
2020-12-12 16:10:07 -08:00
Harshavardhana
d8c1f93de6
reject mixed drive situations with drives on root disks (#11057)
till now we used to match the inode number of the root
drive and the drive path minio would use, if they match
we knew that its a root disk.

this may not be true in all situations such as running
inside a container environment where the container might
be mounted from a different partition altogether, root
disk detection might fail.
2020-12-09 00:27:02 -08:00
Ritesh H Shukla
038bcd9079
Add replication capacity metrics support in crawler (#10786) 2020-12-07 13:47:48 -08:00
Klaus Post
a896125490
Add crawler delay config + dynamic config values (#11018) 2020-12-04 09:32:35 -08:00
Harshavardhana
96c0ce1f0c
add support for tuning healing to make healing more aggressive (#11003)
supports `mc admin config set <alias> heal sleep=100ms` to
enable more aggressive healing under certain times.

also optimize some areas that were doing extra checks than
necessary when bitrotscan was enabled, avoid double sleeps
make healing more predictable.

fixes #10497
2020-12-02 11:12:00 -08:00
Harshavardhana
bdd094bc39
fix: avoid sending errors on missing objects on locked buckets (#10994)
make sure multi-object delete returned errors that are AWS S3 compatible
2020-11-28 21:15:45 -08:00
Harshavardhana
df93102235
fix: unwrapping issues with os.Is* functions (#10949)
reduces  3 stat calls, reducing the
overall startup time significantly.
2020-11-23 08:36:49 -08:00
Poorna Krishnamoorthy
39f3d5493b
Show Delete replication status header (#10946)
X-Minio-Replication-Delete-Status header shows the
status of the replication of a permanent delete of a version.

All GETs are disallowed and return 405 on this object version.
In the case of replicating delete markers.

X-Minio-Replication-DeleteMarker-Status shows the status 
of replication, and would similarly return 405.

Additionally, this PR adds reporting of delete marker event completion
and updates documentation
2020-11-21 23:48:50 -08:00
Poorna Krishnamoorthy
1ebf6f146a Add support for ILM transition (#10565)
This PR adds transition support for ILM
to transition data to another MinIO target
represented by a storage class ARN. Subsequent
GET or HEAD for that object will be streamed from
the transition tier. If PostRestoreObject API is
invoked, the transitioned object can be restored for
duration specified to the source cluster.
2020-11-19 18:47:17 -08:00
Harshavardhana
9a34fd5c4a Revert "Revert "Add delete marker replication support (#10396)""
This reverts commit 267d7bf0a9.
2020-11-19 18:43:58 -08:00
Harshavardhana
267d7bf0a9 Revert "Add delete marker replication support (#10396)"
This reverts commit 50c10a5087.

PR is moved to origin/dev branch
2020-11-12 11:43:14 -08:00
Poorna Krishnamoorthy
50c10a5087
Add delete marker replication support (#10396)
Delete marker replication is implemented for V2
configuration specified in AWS spec (though AWS
allows it only in the V1 configuration).

This PR also brings in a MinIO only extension of
replicating permanent deletes, i.e. deletes specifying
version id are replicated to target cluster.
2020-11-10 15:24:14 -08:00
Harshavardhana
fde3299bf3
re-use optimized readdir for isDirEmpty() (#10829)
reduces effective memory usage by an order
of magnitude, also increases performance for
small objects
2020-11-04 13:05:21 -08:00
Harshavardhana
1a1f00fa15
fix: use internode data for DisksInfo, VolsInfo in message pack (#10821)
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.

Additionally, remove all unused walkers from the storage layer
2020-11-04 10:10:54 -08:00
Klaus Post
37749f4623
Optimize FileInfo(Version) transfer (#10775)
File Info decoding, in particular, is showing up as a major 
allocator and time consumer for internode data transfers

Switch to message pack for cross-server transfers:

```
MSGP:

Size: 945 bytes

BenchmarkEncodeFileInfoMsgp-32    	 1558444	       866 ns/op	   1.16 MB/s	       0 B/op	       0 allocs/op
BenchmarkDecodeFileInfoMsgp-32    	  479968	      2487 ns/op	   0.40 MB/s	     848 B/op	      18 allocs/op

GOB:

Size: 1409 bytes

BenchmarkEncodeFileInfoGOB-32    	  333339	      3237 ns/op	   0.31 MB/s	     576 B/op	      19 allocs/op
BenchmarkDecodeFileInfoGOB-32    	   20869	     57837 ns/op	   0.02 MB/s	   16439 B/op	     428 allocs/op
```
2020-11-02 17:07:52 -08:00
Klaus Post
86e0d272f3
Reduce WriteAll allocs (#10810)
WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers 
used by `io.CopyBuffer` even if they are pooled.

Since all writers appear to write byte buffers, just send those 
instead and write directly. The files are opened through the `os` 
package so they have no special properties anyway.

This removes the alloc and copy for each operation.

REST sends content length so a precise alloc can be made.
2020-11-02 16:14:31 -08:00
Krishna Srinivas
3a2f89b3c0
fix: add support for O_DIRECT reads for erasure backends (#10718) 2020-10-30 11:04:29 -07:00
Klaus Post
a982baff27
ListObjects Metadata Caching (#10648)
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c

Gist of improvements:

* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no 
  longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
2020-10-28 09:18:35 -07:00
Anis Elleuch
eb95353cb1
fix: Get/HeadObject return 404 on non quorum objects (#10753) 2020-10-26 10:30:46 -07:00
Anis Elleuch
00124c56d9
erasure: Commit data before xl.meta in RenameData() (#10734)
This will reduce the chance to have updated xl.meta without data.
2020-10-23 21:54:58 -07:00
Harshavardhana
2042d4873c
rename crawler config option to heal (#10678) 2020-10-14 13:51:51 -07:00
Klaus Post
03991c5d41
crawler: Remove waitForLowActiveIO (#10667)
Only use dynamic delays for the crawler. Even though the max wait was 1 second the number 
of waits could severely impact crawler speed.

Instead of relying on a global metric, we use the stateless local delays to keep the crawler 
running at a speed more adjusted to current conditions.

The only case we keep it is before bitrot checks when enabled.
2020-10-13 13:45:08 -07:00
Harshavardhana
a0d0645128
remove safeMode behavior in startup (#10645)
In almost all scenarios MinIO now is
mostly ready for all sub-systems
independently, safe-mode is not useful
anymore and do not serve its original
intended purpose.

allow server to be fully functional
even with config partially configured,
this is to cater for availability of actual
I/O v/s manually fixing the server.

In k8s like environments it will never make
sense to take pod into safe-mode state,
because there is no real access to perform
any remote operation on them.
2020-10-09 09:59:52 -07:00
Harshavardhana
736e58dd68
fix: handle concurrent lockers with multiple optimizations (#10640)
- select lockers which are non-local and online to have
  affinity towards remote servers for lock contention

- optimize lock retry interval to avoid sending too many
  messages during lock contention, reduces average CPU
  usage as well

- if bucket is not set, when deleteObject fails make sure
  setPutObjHeaders() honors lifecycle only if bucket name
  is set.

- fix top locks to list out always the oldest lockers always,
  avoid getting bogged down into map's unordered nature.
2020-10-08 12:32:32 -07:00
Harshavardhana
2b4eb87d77
pick disks which are common maximally used (#10600)
further optimization to ensure that good disks
are always used for listing, other than healing
we only use disks that are maximally used.
2020-09-29 22:54:02 -07:00
Harshavardhana
00eb6f6bc9
cache DiskInfo at storage layer for performance (#10586)
`mc admin info` on busy setups will not move HDD
heads unnecessarily for repeated calls, provides
a better responsiveness for the call overall.

Bonus change allow listTolerancePerSet be N-1
for good entries, to avoid skipping entries
for some reason one of the disk went offline.
2020-09-29 09:54:41 -07:00
Harshavardhana
66174692a2
add '.healing.bin' for tracking currently healing disk (#10573)
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.

There are more pending changes where we should move
all the disk formatting logic to backend drives, this
PR doesn't deal with this refactor instead makes it
easier to track healing in the future.
2020-09-28 19:39:32 -07:00
Harshavardhana
7f9498f43f
fix: ignore faulty drives and continue (#10511)
drives might return different types of errors
handle them individually, and for some errors
just log an error and continue
2020-09-18 12:09:05 -07:00
Klaus Post
34859c6d4b
Preallocate (safe) slices when we know the size (#10459) 2020-09-14 20:44:18 -07:00