Commit Graph

118 Commits

Author SHA1 Message Date
Klaus Post cc60d66909
Fix incremental usage accounting (#12871)
Remote caches were not returned correctly, so they would not get updated on save.

Furthermore make some tweaks for more reliable updates.

Invalidate bloom filter to ensure rescan.
2021-08-04 09:14:14 -07:00
Harshavardhana 035882d292
fix: remove parentIsObject() check (#12851)
we will allow situations such as

```
a/b/1.txt
a/b
```

and

```
a/b
a/b/1.txt
```

we are going to document that this usecase is
not supported and we will never support it, if
any application does this users have to delete
the top level parent to make sure namespace is
accessible at lower level.

rest of the situations where the prefixes get
created across sets are supported as is.
2021-08-03 13:26:57 -07:00
Klaus Post d6a2fe02d3
Add admin file inspector (#12635)
Download files from *any* bucket/path as an encrypted zip file.

The key is included in the response but can be separated so zip 
and the key doesn't have to be sent on the same channel.

Requires https://github.com/minio/pkg/pull/6
2021-07-09 11:29:16 -07:00
Harshavardhana da74e2f167
move internal/net to pkg/net package (#12505) 2021-06-14 14:54:37 -07:00
Anis Elleuch 6c8be64cdb
rest: healthcheck should not update failure metrics (#12458)
Otherwise, we can see high numbers of networking issues when a node is
down.
2021-06-08 14:09:26 -07:00
Harshavardhana 1f262daf6f
rename all remaining packages to internal/ (#12418)
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
2021-06-01 14:59:40 -07:00
Klaus Post 2ca9c533ef
feat: implement in-progress partial bucket updates (#12279) 2021-05-19 14:38:30 -07:00
Anis Elleuch 56d4d7b8b1
MRF: Better detection of non stable disks (#12252)
MRF does not detect when a node is disconnected and reconnected quickly
this change will ensure that MRF is alerted by comparing the last disk
reconnection timestamp with the last MRF check time.

Signed-off-by: Anis Elleuch <anis@min.io>

Co-authored-by: Klaus Post <klauspost@gmail.com>
2021-05-11 09:19:15 -07:00
Harshavardhana e84f533c6c
add missing wait groups for certain io.Pipe() usage (#12264)
wait groups are necessary with io.Pipes() to avoid
races when a blocking function may not be expected
and a Write() -> Close() before Read() races on each
other. We should avoid such situations..

Co-authored-by: Klaus Post <klauspost@gmail.com>
2021-05-11 09:18:37 -07:00
Harshavardhana 336c8ac99f
fix: do not heal when disks are down (#12186)
HeadObject() was erroneously attempting
a heal when disks are down, avoid it.
2021-04-29 09:54:16 -07:00
Harshavardhana d501c5e38b
add missing responseBody drain (#12147)
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-26 08:59:54 -07:00
Harshavardhana 069432566f update license change for MinIO
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Harshavardhana bb1198c2c6
revert CreateFile waitForResponse (#12124)
instead use expect continue timeout, and have
higher response header timeout, the new higher
timeout satisfies worse case scenarios for total
response time on a CreateFile operation.

Also set the "expect" continue header to satisfy
expect continue timeout behavior.

Some clients seem to cause CreateFile body to be
truncated, leading to no errors which instead
fails with ObjectNotFound on a PUT operation,
this change avoids such failures appropriately.

Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 10:18:18 -07:00
Harshavardhana 2ef824bbb2
collapse two distinct calls into single RenameData() call (#12093)
This is an optimization by reducing one extra system call,
and many network operations. This reduction should increase
the performance for small file workloads.
2021-04-20 10:44:39 -07:00
Harshavardhana d46386246f
api: Introduce metadata update APIs to update only metadata (#11962)
Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.

This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.

This helps reduce the chattiness for following calls..

- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
2021-04-04 13:32:31 -07:00
Harshavardhana 3c571472e0
avoid network read errors crashing CreateFile call (#11939)
Thanks to @dvaldivia for reproducing this
2021-03-31 18:44:45 -07:00
Harshavardhana 79564656eb
xl: CreateFile shouldn't prematurely timeout (#11878)
For large objects taking more than '3 minutes' response
times in a single PUT operation can timeout prematurely
as 'ResponseHeader' timeout hits for 3 minutes. Avoid
this by keeping the connection active during CreateFile
phase.
2021-03-24 09:05:03 -07:00
Harshavardhana 21cfc4aa49 Revert "xl: CreateFile shouldn't prematurely timeout (#11854)"
This reverts commit 922c7b57f5.
2021-03-23 23:47:45 -07:00
Harshavardhana 922c7b57f5
xl: CreateFile shouldn't prematurely timeout (#11854)
For large objects taking more than '3 minutes' response
times in a single PUT operation can timeout prematurely
as 'ResponseHeader' timeout hits for 3 minutes. Avoid
this by keeping the connection active during CreateFile
phase.
2021-03-22 18:25:05 -07:00
Harshavardhana d971061305
use listPathRaw for HealObjects() instead of expensive WalkVersions() (#11675) 2021-03-06 09:25:48 -08:00
Klaus Post fa9cf1251b
Imporve healing and reporting (#11312)
* Provide information on *actively* healing, buckets healed/queued, objects healed/failed.
* Add concurrent healing of multiple sets (typically on startup).
* Add bucket level resume, so restarts will only heal non-healed buckets.
* Print summary after healing a disk is done.
2021-03-04 14:36:23 -08:00
Harshavardhana 9171d6ef65
rename all references from crawl -> scanner (#11621) 2021-02-26 15:11:42 -08:00
Klaus Post 03172b89e2
Ensure cache has finished deserializing (#11620)
Make sure that response has been fully deserialized before returning.
2021-02-24 02:59:49 -08:00
Harshavardhana c31d2c3fdc
fix: CrawlAndGetDataUsage close pipe() before using a new one (#11600)
also additionally make sure errors during deserializer closes
the reader with right error type such that Write() end
actually see the final error, this avoids a waitGroup usage
and waiting.
2021-02-22 10:04:32 -08:00
Harshavardhana 79b6a43467
fix: avoid timed value for network calls (#11531)
additionally simply timedValue to have RWMutex
to avoid concurrent calls to DiskInfo() getting
serialized, this has an effect on all calls that
use GetDiskInfo() on the same disks.

Such as getOnlineDisks, getOnlineDisksWithoutHealing
2021-02-12 18:17:52 -08:00
Anis Elleuch b3f81e75f6
xl: Make it clear when to create delete marker for a non existant object (#11423) 2021-02-03 10:33:43 -08:00
Klaus Post 2680772d4b
Don't mark remotes online when shutting down (#11368)
Shutting down will mark remotes online when the shutdown has 
started since the context is canceled.

For example:

```
API: SYSTEM()
Time: 16:21:31 CET 01/28/2021
DeploymentID: 313b0065-c5a1-4aa3-9233-07223e77a730
Error: Storage resources are insufficient for the write operation .minio.sys/tmp/ced455c4-3d27-4bdd-95fc-b4707a179b8a/fd934ef3-8fc8-4330-abc1-f039fbbb9700/part.1 (cmd.InsufficientWriteQuorum)
       1: d:\minio\minio\cmd\data-usage.go:56:cmd.storeDataUsageInBackend()
Exiting on signal: INTERRUPT
Client http://127.0.0.1:9002/minio/lock/v5 online
Client http://127.0.0.1:9002/minio/storage/data/distxl/s2/d3/v24 online
Client http://127.0.0.1:9002/minio/storage/data/distxl/s2/d2/v24 online
Client http://127.0.0.1:9002/minio/storage/data/distxl/s2/d1/v24 online
Client http://127.0.0.1:9002/minio/peer/v12 online
Client http://127.0.0.1:9002/minio/storage/data/distxl/s2/d4/v24 online
```

Use a fresh context for health checks.
2021-01-28 13:38:12 -08:00
Harshavardhana f21d650ed4
fix: readData in bulk call using messagepack byte wrappers (#11228)
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.

After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.

Benefits are seen from almost all objects from

1KiB - 32KiB

Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.

All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
2021-01-07 19:27:31 -08:00
Harshavardhana d0027c3c41
do not use large buffers if not necessary (#11220)
without this change, there is a performance
regression for small objects GETs, this makes
the overall speed to go back to pre '59d363'
commit days.
2021-01-04 18:51:52 -08:00
Harshavardhana c4131c2798
feat: Small object optimization read data in single bulk call (#11207) 2021-01-03 11:27:57 -08:00
Anis Elleuch c9d502e6fa
parentDirIsObject() to return quickly with inexistant parent (#11204)
Rewrite parentIsObject() function. Currently if a client uploads
a/b/c/d, we always check if c, b, a are actual objects or not.

The new code will check with the reverse order and quickly quit if 
the segment doesn't exist.

So if a, b, c in 'a/b/c' does not exist in the first place, then returns
false quickly.
2021-01-02 12:01:29 -08:00
Anis Elleuch 677e80c0f8
xl: Remove check-dir in ReadVersion (#11200)
The only purpose of check-dir flag in
ReadVersion is to return 404 when
an object has xl.meta but without data.

This is causing an extract call to the disk 
which can be penalizing in case of busy system
where disks receive many concurrent access.
2021-01-02 10:35:57 -08:00
Harshavardhana 2eb52ca5f4
fix: heal bucket metadata right before healing bucket (#11097)
optimization mainly to avoid listing the entire
`.minio.sys/buckets/.minio.sys` directory, this
can get really huge and comes in the way of startup
routines, contents inside `.minio.sys/buckets/.minio.sys`
are rather transient and not necessary to be healed.
2020-12-13 11:57:08 -08:00
Klaus Post 4bca62a0bd
crawler: Stream bucket usage cache data (#11068)
Stream bucket caches to storage and through RPC calls.
2020-12-10 13:03:22 -08:00
Harshavardhana f794fe79e3
fix: network shutdown was not handle properly (#10927)
fixes a regression introduced in #10859, due
to the error returned by rest.Client being typed
i.e *rest.NetworkError - IsNetworkHostDown function
didn't work as expected to detect network issues.

This in-turn aggravated the situations when nodes
are disconnected leading to performance loss.
2020-11-19 13:53:49 -08:00
Harshavardhana 80b8ce89a4
remove context deadline from Delete calls (#10901) 2020-11-17 09:09:45 -08:00
Harshavardhana 17a5ff51ff
fix: move context timeout closer to network for Delete calls (#10897)
allowing for disconnects to be limited to the drive
themselves instead of disconnecting all drives.
2020-11-13 16:56:45 -08:00
Klaus Post 06899210a7
Reduce health check output (#10859)
This will make the health check clients 'silent'.
Use `IsNetworkOrHostDown` determine if network is ok so it mimics the functionality in the actual client.
2020-11-10 09:28:23 -08:00
Harshavardhana 1a1f00fa15
fix: use internode data for DisksInfo, VolsInfo in message pack (#10821)
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.

Additionally, remove all unused walkers from the storage layer
2020-11-04 10:10:54 -08:00
Klaus Post 1e11b4629f
Add remote Diskinfo caching (#10824)
Add 1 second remote disk info cache.

Should decrease need for remote calls a great deal due to how actively it is used now.
2020-11-04 08:00:18 -08:00
Klaus Post 37749f4623
Optimize FileInfo(Version) transfer (#10775)
File Info decoding, in particular, is showing up as a major 
allocator and time consumer for internode data transfers

Switch to message pack for cross-server transfers:

```
MSGP:

Size: 945 bytes

BenchmarkEncodeFileInfoMsgp-32    	 1558444	       866 ns/op	   1.16 MB/s	       0 B/op	       0 allocs/op
BenchmarkDecodeFileInfoMsgp-32    	  479968	      2487 ns/op	   0.40 MB/s	     848 B/op	      18 allocs/op

GOB:

Size: 1409 bytes

BenchmarkEncodeFileInfoGOB-32    	  333339	      3237 ns/op	   0.31 MB/s	     576 B/op	      19 allocs/op
BenchmarkDecodeFileInfoGOB-32    	   20869	     57837 ns/op	   0.02 MB/s	   16439 B/op	     428 allocs/op
```
2020-11-02 17:07:52 -08:00
Klaus Post 86e0d272f3
Reduce WriteAll allocs (#10810)
WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers 
used by `io.CopyBuffer` even if they are pooled.

Since all writers appear to write byte buffers, just send those 
instead and write directly. The files are opened through the `os` 
package so they have no special properties anyway.

This removes the alloc and copy for each operation.

REST sends content length so a precise alloc can be made.
2020-11-02 16:14:31 -08:00
Harshavardhana 4c773f7068
re-use remote transports in Peer,Storage,Locker clients (#10788)
use one transport for internode communication
2020-11-02 07:43:11 -08:00
Klaus Post e63a44b734
rest client: Expect context timeouts for locks (#10782)
Add option for rest clients to not mark a remote offline for context timeouts.

This can be used if context timeouts are expected on the call.
2020-10-29 09:52:11 -07:00
Klaus Post a982baff27
ListObjects Metadata Caching (#10648)
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c

Gist of improvements:

* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no 
  longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
2020-10-28 09:18:35 -07:00
Anis Elleuch eb95353cb1
fix: Get/HeadObject return 404 on non quorum objects (#10753) 2020-10-26 10:30:46 -07:00
Harshavardhana 029758cb20
fix: retain the previous UUID for newly replaced drives (#10759)
only newly replaced drives get the new `format.json`,
this avoids disks reloading their in-memory reference
format, ensures that drives are online without
reloading the in-memory reference format.

keeping reference format in-tact means UUIDs
never change once they are formatted.
2020-10-26 10:29:29 -07:00
Harshavardhana 71b97fd3ac
fix: connect disks pre-emptively during startup (#10669)
connect disks pre-emptively upon startup, to ensure we have
enough disks are connected at startup rather than wait
for them.

we need to do this to avoid long wait times for server to
be online when we have servers come up in rolling upgrade
fashion
2020-10-13 18:28:42 -07:00
Harshavardhana 1f9abbee4d
make sure to release locks upon timeout (#10596)
fixes #10418
2020-09-29 15:18:34 -07:00
Harshavardhana 00eb6f6bc9
cache DiskInfo at storage layer for performance (#10586)
`mc admin info` on busy setups will not move HDD
heads unnecessarily for repeated calls, provides
a better responsiveness for the call overall.

Bonus change allow listTolerancePerSet be N-1
for good entries, to avoid skipping entries
for some reason one of the disk went offline.
2020-09-29 09:54:41 -07:00