This commit will fix one rare case of a multipart object that
can be read in theory but GetObject API returned an error.
It turned out that a six years old code was marking a drive offline
when the bitrot streaming fails to read a part in a disk with any error.
This can affect reading a subsequent part, though having enough shards,
but unable to construct because one drive was marked offline earlier.
This commit will remove the drive marking offline code. It will also
close the bitrotstreaming reader before marking it as nil.
Replace the `io.Pipe` from streamingBitrotWriter -> CreateFile with a fixed size ring buffer.
This will add an output buffer for encoded shards to be written to disk - potentially via RPC.
This will remove blocking when `(*streamingBitrotWriter).Write` is called, and it writes hashes and data.
With current settings, the write looks like this:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ Parr. │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Pipe │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (unbuffered) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
We write a Hash (32 bytes). Since the pipe is unbuffered, it will block until the 32 bytes have
been delivered to the TCP buffer, and the next Read hits the Pipe.
Then we write the shard data. This will typically be bigger than 64KB, so it will block until two blocks
have been read from the pipe.
When we insert a ring buffer:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Ring Buffer │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (2MB) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
The hash+shard will fit within the ring buffer, so writes will not block - but will complete after a
memcopy. Reads can fill the 64KB buffer if there is data for it.
If the network is congested, the ring buffer will become filled, and all syscalls will be on full buffers.
Only when the ring buffer is filled will erasure coding start blocking.
Since there is always "space" to write output data, we remove the parallel writing since we are
always writing to memory now, and the goroutine synchronization overhead probably not worth taking.
If the output were blocked in the existing, we would still wait for it to unblock in parallel write, so it would
make no difference there - except now the ring buffer smoothes out the load.
There are some micro-optimizations we could look at later. The biggest is that, in most cases,
we could encode directly to the ring buffer - if we are not at a boundary. Also, "force filling" the
Read requests (i.e., blocking until a full read can be completed) could be investigated and maybe
allow concurrent memory on read and write.
for actionable, inspections we have `mc support inspect`
we do not need double logging, healing will report relevant
errors if any, in terms of quorum lost etc.
Large clusters with multiple sets, or multi-pool setups at times might
fail and report unexpected "file not found" errors. This can become
a problem during startup sequence when some files need to be created
at multiple locations.
- This PR ensures that we nil the erasure writers such that they
are skipped in RenameData() call.
- RenameData() doesn't need to "Access()" calls for `.minio.sys`
folders they always exist.
- Make sure PutObject() never returns ObjectNotFound{} for any
errors, make sure it always returns "WriteQuorum" when renameData()
fails with ObjectNotFound{}. Return appropriate errors for all
other cases.
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
- GetObject() should always use a common dataDir to
read from when it starts reading, this allows the
code in erasure decoding to have sane expectations.
- Healing should always heal on the common dataDir, this
allows the code in dangling object detection to purge
dangling content.
These both situations can happen under certain types of
retries during PUT when server is restarting etc, some
namespace entries might be left over.
Previously we added heal trigger when bit-rot checks
failed, now extend that to support heal when parts
are not found either. This healing gets only triggered
if we can successfully decode the object i.e read
quorum is still satisfied for the object.
The only purpose of check-dir flag in
ReadVersion is to return 404 when
an object has xl.meta but without data.
This is causing an extract call to the disk
which can be penalizing in case of busy system
where disks receive many concurrent access.
This PR adds transition support for ILM
to transition data to another MinIO target
represented by a storage class ARN. Subsequent
GET or HEAD for that object will be streamed from
the transition tier. If PostRestoreObject API is
invoked, the transitioned object can be restored for
duration specified to the source cluster.
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
If the requested server is part of the set this will always read
from the local disk, even if the disk contains a parity shard.
In default setup there is a 50% chance that at least
one shard that otherwise would have been fetched remotely
will be read locally instead.
It basically trades RPC call overhead for reed-solomon.
On distributed localhost this seems to be fairly break-even,
with a very small gain in throughput and latency.
However on networked servers this should be a bigger
1MB objects, before:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 76257:
* Avg: 25ms 50%: 24ms 90%: 32ms 99%: 42ms Fastest: 7ms Slowest: 67ms
* First Byte: Average: 23ms, Median: 22ms, Best: 5ms, Worst: 65ms
Throughput:
* Average: 1213.68 MiB/s, 1272.63 obj/s (59.948s, starting 14:45:44 CEST)
```
After:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 78845:
* Avg: 24ms 50%: 24ms 90%: 31ms 99%: 39ms Fastest: 8ms Slowest: 62ms
* First Byte: Average: 22ms, Median: 21ms, Best: 6ms, Worst: 57ms
Throughput:
* Average: 1255.11 MiB/s, 1316.08 obj/s (59.938s, starting 14:43:58 CEST)
```
Bonus fix: Only ask for heal once on an object.
Simplify parallelReader.Read() which also fixes previous
implementation where it was returning before all the parallel
reading go-routines had terminated which caused race conditions.