Commit Graph

109 Commits

Author SHA1 Message Date
Klaus Post c7844fb1fb posix: cache disk ID for a short while (#8564)
`*posix.getDiskID()` takes up to 30% of all CPU due to the `os.Stat` call on `GET` calls.

Before:
```
Operation: GET - Concurrency: 12
* Average: 1333.97 MB/s, 1365.99 obj/s, 1365.98 ops ended/s (4m59.975s)
* First Byte: Average: 7.801487ms, Median: 7.9974ms, Best: 1.9822ms, Worst: 110.0021ms

Aggregated, split into 299 x 1s time segments:
* Fastest: 1453.50 MB/s, 1488.38 obj/s, 1492.00 ops ended/s (1s)
* 50% Median: 1360.47 MB/s, 1393.12 obj/s, 1393.00 ops ended/s (1s)
* Slowest: 978.68 MB/s, 1002.17 obj/s, 1004.00 ops ended/s (1s)
```

After:
```
Operation: GET - Concurrency: 12
* Average: 1706.07 MB/s, 1747.02 obj/s, 1747.01 ops ended/s (4m59.985s)
* First Byte: Average: 5.797886ms, Median: 5.9959ms, Best: 996.3µs, Worst: 84.0007ms

Aggregated, split into 299 x 1s time segments:
* Fastest: 1830.03 MB/s, 1873.96 obj/s, 1872.00 ops ended/s (1s)
* 50% Median: 1735.04 MB/s, 1776.68 obj/s, 1776.00 ops ended/s (1s)
* Slowest: 994.94 MB/s, 1018.82 obj/s, 1018.00 ops ended/s (1s)
```

TLDR; `os.Stat` is not free.
2019-11-29 02:57:14 -08:00
Klaus Post 890b493a2e Use random file name for write check (#8563)
Since there may be multiple writes going on concurrently
Use a random file name for the write check to avoid collisions.
2019-11-22 09:50:17 -08:00
Klaus Post 1dd38750f7 Remove read-ahead for small files (#8522)
We should only read ahead if we are reading big files. We enable it for files >= 16MB.

Benchmark on 64KB objects.

Before:

```
Operation: GET
Errors: 0
Average: 59.976s, 87.13 MB/s, 1394.07 ops ended/s.
Fastest: 1s, 90.99 MB/s, 1455.00 ops ended/s.
50% Median: 1s, 87.53 MB/s, 1401.00 ops ended/s.
Slowest: 1s, 81.39 MB/s, 1301.00 ops ended/s.
```

After:

```
Operation: GET
Errors: 0
Average: 59.992s, 207.99 MB/s, 3327.85 ops ended/s.
Fastest: 1s, 219.20 MB/s, 3507.00 ops ended/s.
50% Median: 1s, 210.54 MB/s, 3368.00 ops ended/s.
Slowest: 1s, 179.14 MB/s, 2865.00 ops ended/s.
```

The 64KB buffer is actually a small disadvantage for this case, but I believe it will be better in general than no buffer.
2019-11-14 12:58:41 -08:00
Praveen raj Mani fa325665b1 Do not append the endpoint for fs/xl disks in StorageInfo (#8472) 2019-10-31 09:13:54 -07:00
Krishna Srinivas 980bf78b4d Detect underlying disk mount/unmount (#8408) 2019-10-25 10:37:53 -07:00
Praveen raj Mani 8836d57e3c The prometheus metrics refractoring (#8003)
The measures are consolidated to the following metrics

- `disk_storage_used` : Disk space used by the disk.
- `disk_storage_available`: Available disk space left on the disk.
- `disk_storage_total`: Total disk space on the disk.
- `disks_offline`: Total number of offline disks in current MinIO instance.
- `disks_total`: Total number of disks in current MinIO instance.
- `s3_requests_total`: Total number of s3 requests in current MinIO instance.
- `s3_errors_total`: Total number of errors in s3 requests in current MinIO instance.
- `s3_requests_current`: Total number of active s3 requests in current MinIO instance.
- `internode_rx_bytes_total`: Total number of internode bytes received by current MinIO server instance.
- `internode_tx_bytes_total`: Total number of bytes sent to the other nodes by current MinIO server instance.
- `s3_rx_bytes_total`: Total number of s3 bytes received by current MinIO server instance.
- `s3_tx_bytes_total`: Total number of s3 bytes sent by current MinIO server instance.
- `minio_version_info`: Current MinIO version with commit-id.
- `s3_ttfb_seconds_bucket`: Histogram that holds the latency information of the requests.

And this PR also modifies the current StorageInfo queries

- Decouples StorageInfo from ServerInfo .
- StorageInfo is enhanced to give endpoint information.

NOTE: ADMIN API VERSION IS BUMPED UP IN THIS PR

Fixes #7873
2019-10-22 21:01:14 -07:00
Harshavardhana ff5bf51952 admin/heal: Fix deep healing to heal objects under more conditions (#8321)
- Heal if the part.1 is truncated from its original size
- Heal if the part.1 fails while being verified in between
- Heal if the part.1 fails while being at a certain offset

Other cleanups include make sure to flush the HTTP responses
properly from storage-rest-server, avoid using 'defer' to
improve call latency. 'defer' incurs latency avoid them
in our hot-paths such as storage-rest handlers.

Fixes #8319
2019-10-02 01:42:15 +05:30
Harshavardhana 975134e42b
Add checks in DiskInfo() to protect against changing mounts (#8286) 2019-09-23 15:16:55 -07:00
Anis Elleuch 3f258062d8 bitrot: Verify file size inside storage interface (#7932) 2019-09-12 02:19:53 +05:30
Harshavardhana a7be313230 Start using new errors package (#8207) 2019-09-11 22:51:43 +05:30
Krishna Srinivas c38ada1a26 write() to disk in 4MB blocks for better performance (#7888) 2019-08-23 15:36:46 -07:00
Harshavardhana e6d8e272ce
Use const slashSeparator instead of "/" everywhere (#8028) 2019-08-06 12:08:58 -07:00
Harshavardhana e40c29e834 Fail appropriately if the disk has I/O errors (#7972)
If the disk has I/O errors, we should simply ignore
such a disk and not be bothered about it - until
it is replaced.
2019-07-25 13:35:27 -07:00
Anis Elleuch 000a60f238 xl: Heal empty parts (#7860)
posix.VerifyFile() doesn't know how to check if a file
is corrupted if that file is empty. We do have the part
size in xl.json so we pass it to VerifyFile to return
an error so healing empty parts can work properly.
2019-07-13 00:29:44 +01:00
Krishna Srinivas 58d90ed73c Avoid network transfer for bitrot verification during healing (#7375) 2019-07-08 13:51:18 -07:00
Harshavardhana 39b3e4f9b3 Avoid using io.ReadFull() for WriteAll and CreateFile (#7676)
With these changes we are now able to peak performances
for all Write() operations across disks HDD and NVMe.

Also adds readahead for disk reads, which also increases
performance for reads by 3x.
2019-05-22 13:47:15 -07:00
Krishnan Parthasarathi c871456269 File must be sync'd before closing (#7657)
- group sync and close action into a single defer statement to avoid
  evaluation order related bugs in future.
2019-05-16 18:30:51 -07:00
Harshavardhana b3f22eac56 Offload listing to posix layer (#7611)
This PR adds one API WalkCh which
sorts and sends list over the network

Each disk walks independently in a sorted manner.
2019-05-14 13:49:10 -07:00
Anis Elleuch 9c90a28546 Implement bulk delete (#7607)
Bulk delete at storage level in Multiple Delete Objects API

In order to accelerate bulk delete in Multiple Delete objects API,
a new bulk delete is introduced in storage layer, which will accept
a list of objects to delete rather than only one. Consequently,
a new API is also need to be added to Object API.
2019-05-13 12:25:49 -07:00
Harshavardhana 3eb7a8bde8
Sync before Close() to avoid random I/O (#7638) 2019-05-11 15:03:10 -07:00
poornas cf2a436bc8 Show SlowDown error message if backend is busy (#7521)
or if there are too many open file descriptors.
2019-05-02 07:09:57 -07:00
Praveen raj Mani c113d4e49c Posix CreateFile should work for compressed lengths (#7584) 2019-04-30 16:27:31 -07:00
Krishna Srinivas b93ef73f9b Fix divide by 0 error when directio.AlignSize is 0 (#7591) 2019-04-26 16:08:15 -07:00
Krishna Srinivas a3ec71bc28 Use O_DIRECT while writing to disk (#7479)
- Use O_DIRECT while writing to disk
- Remove MINIO_DRIVE_SYNC option
2019-04-23 21:25:06 -07:00
Harshavardhana f767a2538a
Optimize listing with leaf check offloaded to posix (#7541)
Other listing optimizations include

- remove double sorting while filtering object entries
- improve error message when upload-id is not in quorum
- use jsoniter for full unmarshal json, instead of gjson
- remove unused code
2019-04-23 14:54:28 -07:00
kannappanr 5ecac91a55
Replace Minio refs in docs with MinIO and links (#7494) 2019-04-09 11:39:42 -07:00
Harshavardhana 4a698c731b HealObjects should remove objects without quorum (#7407)
This PR adds a way to list objects without quorum
such that they can purged by `mc admin heal --remove`
2019-03-26 14:57:44 -07:00
Kirill Motkov 3d29ab4059 Rewrite if-else chains to switch statements (#7382) 2019-03-18 07:46:20 -07:00
Harshavardhana 6702d23d52 Simplify ReadFileStream closer, make sure to flush all HTTP responses (#7374) 2019-03-18 10:50:26 +05:30
Krishna Srinivas 6dd26b8231 Detect change in underlying mounted disks (#7229) 2019-02-20 13:32:29 -08:00
Harshavardhana df35d7db9d Introduce staticcheck for stricter builds (#7035) 2019-02-13 18:29:36 +05:30
Krishna Srinivas 6f08edfb36 Use O_EXCL when creating file as we never overwrite an existing file (#7189) 2019-02-01 19:01:06 -08:00
Krishna Srinivas 82af0be1aa Healing process should not heal root disk (#7089) 2019-01-23 15:29:29 -08:00
Harshavardhana 8e0910ab3e Fix build issues on BSDs in pkg/cpu (#7116)
Also add a cross compile script to test always cross
compilation for some well known platforms and architectures
, we support out of box compilation of these platforms even
if we don't make an official release build.

This script is to avoid regressions in this area when we
add platform dependent code.
2019-01-22 09:27:23 +05:30
Krishna Srinivas 98c950aacd Streaming bitrot verification support (#7004) 2019-01-17 18:28:18 +05:30
poornas 5a80cbec2a Add double encryption at S3 gateway. (#6423)
This PR adds pass-through, single encryption at gateway and double
encryption support (gateway encryption with pass through of SSE
headers to backend).

If KMS is set up (either with Vault as KMS or using
MINIO_SSE_MASTER_KEY),gateway will automatically perform
single encryption. If MINIO_GATEWAY_SSE is set up in addition to
Vault KMS, double encryption is performed.When neither KMS nor
MINIO_GATEWAY_SSE is set, do a pass through to backend.

When double encryption is specified, MINIO_GATEWAY_SSE can be set to
"C" for SSE-C encryption at gateway and backend, "S3" for SSE-S3
encryption at gateway/backend or both to support more than one option.

Fixes #6323, #6696
2019-01-05 14:16:42 -08:00
Harshavardhana b9b353db4b Add env to support synchronous ops for all calls (#6877) 2018-12-11 16:22:56 -08:00
Nitish Tiwari 2a810c7da2
Improve du thread performance (#6849) 2018-11-26 10:35:14 +05:30
Harshavardhana f1f23f6f11 Add sync mode for 'xl.json' (#6798)
xl.json is the source of truth for all erasure
coded objects, without which we won't be able to
read the objects properly. This PR enables sync
mode for writing `xl.json` such all writes go hit
the disk and are persistent under situations such
as abrupt power failures on servers running Minio.
2018-11-14 19:48:35 +05:30
Praveen raj Mani ce9d36d954 Add object compression support (#6292)
Add support for streaming (golang/LZ77/snappy) compression.
2018-09-28 09:06:17 +05:30
Anis Elleuch 7571582000 Print storage errors during distributed initialization (#6441)
This commit will print connection failures to other disks in other nodes
after 5 retries. It is useful for users to understand why the
distribued cluster fails to boot up.
2018-09-10 16:21:59 -07:00
Krishna Srinivas ce02ab613d Simplify erasure code by separating bitrot from erasure code (#5959) 2018-08-06 15:14:08 -07:00
Oleg Kovalov 37de2dbd3b simplifying if-else chains to switches (#6208) 2018-08-06 10:26:40 -07:00
Harshavardhana ad86454580 Make sure to handle FaultyDisks in listing ops (#6204)
Continuing from PR 157ed65c35

Our posix.go implementation did not handle  I/O errors
properly on the disks, this led to situations where
top-level callers such as ListObjects might return early
without even verifying all the available disks.

This commit tries to address this in Kubernetes, drbd/nbd based
persistent volumes which can disconnect under load and
result in the situations with disks return I/O errors.

This commit also simplifies listing operation, listing
never returns any error. We can avoid this since we pretty
much ignore most of the errors anyways. When objects are
accessed directly we return proper errors.
2018-07-27 15:32:19 -07:00
Krishna Srinivas 40ed0d1f5d Support 1GB disk size (#6137)
Pivotal CF by default has 1GB disk option which causes minio to not start
2018-07-09 18:23:49 -07:00
Harshavardhana de251483d1 Avoid ticker timer to simplify disk usage (#6101)
This PR simplifies the code to avoid tracking
any running usage events. This PR also brings
in an upper threshold of upto 1 minute suspend
the usage function after which the usage would
proceed without waiting any longer.
2018-06-28 15:05:45 -07:00
Praveen raj Mani ea76e72054 Incorrect error message for insufficient volume fix (#6099)
Reply back with appropriate error message when the server is spawn
with volume of insufficient size (< 1GiB).

Fixes #5993.
2018-06-28 12:01:05 -07:00
Harshavardhana 25de775560 disable disk-usage when export is root mount path (#6091)
disk usage crawling is not needed when a tenant
is not sharing the same disk for multiple other
tenants. This PR adds an optimization when we
see a setup uses entire disk, we simply rely on
statvfs() to give us total usage.

This PR also additionally adds low priority
scheduling for usage check routine, such that
other go-routines blocked will be automatically
unblocked and prioritized before usage.
2018-06-27 18:59:38 -07:00
Ashish Kumar Sinha 0bbdd02a57 Updating disk storage for FS/Erasure mode (#6081)
Updating the disk storage stats for FS/Erasure coded backend
2018-06-25 10:46:48 -07:00
Harshavardhana cb9ee1584a Fix TestHealStartNStatusHandler sporadic failure (#6015)
Fixes #5818
2018-06-12 16:36:31 -07:00