Klaus Post
dca7cf7200
select: Support Parquet dates ( #11928 )
...
Pass schema to parser to support dates.
Fixes #11926
2021-04-03 08:25:19 -07:00
mailsmail
27eb4ae3bc
fix: sql cast function when converting to float ( #11817 )
2021-03-19 09:14:38 -07:00
Klaus Post
e5a1a2a974
s3 select: fix date_diff behavior ( #11786 )
...
Fixes #11785 and adds tests for samples given.
2021-03-15 14:15:52 -07:00
Klaus Post
952b0f111d
Update S2 compression ( #11753 )
...
Relevant updates:
* Fewer allocations on decode: https://github.com/klauspost/compress/pull/322
* Fixed rare out-of-bounds write on amd64.
* ARM64 decompression assembly. Around 2x output speed. https://github.com/klauspost/compress/pull/324
* Speed up decompression on non-assembly platforms. https://github.com/klauspost/compress/pull/328
Upgrade cpuid to match simdjson.
2021-03-10 09:41:29 -08:00
Harshavardhana
f53d1de87f
fix: missing data on multiple columns reading parquet ( #11499 )
...
fixes #11413
2021-02-10 08:49:48 -08:00
Klaus Post
19fb1086b2
select: Fix leak on compressed files ( #11302 )
...
Properly close gzip reader when done reading
fixes #11300
2021-01-19 17:51:46 -08:00
Harshavardhana
b43906f6ee
fix: docs typos and keywords
2020-12-23 11:59:20 -08:00
Klaus Post
02aecb2fc1
select: Check if CSV is valid utf8 ( #10991 )
...
Check if first block of data is valid utf8.
Fixes #10970
2020-12-01 09:09:06 -08:00
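The check described in #10991 can be sketched with `unicode/utf8`. This is an assumption-laden illustration, not MinIO's code: `firstBlockIsUTF8` and its `blockSize` parameter are hypothetical, and the handling of a multi-byte rune cut off at the block boundary is one possible design, not necessarily the one the commit uses.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstBlockIsUTF8 reports whether the first blockSize bytes of data are
// valid UTF-8. Hypothetical helper; blockSize is assumed to be positive.
func firstBlockIsUTF8(data []byte, blockSize int) bool {
	if len(data) > blockSize {
		data = data[:blockSize]
		// The cut may land inside a multi-byte rune. Find the last rune
		// start and drop it if its encoding is incomplete.
		for i := len(data) - 1; i >= 0 && i >= len(data)-utf8.UTFMax; i-- {
			if utf8.RuneStart(data[i]) {
				if !utf8.FullRune(data[i:]) {
					data = data[:i]
				}
				break
			}
		}
	}
	return utf8.Valid(data)
}

func main() {
	fmt.Println(firstBlockIsUTF8([]byte("name,city\nJosé,Málaga\n"), 1<<10))
	// A Latin-1 encoded byte (0xE9) is not valid UTF-8 here.
	fmt.Println(firstBlockIsUTF8([]byte("caf\xe9,1\n"), 1<<10))
}
```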
Harshavardhana
23e8390997
fix: Allow Walk to honor load balanced drives ( #10610 )
2020-10-01 20:24:34 -07:00
飞雪无情
4de88e87bb
os.SEEK_SET is deprecated, use io.SeekStart. ( #10563 )
2020-09-25 03:12:25 -07:00
Harshavardhana
0537a21b79
avoid concurrent use of rand.NewSource ( #10543 )
2020-09-22 15:34:27 -07:00
Klaus Post
0987069e37
select: Fix integer conversion overflow ( #10437 )
...
Do not convert float value to integer if it will over/underflow.
The comparison cannot be `<=` since rounding may overflow it.
Fixes #10436
2020-09-08 15:56:11 -07:00
Harshavardhana
caad314faa
add ruleguard support, fix all the reported issues ( #10335 )
2020-08-24 12:11:20 -07:00
Klaus Post
adca28801d
feat: disable Parquet by default (breaking change) ( #9920 )
...
I have built a fuzz test and it crashes heavily in seconds and will OOM shortly after.
It seems like supporting Parquet is basically a completely open way to crash the
server if you can upload a file and run s3 select on it.
Until Parquet is more hardened it is DISABLED by default since hostile
crafted input can easily crash the server.
If you are in a controlled environment where it is safe to assume no hostile
content can be uploaded to your cluster you can safely enable Parquet.
To enable Parquet set the environment variable `MINIO_API_SELECT_PARQUET=on`
while starting the MinIO server.
Furthermore, we guard parquet by recover functions.
2020-08-18 10:23:28 -07:00
Bruce Wang
e464a5bfbc
Fix bug with fields that contain leading or trailing spaces ( #10079 )
...
String x might contain leading or trailing spaces, and it needs to be
trimmed. For example, in csv files, a field may be padded with spaces
yet ought to meet a query condition that contains the value without
those spaces. This applies to both intCast and floatCast functions.
2020-07-21 12:57:09 -07:00
Anis Elleuch
778e9c864f
Move dependency from minio-go v6 to v7 ( #10042 )
2020-07-14 09:38:05 -07:00
Klaus Post
2e338e84cb
fix low-hanging crashes in parquet S3 select ( #9921 )
2020-07-01 08:15:41 -07:00
Klaus Post
2d0f65a5e3
Add archived parquet as int. package ( #9912 )
...
Since github.com/minio/parquet-go is archived add it as internal package.
2020-06-25 07:31:16 -07:00
Harshavardhana
1bc32215b9
enable full linter across the codebase ( #9620 )
...
enable golangci-lint across the codebase
to run a bunch of linters together;
we shall enable new linters as we fix more
things in the codebase.
This PR completes the first stage of this
cleanup.
2020-05-18 09:59:45 -07:00
Frank Wessels
086be07bf5
Fix ndjson unsupported ( #9500 )
2020-05-01 08:06:29 -07:00
Klaus Post
e4900b99d7
s3 select: Infer types for comparison ( #9438 )
2020-04-24 13:02:59 -07:00
Anis Elleuch
9902c9baaa
sql: Add support of escape quote in CSV ( #9231 )
...
This commit modifies csv parser, a fork of golang csv
parser to support a custom quote escape character.
The quote escape character is used to escape the quote
character when a csv field contains a quote character
as part of data.
2020-04-01 15:39:34 -07:00
Klaus Post
8d98662633
re-implement data usage crawler to be more efficient ( #9075 )
...
Implementation overview:
https://gist.github.com/klauspost/1801c858d5e0df391114436fdad6987b
2020-03-18 16:19:29 -07:00
Anis Elleuch
35ecc04223
Support configurable quote character parameter in Select ( #8955 )
2020-03-13 22:09:34 -07:00
Harshavardhana
603cf2a8bb
fix: broken gzip handling with Select API ( #9128 )
...
This PR fixes a regression introduced in a1c7c9ea73
2020-03-12 15:34:11 -07:00
Aditya Manthramurthy
cec8cdb35e
S3Select: Handle array selection in from clause ( #9076 )
2020-03-10 22:34:58 -07:00
ebozduman
a1c7c9ea73
Matches s3 invalid compression format error for 'mc sql' ( #9067 )
2020-03-05 19:34:04 -08:00
Klaus Post
e4020fb41f
SIMDJSON S3 select input ( #8401 )
2020-02-13 14:03:52 -08:00
Anis Elleuch
de924605a1
Import CSV parser library ( #8927 )
...
The CSV library code is imported from Go 1.13.6
2020-02-07 16:25:36 +05:30
Bruce Wang
c476b27a65
Comment typo "index max" to "index map" ( #8700 )
2019-12-24 21:57:43 -08:00
Klaus Post
bf3a97d3aa
S3 Select: Concurrent LINES delimited json parsing ( #8610 )
...
The speedup is ~5x on a 6 core CPU
2019-12-09 06:55:31 -08:00
Klaus Post
f1e2e1cc9e
S3 Select: Mismatched types don't match ( #8608 )
...
When comparing for equality, if types cannot be matched, they don't match.
2019-12-06 07:24:41 -08:00
Harshavardhana
5d3d57c12a
Start using error wrapping with fmt.Errorf ( #8588 )
...
Use fatih/errwrap to fix all the code to use
error wrapping with fmt.Errorf()
2019-12-02 09:28:01 -08:00
Klaus Post
1c90a6bd49
S3 Select: Convert CSV data to JSON ( #8464 )
2019-11-09 09:10:35 -08:00
Klaus Post
26e760ee62
Fix JSON Close data race. ( #8486 )
...
The JSON stream library has no safe way of aborting while
Since we cannot expect the caller to safely handle "Read" and "Close" calls we must handle this.
Also any Read error returned from upstream will crash the server. We preserve the errors and instead always return io.EOF upstream, but send the error on Close.
`readahead v1.3.1` handles Read after Close better.
Updates to `progressReader` are mostly to ensure safety.
Fixes #8481
2019-11-05 14:20:37 -08:00
Klaus Post
38e6d911ea
S3 Select: Detect full object ( #8456 )
...
Check if select is `SELECT s.* from S3Object s` and forward it to All
Fixes #8371 and makes this case run significantly faster.
2019-10-30 13:46:55 +05:30
Klaus Post
51456e6adc
Select: Support Square Bracket Lists ( #8457 )
...
Allows for S3 compatible `SELECT * from s3object s WHERE id IN [3,2]`
Fixes #8422
2019-10-30 11:34:40 +05:30
Harshavardhana
d48fd6fde9
Remove unused params and functions ( #8399 )
2019-10-15 18:35:41 -07:00
Klaus Post
002ac82631
S3 Select: Add parser support for lists. ( #8329 )
2019-10-06 07:52:45 -07:00
Klaus Post
c1a17c2561
S3 Select: Aggregate AVG/SUM as float ( #8326 )
...
Force sum/average to be calculated as a float.
As noted in #8221
> run SELECT AVG(CAST (Score as int)) FROM S3Object on
```
Name,Score
alice,80
bob,81
```
> AWS S3 gives 80.5 and MinIO gives 80.
This also makes overflows much more unlikely.
2019-09-27 16:12:03 -07:00
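The difference reported in #8221 is plain integer division. A small sketch with the sample scores from the commit message shows both behaviours; `avgInt` and `avgFloat` are hypothetical names used only for this illustration.

```go
package main

import "fmt"

// avgInt averages with integer arithmetic, which truncates the result.
func avgInt(vals []int64) int64 {
	if len(vals) == 0 {
		return 0
	}
	var sum int64
	for _, v := range vals {
		sum += v
	}
	return sum / int64(len(vals))
}

// avgFloat accumulates and divides as float64, keeping the fraction.
func avgFloat(vals []int64) float64 {
	if len(vals) == 0 {
		return 0
	}
	var sum float64
	for _, v := range vals {
		sum += float64(v)
	}
	return sum / float64(len(vals))
}

func main() {
	scores := []int64{80, 81} // alice and bob from the sample CSV
	fmt.Println(avgInt(scores), avgFloat(scores))
}
```

Accumulating in float64 also explains the commit's remark about overflows: the sum no longer wraps around at the int64 limit, at the cost of precision for very large totals.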
Klaus Post
1c5b05c130
S3 select: Fix output conversion on select * ( #8303 )
...
Fixes #8268
2019-09-27 12:33:14 -07:00
Klaus Post
be313f1758
S3 Select: Workaround java buffer size ( #8312 )
...
Updates #7475
The Java implementation has a 128KB buffer and a message must be emitted before that is used. #7475 therefore limits the message size to 128KB. But up to 256 bytes are written to the buffer in each call, so a message must be emitted while the buffer still has at least 256 bytes free.
Therefore we change the limit to 128KB minus 256 bytes.
2019-09-26 04:56:20 +05:30
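The arithmetic behind #8312 can be checked with a small sketch. It assumes, as the commit message states, that no single write exceeds 256 bytes; the constant names and the `batch` function are hypothetical and do not reflect MinIO's actual writer.

```go
package main

import "fmt"

const (
	clientBufferSize = 128 << 10 // 128KB buffer in the Java client
	maxWriteSize     = 256       // most bytes appended in a single call
	flushLimit       = clientBufferSize - maxWriteSize
)

// batch appends records to the current message and emits the message as
// soon as it reaches flushLimit. Because a record adds at most
// maxWriteSize bytes, no message can exceed clientBufferSize.
func batch(records [][]byte) [][]byte {
	var out [][]byte
	var cur []byte
	for _, r := range records {
		cur = append(cur, r...)
		if len(cur) >= flushLimit {
			out = append(out, cur)
			cur = nil
		}
	}
	if len(cur) > 0 {
		out = append(out, cur)
	}
	return out
}

func main() {
	records := make([][]byte, 1000)
	for i := range records {
		records[i] = make([]byte, maxWriteSize)
	}
	for _, m := range batch(records) {
		fmt.Println(len(m), len(m) <= clientBufferSize)
	}
}
```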
Klaus Post
520552ffa9
S3 select: flush when reaching limit ( #8279 )
...
Add missing flush when reaching select limit.
2019-09-20 11:00:17 -07:00
Klaus Post
dac1cf5a9a
S3 Select: Parsing tweaks ( #8261 )
...
* Don't output empty lines.
* Trim whitespace from byte to int/float/bool conversions.
2019-09-17 17:21:23 -07:00
Klaus Post
c9b8bd8de2
S3 Select: optimize output ( #8238 )
...
Queue output items and reuse them.
Remove the unneeded type system in sql and just use the Go type system.
In best case this is more than an order of magnitude speedup:
```
BenchmarkSelectAll_1M-12 1 1841049400 ns/op 274299728 B/op 4198522 allocs/op
BenchmarkSelectAll_1M-12 14 84833400 ns/op 169228346 B/op 3146541 allocs/op
```
2019-09-17 05:56:27 +05:30
Klaus Post
017456df63
Wait clearing the close channel ( #8250 )
...
Close channel should not be nilled before goroutines have exited.
Fixes potential hang on closing.
2019-09-16 16:18:01 -07:00
Klaus Post
ddea0bdf11
Concurrent CSV parsing and reduce S3 select allocations ( #8200 )
...
```
CSV parsing, BEFORE:
BenchmarkReaderBasic-12 2842 407533 ns/op 397860 B/op 957 allocs/op
BenchmarkReaderReplace-12 2718 429914 ns/op 397844 B/op 957 allocs/op
BenchmarkReaderReplaceTwo-12 2718 435556 ns/op 397855 B/op 957 allocs/op
BenchmarkAggregateCount_100K-12 171 6798974 ns/op 16667102 B/op 308077 allocs/op
BenchmarkAggregateCount_1M-12 19 65657411 ns/op 168057743 B/op 3146610 allocs/op
BenchmarkSelectAll_10M-12 1 20882119900 ns/op 2758799896 B/op 41978762 allocs/op
CSV parsing, AFTER:
BenchmarkReaderBasic-12 3721 312549 ns/op 101920 B/op 338 allocs/op
BenchmarkReaderReplace-12 3776 318810 ns/op 101993 B/op 340 allocs/op
BenchmarkReaderReplaceTwo-12 3610 330967 ns/op 102012 B/op 341 allocs/op
BenchmarkAggregateCount_100K-12 295 4149588 ns/op 3553623 B/op 103261 allocs/op
BenchmarkAggregateCount_1M-12 30 37746503 ns/op 33827931 B/op 1049435 allocs/op
BenchmarkSelectAll_10M-12 1 17608495800 ns/op 1416504040 B/op 21007082 allocs/op
~ benchcmp old.txt new.txt
benchmark old ns/op new ns/op delta
BenchmarkReaderBasic-12 407533 312549 -23.31%
BenchmarkReaderReplace-12 429914 318810 -25.84%
BenchmarkReaderReplaceTwo-12 435556 330967 -24.01%
BenchmarkAggregateCount_100K-12 6798974 4149588 -38.97%
BenchmarkAggregateCount_1M-12 65657411 37746503 -42.51%
BenchmarkSelectAll_10M-12 20882119900 17608495800 -15.68%
benchmark old allocs new allocs delta
BenchmarkReaderBasic-12 957 338 -64.68%
BenchmarkReaderReplace-12 957 340 -64.47%
BenchmarkReaderReplaceTwo-12 957 341 -64.37%
BenchmarkAggregateCount_100K-12 308077 103261 -66.48%
BenchmarkAggregateCount_1M-12 3146610 1049435 -66.65%
BenchmarkSelectAll_10M-12 41978762 21007082 -49.96%
benchmark old bytes new bytes delta
BenchmarkReaderBasic-12 397860 101920 -74.38%
BenchmarkReaderReplace-12 397844 101993 -74.36%
BenchmarkReaderReplaceTwo-12 397855 102012 -74.36%
BenchmarkAggregateCount_100K-12 16667102 3553623 -78.68%
BenchmarkAggregateCount_1M-12 168057743 33827931 -79.87%
BenchmarkSelectAll_10M-12 2758799896 1416504040 -48.66%
```
```
BenchmarkReaderHuge/97K-12 2200 540840 ns/op 184.32 MB/s 1604450 B/op 687 allocs/op
BenchmarkReaderHuge/194K-12 1522 752257 ns/op 265.04 MB/s 2143135 B/op 1335 allocs/op
BenchmarkReaderHuge/389K-12 1190 947858 ns/op 420.69 MB/s 3221831 B/op 2630 allocs/op
BenchmarkReaderHuge/778K-12 806 1472486 ns/op 541.61 MB/s 5201856 B/op 5187 allocs/op
BenchmarkReaderHuge/1557K-12 426 2575269 ns/op 619.36 MB/s 9101330 B/op 10233 allocs/op
BenchmarkReaderHuge/3115K-12 286 4034656 ns/op 790.66 MB/s 12397968 B/op 16099 allocs/op
BenchmarkReaderHuge/6230K-12 172 6830563 ns/op 934.05 MB/s 16008416 B/op 26844 allocs/op
BenchmarkReaderHuge/12461K-12 100 11409467 ns/op 1118.39 MB/s 22655163 B/op 48107 allocs/op
BenchmarkReaderHuge/24922K-12 66 19780395 ns/op 1290.19 MB/s 35158559 B/op 90216 allocs/op
BenchmarkReaderHuge/49844K-12 34 37282559 ns/op 1369.03 MB/s 60528624 B/op 174497 allocs/op
```
2019-09-13 14:18:35 -07:00
Yao Zongyou
18fedc67d5
friendly prompt for s3select MalformedXML error ( #8171 )
...
partly fix #7911
2019-09-09 21:33:27 -07:00
Yao Zongyou
ec9bfd3aef
speed up the performance of s3select on csv ( #7945 )
2019-08-31 00:07:40 -07:00
Kanagaraj M
12353caf35
Fix: Support Unicode delimiters in s3 select ( #7931 )
2019-07-17 19:10:17 +01:00