Klaus Post
02aecb2fc1
select: Check if CSV is valid utf8 ( #10991 )
...
Check if first block of data is valid utf8.
Fixes #10970
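A minimal sketch of such a check, assuming the first block is already in memory as a byte slice. The function name `validUTF8Prefix` is hypothetical, and the handling of a rune cut off at the block boundary is an added assumption, not necessarily what the commit does:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validUTF8Prefix (hypothetical name) reports whether block is valid UTF-8.
// A multi-byte rune may be cut off at the end of the block, so an
// incomplete trailing rune is trimmed before the data is rejected.
func validUTF8Prefix(block []byte) bool {
	if utf8.Valid(block) {
		return true
	}
	// Look for the start byte of the last rune within the final bytes.
	for i := 1; i < utf8.UTFMax && i <= len(block); i++ {
		tail := block[len(block)-i:]
		if utf8.RuneStart(tail[0]) {
			if !utf8.FullRune(tail) {
				// Truncated rune: validate everything before it.
				return utf8.Valid(block[:len(block)-i])
			}
			break
		}
	}
	return false
}

func main() {
	fmt.Println(validUTF8Prefix([]byte("id,name\n1,héllo\n")))
	fmt.Println(validUTF8Prefix([]byte("h\xc3"))) // "é" cut in half
	fmt.Println(validUTF8Prefix([]byte("\xff\xfe,a")))
}
```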
2020-12-01 09:09:06 -08:00
Harshavardhana
1bc32215b9
enable full linter across the codebase ( #9620 )
...
Enable golangci-lint across the codebase to run
a set of linters together. New linters will be
enabled as more issues in the codebase are fixed.
This PR covers the first stage of this
cleanup.
2020-05-18 09:59:45 -07:00
Anis Elleuch
9902c9baaa
sql: Add support of escape quote in CSV ( #9231 )
...
This commit modifies the CSV parser, a fork of the Go CSV
parser, to support a custom quote escape character.
The quote escape character is used to escape the quote
character when a CSV field contains a quote character
as part of its data.
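As an illustration only, decoding a single quoted field with a configurable escape character could look like the sketch below. `parseQuoted` is a hypothetical helper, not the forked parser's API, and it assumes the field has already been split out with its surrounding quotes intact:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseQuoted (hypothetical helper) decodes one quoted CSV field.
// An escape character followed by a quote yields a literal quote;
// any other quote inside the field is reported as an error.
func parseQuoted(field string, quote, escape rune) (string, error) {
	r := []rune(field)
	if len(r) < 2 || r[0] != quote || r[len(r)-1] != quote {
		return "", errors.New("field is not quoted")
	}
	inner := r[1 : len(r)-1]
	var sb strings.Builder
	for i := 0; i < len(inner); i++ {
		c := inner[i]
		if c == escape && i+1 < len(inner) && inner[i+1] == quote {
			sb.WriteRune(quote)
			i++ // skip the escaped quote
			continue
		}
		if c == quote {
			return "", errors.New("unescaped quote inside field")
		}
		sb.WriteRune(c)
	}
	return sb.String(), nil
}

func main() {
	// Backslash as the escape character.
	fmt.Println(parseQuoted(`"say \"hi\""`, '"', '\\'))
	// Escape equal to the quote character gives the usual doubled-quote rule.
	fmt.Println(parseQuoted(`"say ""hi"""`, '"', '"'))
}
```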
2020-04-01 15:39:34 -07:00
Anis Elleuch
35ecc04223
Support configurable quote character parameter in Select ( #8955 )
2020-03-13 22:09:34 -07:00
Klaus Post
e4020fb41f
SIMDJSON S3 select input ( #8401 )
2020-02-13 14:03:52 -08:00
Anis Elleuch
de924605a1
Import CSV parser library ( #8927 )
...
The CSV library code is imported from Go 1.13.6
2020-02-07 16:25:36 +05:30
Bruce Wang
c476b27a65
Comment typo "index max" to "index map" ( #8700 )
2019-12-24 21:57:43 -08:00
Harshavardhana
d48fd6fde9
Remove unused params and functions ( #8399 )
2019-10-15 18:35:41 -07:00
Klaus Post
1c5b05c130
S3 select: Fix output conversion on select * ( #8303 )
...
Fixes #8268
2019-09-27 12:33:14 -07:00
Klaus Post
c9b8bd8de2
S3 Select: optimize output ( #8238 )
...
Queue output items and reuse them.
Remove the unneeded type system in sql and just use the Go type system.
In the best case this is more than an order of magnitude speedup:
```
BenchmarkSelectAll_1M-12 1 1841049400 ns/op 274299728 B/op 4198522 allocs/op
BenchmarkSelectAll_1M-12 14 84833400 ns/op 169228346 B/op 3146541 allocs/op
```
2019-09-17 05:56:27 +05:30
Klaus Post
017456df63
Wait clearing the close channel ( #8250 )
...
Close channel should not be nilled before goroutines have exited.
Fixes potential hang on closing.
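A sketch of the ordering described above, using hypothetical names (`reader`, `closeCh`): the close channel is set to nil only after the worker goroutine has exited, because a goroutine that selects on a nil channel blocks forever. `Close` in this sketch is assumed to be called from a single goroutine.

```go
package main

import (
	"fmt"
	"sync"
)

// reader is a hypothetical stand-in for a reader with a background worker.
type reader struct {
	closeCh chan struct{}
	out     chan int
	wg      sync.WaitGroup
}

func newReader() *reader {
	r := &reader{closeCh: make(chan struct{}), out: make(chan int)}
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		for i := 0; ; i++ {
			select {
			case r.out <- i:
			case <-r.closeCh: // would block forever if closeCh were nil
				return
			}
		}
	}()
	return r
}

// Close signals the worker, waits for it to exit, and only then
// clears the channel.
func (r *reader) Close() {
	if r.closeCh == nil {
		return
	}
	close(r.closeCh)
	r.wg.Wait()
	r.closeCh = nil
}

func main() {
	r := newReader()
	fmt.Println(<-r.out)
	r.Close()
	r.Close() // second call is a no-op
}
```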
2019-09-16 16:18:01 -07:00
Klaus Post
ddea0bdf11
Concurrent CSV parsing and reduce S3 select allocations ( #8200 )
...
```
CSV parsing, BEFORE:
BenchmarkReaderBasic-12 2842 407533 ns/op 397860 B/op 957 allocs/op
BenchmarkReaderReplace-12 2718 429914 ns/op 397844 B/op 957 allocs/op
BenchmarkReaderReplaceTwo-12 2718 435556 ns/op 397855 B/op 957 allocs/op
BenchmarkAggregateCount_100K-12 171 6798974 ns/op 16667102 B/op 308077 allocs/op
BenchmarkAggregateCount_1M-12 19 65657411 ns/op 168057743 B/op 3146610 allocs/op
BenchmarkSelectAll_10M-12 1 20882119900 ns/op 2758799896 B/op 41978762 allocs/op
CSV parsing, AFTER:
BenchmarkReaderBasic-12 3721 312549 ns/op 101920 B/op 338 allocs/op
BenchmarkReaderReplace-12 3776 318810 ns/op 101993 B/op 340 allocs/op
BenchmarkReaderReplaceTwo-12 3610 330967 ns/op 102012 B/op 341 allocs/op
BenchmarkAggregateCount_100K-12 295 4149588 ns/op 3553623 B/op 103261 allocs/op
BenchmarkAggregateCount_1M-12 30 37746503 ns/op 33827931 B/op 1049435 allocs/op
BenchmarkSelectAll_10M-12 1 17608495800 ns/op 1416504040 B/op 21007082 allocs/op
~ benchcmp old.txt new.txt
benchmark old ns/op new ns/op delta
BenchmarkReaderBasic-12 407533 312549 -23.31%
BenchmarkReaderReplace-12 429914 318810 -25.84%
BenchmarkReaderReplaceTwo-12 435556 330967 -24.01%
BenchmarkAggregateCount_100K-12 6798974 4149588 -38.97%
BenchmarkAggregateCount_1M-12 65657411 37746503 -42.51%
BenchmarkSelectAll_10M-12 20882119900 17608495800 -15.68%
benchmark old allocs new allocs delta
BenchmarkReaderBasic-12 957 338 -64.68%
BenchmarkReaderReplace-12 957 340 -64.47%
BenchmarkReaderReplaceTwo-12 957 341 -64.37%
BenchmarkAggregateCount_100K-12 308077 103261 -66.48%
BenchmarkAggregateCount_1M-12 3146610 1049435 -66.65%
BenchmarkSelectAll_10M-12 41978762 21007082 -49.96%
benchmark old bytes new bytes delta
BenchmarkReaderBasic-12 397860 101920 -74.38%
BenchmarkReaderReplace-12 397844 101993 -74.36%
BenchmarkReaderReplaceTwo-12 397855 102012 -74.36%
BenchmarkAggregateCount_100K-12 16667102 3553623 -78.68%
BenchmarkAggregateCount_1M-12 168057743 33827931 -79.87%
BenchmarkSelectAll_10M-12 2758799896 1416504040 -48.66%
```
```
BenchmarkReaderHuge/97K-12 2200 540840 ns/op 184.32 MB/s 1604450 B/op 687 allocs/op
BenchmarkReaderHuge/194K-12 1522 752257 ns/op 265.04 MB/s 2143135 B/op 1335 allocs/op
BenchmarkReaderHuge/389K-12 1190 947858 ns/op 420.69 MB/s 3221831 B/op 2630 allocs/op
BenchmarkReaderHuge/778K-12 806 1472486 ns/op 541.61 MB/s 5201856 B/op 5187 allocs/op
BenchmarkReaderHuge/1557K-12 426 2575269 ns/op 619.36 MB/s 9101330 B/op 10233 allocs/op
BenchmarkReaderHuge/3115K-12 286 4034656 ns/op 790.66 MB/s 12397968 B/op 16099 allocs/op
BenchmarkReaderHuge/6230K-12 172 6830563 ns/op 934.05 MB/s 16008416 B/op 26844 allocs/op
BenchmarkReaderHuge/12461K-12 100 11409467 ns/op 1118.39 MB/s 22655163 B/op 48107 allocs/op
BenchmarkReaderHuge/24922K-12 66 19780395 ns/op 1290.19 MB/s 35158559 B/op 90216 allocs/op
BenchmarkReaderHuge/49844K-12 34 37282559 ns/op 1369.03 MB/s 60528624 B/op 174497 allocs/op
```
2019-09-13 14:18:35 -07:00
Yao Zongyou
ec9bfd3aef
speed up the performance of s3select on csv ( #7945 )
2019-08-31 00:07:40 -07:00
Kanagaraj M
12353caf35
Fix: Support Unicode delimiters in s3 select ( #7931 )
2019-07-17 19:10:17 +01:00
Yao Zongyou
c4f480a839
fix csv read bug ( #7885 )
2019-07-05 12:08:56 -07:00
Yao Zongyou
037319066f
fix unicode support related bugs in s3select ( #7877 )
2019-07-05 09:43:10 -07:00
kannappanr
5ecac91a55
Replace Minio refs in docs with MinIO and links ( #7494 )
2019-04-09 11:39:42 -07:00
Aditya Manthramurthy
e463386921
Add JSON Path expression evaluation support ( #7315 )
...
- Includes support for FROM clause JSON path
2019-03-09 08:13:37 -08:00
Aditya Manthramurthy
f4879ed96d
Use jstream to serialize records to JSON format in S3Select ( #7318 )
...
- Also, switch to jstream to generate internal record representation
from CSV/JSON readers
- This fixes a bug in which JSON output objects have their keys
reversed from the order they are specified in the Select columns.
- Also includes a fix for tests.
2019-03-07 00:20:10 -08:00
Harshavardhana
2520e535a0
Allow lazyQuotes for certain types of CSV ( #7278 )
...
Set lazyQuotes to true to allow a quote to appear
in an unquoted field and a non-doubled quote to
appear in a quoted field.
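The effect can be seen with the standard library's `encoding/csv` reader, whose `LazyQuotes` field has the same meaning. The standard library is used here as a stand-in; the helper `parse` is a name made up for this example:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parse (hypothetical helper) reads all records from input,
// with lazy quote handling switched on or off.
func parse(input string, lazy bool) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.LazyQuotes = lazy
	return r.ReadAll()
}

func main() {
	// A bare quote inside an unquoted field.
	input := "a \"quoted\" word,b\n"

	_, err := parse(input, false)
	fmt.Println("strict:", err)

	recs, err := parse(input, true)
	fmt.Println("lazy:", recs, err)
}
```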
2019-02-24 06:51:02 -08:00
Aditya Manthramurthy
4aa9ee153b
Fix S3 Select request XML parsing ( #7202 )
2019-02-06 13:25:52 -08:00
Aditya Manthramurthy
2786055df4
Add new SQL parser to support S3 Select syntax ( #7102 )
...
- New parser written from scratch, allows easier and complete parsing
of the full S3 Select SQL syntax. Parser definition is directly
provided by the AST defined for the SQL grammar.
- Bring support to parse and interpret SQL involving JSON path
expressions; evaluation of JSON path expressions will be
subsequently added.
- Bring automatic type inference and conversion for untyped
values (e.g. CSV data).
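Type inference for untyped CSV values can be sketched as below. `inferValue` and its precedence (integer, then float, then boolean, then string) are assumptions made for illustration and are not the parser's actual rules:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// inferValue (hypothetical) converts an untyped CSV field to the most
// specific Go type that parses: int64, float64, bool, else string.
// Note that strconv.ParseFloat also accepts spellings such as "NaN" and "Inf".
func inferValue(s string) interface{} {
	if i, err := strconv.ParseInt(s, 10, 64); err == nil {
		return i
	}
	if f, err := strconv.ParseFloat(s, 64); err == nil {
		return f
	}
	switch {
	case strings.EqualFold(s, "true"):
		return true
	case strings.EqualFold(s, "false"):
		return false
	}
	return s
}

func main() {
	for _, field := range []string{"42", "3.5", "TRUE", "abc"} {
		v := inferValue(field)
		fmt.Printf("%q -> %T\n", field, v)
	}
}
```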
2019-01-28 17:59:48 -08:00
Bala FA
b0deea27df
Refactor s3select to support parquet. ( #7023 )
...
Also handle pretty formatted JSON documents.
2019-01-08 16:53:04 -08:00