mirror of https://github.com/minio/minio.git (synced 2025-10-29 15:55:00 -04:00)
			
		
		
		
	vendor: Bring new updates from blake2b-simd repo. (#2094)
This vendorization is needed to bring in new improvements and support for AVX2 and SSE. Fixes #2081
parent 8c767218a4
commit 169c72cdab
							
								
								
									
										136
									
								
								vendor/github.com/minio/blake2b-simd/README.md
									
									
									
										generated
									
									
										vendored
									
									
								
							
							
						
						
									
									
									
								
							| @ -6,15 +6,139 @@ Pure Go implementation of BLAKE2b using SIMD optimizations. | ||||
| Introduction | ||||
| ------------ | ||||
| 
 | ||||
| This package is based on the pure go [BLAKE2b](https://github.com/dchest/blake2b) implementation of Dmitry Chestnykh and merges it with the (`cgo` dependent) SSE optimized [BLAKE2](https://github.com/codahale/blake2) implementation (which in turn is based on [official implementation](https://github.com/BLAKE2/BLAKE2). It does so by using [Go's Assembler](https://golang.org/doc/asm) for amd64 architectures with a fallback for other architectures. | ||||
| This package was initially based on the pure Go [BLAKE2b](https://github.com/dchest/blake2b) implementation by Dmitry Chestnykh and merged with the (`cgo`-dependent) AVX-optimized [BLAKE2](https://github.com/codahale/blake2) implementation (which in turn is based on the [official implementation](https://github.com/BLAKE2/BLAKE2)). It does so by using [Go's Assembler](https://golang.org/doc/asm) for amd64 architectures, with a Go-only fallback for other architectures. | ||||
| 
 | ||||
| It gives roughly a 3x performance improvement over the non-optimized go version. | ||||
| In addition to AVX, there is also support for AVX2 as well as SSE. Best performance is obtained with AVX2, which gives roughly a **4x** performance increase, approaching hashing speeds of **1 GB/s** on a single core. | ||||
| 
 | ||||
| Benchmarks | ||||
| ---------- | ||||
| 
 | ||||
| | Dura          |  1 GB | | ||||
| | ------------- |:-----:| | ||||
| | blake2b-SIMD  | 1.59s | | ||||
| | blake2b       | 4.66s | | ||||
| This is a summary of the performance improvements. Full details are shown below. | ||||
| 
 | ||||
| | Technology |  128K | | ||||
| | ---------- |:-----:| | ||||
| | AVX2       | 3.94x | | ||||
| | AVX        | 3.28x | | ||||
| | SSE        | 2.85x | | ||||
| 
 | ||||
| asm2plan9s | ||||
| ---------- | ||||
| 
 | ||||
| In order to be able to work more easily with AVX2/AVX instructions, a separate tool was developed to convert AVX2/AVX instructions into the corresponding BYTE sequence as accepted by Go assembly. See [asm2plan9s](https://github.com/fwessels/asm2plan9s) for more information. | ||||
| 
 | ||||
| bt2sum | ||||
| ------ | ||||
| 
 | ||||
| [bt2sum](https://github.com/s3git/bt2sum) is a utility that takes advantage of the BLAKE2b SIMD optimizations to compute checksums using the BLAKE2 Tree hashing mode in so-called 'unlimited fanout' mode. | ||||
| 
 | ||||
| Technical details | ||||
| ----------------- | ||||
| 
 | ||||
| BLAKE2b is a hashing algorithm that operates on 64-bit integer values. The AVX2 version uses the 256-bit wide YMM registers in order to essentially process four operations in parallel. AVX and SSE operate on 128-bit values simultaneously (two operations in parallel). Below are excerpts from `compressAvx2_amd64.s`, `compressAvx_amd64.s`, and `compress_generic.go` respectively. | ||||
| 
 | ||||
| ``` | ||||
|     VPADDQ  YMM0,YMM0,YMM1   /* v0 += v4, v1 += v5, v2 += v6, v3 += v7 */ | ||||
| ``` | ||||
| 
 | ||||
| ``` | ||||
|     VPADDQ  XMM0,XMM0,XMM2   /* v0 += v4, v1 += v5 */ | ||||
|     VPADDQ  XMM1,XMM1,XMM3   /* v2 += v6, v3 += v7 */ | ||||
| ``` | ||||
| 
 | ||||
| ``` | ||||
|     v0 += v4 | ||||
|     v1 += v5 | ||||
|     v2 += v6 | ||||
|     v3 += v7 | ||||
| ``` | ||||
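The correspondence between the vector and scalar forms can be checked with a small Go model. This is a sketch under stated assumptions, not code from the package: the `row` type models one YMM register as four 64-bit lanes, the names `gLanes` and `gScalar` are illustrative, and the mixing steps follow the BLAKE2b G function as defined in RFC 7693.

```go
package main

import (
	"fmt"
	"math/bits"
)

// row models one 256-bit YMM register as four independent 64-bit lanes.
type row [4]uint64

// add, xor and rotr apply the same operation to every lane.
func add(a, b row) row {
	for i := range a {
		a[i] += b[i]
	}
	return a
}

func xor(a, b row) row {
	for i := range a {
		a[i] ^= b[i]
	}
	return a
}

func rotr(a row, n int) row {
	for i := range a {
		a[i] = bits.RotateLeft64(a[i], -n)
	}
	return a
}

// gLanes mixes four columns at once. r0..r3 hold v0-v3, v4-v7, v8-v11 and
// v12-v15; x and y hold the message words for the two halves of G.
func gLanes(r0, r1, r2, r3, x, y row) (row, row, row, row) {
	r0 = add(add(r0, x), r1)
	r3 = rotr(xor(r3, r0), 32)
	r2 = add(r2, r3)
	r1 = rotr(xor(r1, r2), 24)
	r0 = add(add(r0, y), r1)
	r3 = rotr(xor(r3, r0), 16)
	r2 = add(r2, r3)
	r1 = rotr(xor(r1, r2), 63)
	return r0, r1, r2, r3
}

// gScalar is the scalar G function applied to a single column (a, b, c, d).
func gScalar(a, b, c, d, x, y uint64) (uint64, uint64, uint64, uint64) {
	a += b + x
	d = bits.RotateLeft64(d^a, -32)
	c += d
	b = bits.RotateLeft64(b^c, -24)
	a += b + y
	d = bits.RotateLeft64(d^a, -16)
	c += d
	b = bits.RotateLeft64(b^c, -63)
	return a, b, c, d
}

// lanesMatchScalar reports whether the four-lane result equals four scalar calls.
func lanesMatchScalar(r0, r1, r2, r3, x, y row) bool {
	o0, o1, o2, o3 := gLanes(r0, r1, r2, r3, x, y)
	for i := 0; i < 4; i++ {
		a, b, c, d := gScalar(r0[i], r1[i], r2[i], r3[i], x[i], y[i])
		if a != o0[i] || b != o1[i] || c != o2[i] || d != o3[i] {
			return false
		}
	}
	return true
}

func main() {
	r0, r1 := row{1, 2, 3, 4}, row{5, 6, 7, 8}
	r2, r3 := row{9, 10, 11, 12}, row{13, 14, 15, 16}
	x, y := row{17, 18, 19, 20}, row{21, 22, 23, 24}
	fmt.Println(lanesMatchScalar(r0, r1, r2, r3, x, y)) // true
}
```

With 128-bit registers the same model would use two lanes per register and twice as many operations, which is the difference between the AVX2 and AVX/SSE excerpts.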
| 
 | ||||
| Detailed benchmarks | ||||
| ------------------- | ||||
| 
 | ||||
| Example performance metrics were generated on an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (6 physical cores, 12 logical cores) running Ubuntu GNU/Linux with kernel version 4.4.0-24-generic (vanilla with no optimizations). | ||||
| 
 | ||||
| ### AVX2 | ||||
| 
 | ||||
| ``` | ||||
| $ benchcmp go.txt avx2.txt | ||||
| benchmark                old ns/op     new ns/op     delta | ||||
| BenchmarkHash64-12       1481          849           -42.67% | ||||
| BenchmarkHash128-12      1428          746           -47.76% | ||||
| BenchmarkHash1K-12       6379          2227          -65.09% | ||||
| BenchmarkHash8K-12       37219         11714         -68.53% | ||||
| BenchmarkHash32K-12      140716        35935         -74.46% | ||||
| BenchmarkHash128K-12     561656        142634        -74.60% | ||||
| 
 | ||||
| benchmark                old MB/s     new MB/s     speedup | ||||
| BenchmarkHash64-12       43.20        75.37        1.74x | ||||
| BenchmarkHash128-12      89.64        171.35       1.91x | ||||
| BenchmarkHash1K-12       160.52       459.69       2.86x | ||||
| BenchmarkHash8K-12       220.10       699.32       3.18x | ||||
| BenchmarkHash32K-12      232.87       911.85       3.92x | ||||
| BenchmarkHash128K-12     233.37       918.93       3.94x | ||||
| ``` | ||||
| 
 | ||||
| ### AVX2: Comparison to other hashing techniques | ||||
| 
 | ||||
| ``` | ||||
| $ go test -bench=Comparison | ||||
| BenchmarkComparisonMD5-12    	    1000	   1726121 ns/op	 607.48 MB/s | ||||
| BenchmarkComparisonSHA1-12   	     500	   2005164 ns/op	 522.94 MB/s | ||||
| BenchmarkComparisonSHA256-12 	     300	   5531036 ns/op	 189.58 MB/s | ||||
| BenchmarkComparisonSHA512-12 	     500	   3423030 ns/op	 306.33 MB/s | ||||
| BenchmarkComparisonBlake2B-12	    1000	   1232690 ns/op	 850.64 MB/s | ||||
| ``` | ||||
| 
 | ||||
| Benchmarks below were generated on a MacBook Pro with a 2.7 GHz Intel Core i7. | ||||
| 
 | ||||
| ### AVX | ||||
| 
 | ||||
| ``` | ||||
| $ benchcmp go.txt  avx.txt  | ||||
| benchmark               old ns/op     new ns/op     delta | ||||
| BenchmarkHash64-8       813           458           -43.67% | ||||
| BenchmarkHash128-8      766           401           -47.65% | ||||
| BenchmarkHash1K-8       4881          1763          -63.88% | ||||
| BenchmarkHash8K-8       36127         12273         -66.03% | ||||
| BenchmarkHash32K-8      140582        43155         -69.30% | ||||
| BenchmarkHash128K-8     567850        173246        -69.49% | ||||
| 
 | ||||
| benchmark               old MB/s     new MB/s     speedup | ||||
| BenchmarkHash64-8       78.63        139.57       1.78x | ||||
| BenchmarkHash128-8      166.98       318.73       1.91x | ||||
| BenchmarkHash1K-8       209.76       580.68       2.77x | ||||
| BenchmarkHash8K-8       226.76       667.46       2.94x | ||||
| BenchmarkHash32K-8      233.09       759.29       3.26x | ||||
| BenchmarkHash128K-8     230.82       756.56       3.28x | ||||
| ``` | ||||
| 
 | ||||
| ### SSE | ||||
| 
 | ||||
| ``` | ||||
| $ benchcmp go.txt sse.txt  | ||||
| benchmark               old ns/op     new ns/op     delta | ||||
| BenchmarkHash64-8       813           478           -41.21% | ||||
| BenchmarkHash128-8      766           411           -46.34% | ||||
| BenchmarkHash1K-8       4881          1870          -61.69% | ||||
| BenchmarkHash8K-8       36127         12427         -65.60% | ||||
| BenchmarkHash32K-8      140582        49512         -64.78% | ||||
| BenchmarkHash128K-8     567850        199040        -64.95% | ||||
| 
 | ||||
| benchmark               old MB/s     new MB/s     speedup | ||||
| BenchmarkHash64-8       78.63        133.78       1.70x | ||||
| BenchmarkHash128-8      166.98       311.23       1.86x | ||||
| BenchmarkHash1K-8       209.76       547.37       2.61x | ||||
| BenchmarkHash8K-8       226.76       659.20       2.91x | ||||
| BenchmarkHash32K-8      233.09       661.81       2.84x | ||||
| BenchmarkHash128K-8     230.82       658.52       2.85x | ||||
| ``` | ||||
| 
 | ||||
| License | ||||
| ------- | ||||
| 
 | ||||
| Released under the Apache License v2.0. You can find the complete text in the file LICENSE. | ||||
| 
 | ||||
| Contributing | ||||
| ------------ | ||||
| 
 | ||||
| Contributions are welcome; please send PRs for any enhancements. | ||||
|  | ||||
							
								
								
									
										46
									
								
								vendor/github.com/minio/blake2b-simd/compressAvx2_amd64.go
									
									
									
										generated
									
									
										vendored
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @ -0,0 +1,46 @@ | ||||
| //+build !noasm | ||||
| //+build !appengine | ||||
| 
 | ||||
| /* | ||||
|  * Minio Cloud Storage, (C) 2016 Minio, Inc. | ||||
|  * | ||||
|  * Licensed under the Apache License, Version 2.0 (the "License"); | ||||
|  * you may not use this file except in compliance with the License. | ||||
|  * You may obtain a copy of the License at | ||||
|  * | ||||
|  *     http://www.apache.org/licenses/LICENSE-2.0 | ||||
|  * | ||||
|  * Unless required by applicable law or agreed to in writing, software | ||||
|  * distributed under the License is distributed on an "AS IS" BASIS, | ||||
|  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
|  * See the License for the specific language governing permissions and | ||||
|  * limitations under the License. | ||||
|  */ | ||||
| 
 | ||||
| package blake2b | ||||
| 
 | ||||
| //go:noescape | ||||
| func compressAVX2Loop(p []uint8, in, iv, t, f, shffle, out []uint64) | ||||
| 
 | ||||
| func compressAVX2(d *digest, p []uint8) { | ||||
| 
 | ||||
| 	in := make([]uint64, 8) | ||||
| 	out := make([]uint64, 8) | ||||
| 
 | ||||
| 	shffle := make([]uint64, 8) | ||||
| 	// vector for PSHUFB instruction | ||||
| 	shffle[0] = 0x0201000706050403 | ||||
| 	shffle[1] = 0x0a09080f0e0d0c0b | ||||
| 	shffle[2] = 0x0201000706050403 | ||||
| 	shffle[3] = 0x0a09080f0e0d0c0b | ||||
| 	shffle[4] = 0x0100070605040302 | ||||
| 	shffle[5] = 0x09080f0e0d0c0b0a | ||||
| 	shffle[6] = 0x0100070605040302 | ||||
| 	shffle[7] = 0x09080f0e0d0c0b0a | ||||
| 
 | ||||
| 	in[0], in[1], in[2], in[3], in[4], in[5], in[6], in[7] = d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7] | ||||
| 
 | ||||
| 	compressAVX2Loop(p, in, iv[:], d.t[:], d.f[:], shffle, out) | ||||
| 
 | ||||
| 	d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7] = out[0], out[1], out[2], out[3], out[4], out[5], out[6], out[7] | ||||
| } | ||||
							
								
								
									
										671
									
								
								vendor/github.com/minio/blake2b-simd/compressAvx2_amd64.s
									
									
									
										generated
									
									
										vendored
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @ -0,0 +1,671 @@ | ||||
| //+build !noasm,!appengine | ||||
| 
 | ||||
| // | ||||
| // Minio Cloud Storage, (C) 2016 Minio, Inc. | ||||
| // | ||||
| // Licensed under the Apache License, Version 2.0 (the "License");
 | ||||
| // you may not use this file except in compliance with the License. | ||||
| // You may obtain a copy of the License at | ||||
| // | ||||
| //     http://www.apache.org/licenses/LICENSE-2.0 | ||||
| // | ||||
| // Unless required by applicable law or agreed to in writing, software | ||||
| // distributed under the License is distributed on an "AS IS" BASIS, | ||||
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
| // See the License for the specific language governing permissions and | ||||
| // limitations under the License. | ||||
| // | ||||
| 
 | ||||
| // | ||||
| // Based on AVX2 implementation from https://github.com/sneves/blake2-avx2/blob/master/blake2b-common.h | ||||
| // | ||||
| // Use github.com/fwessels/asm2plan9s on this file to assemble instructions to their Plan9 equivalent | ||||
| // | ||||
| // Assembly code below essentially follows the ROUND macro (see blake2b-round.h) which is defined as: | ||||
| //   #define ROUND(r) \ | ||||
| //     LOAD_MSG_ ##r ##_1(b0, b1); \
 | ||||
| //     G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
 | ||||
| //     LOAD_MSG_ ##r ##_2(b0, b1); \
 | ||||
| //     G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
 | ||||
| //     DIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h); \
 | ||||
| //     LOAD_MSG_ ##r ##_3(b0, b1); \
 | ||||
| //     G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
 | ||||
| //     LOAD_MSG_ ##r ##_4(b0, b1); \
 | ||||
| //     G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
 | ||||
| //     UNDIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h);
 | ||||
| // | ||||
| // as well as the go equivalent in https://github.com/dchest/blake2b/blob/master/block.go | ||||
| // | ||||
| // As in the macro, G1/G2 in the 1st and 2nd half are identical (so literal copy of assembly) | ||||
| // | ||||
| // Rounds are also the same, except for the loading of the message (and rounds 1 & 11 and | ||||
| // rounds 2 & 12 are identical) | ||||
| // | ||||
| 
 | ||||
| #define G1 \ | ||||
|     \ // G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1);
 | ||||
|     BYTE $0xc5; BYTE $0xfd; BYTE $0xd4; BYTE $0xc4             \ // VPADDQ  YMM0,YMM0,YMM4   /* v0 += m[0], v1 += m[2], v2 += m[4], v3 += m[6] */
 | ||||
|     BYTE $0xc5; BYTE $0xfd; BYTE $0xd4; BYTE $0xc1             \ // VPADDQ  YMM0,YMM0,YMM1   /* v0 += v4, v1 += v5, v2 += v6, v3 += v7 */
 | ||||
|     BYTE $0xc5; BYTE $0xe5; BYTE $0xef; BYTE $0xd8             \ // VPXOR   YMM3,YMM3,YMM0   /* v12 ^= v0, v13 ^= v1, v14 ^= v2, v15 ^= v3 */
 | ||||
|     BYTE $0xc5; BYTE $0xfd; BYTE $0x70; BYTE $0xdb; BYTE $0xb1 \ // VPSHUFD YMM3,YMM3,0xb1   /* v12 = v12<<(64-32) | v12>>32, ..., ..., v15 = v15<<(64-32) | v15>>32 */
 | ||||
|     BYTE $0xc5; BYTE $0xed; BYTE $0xd4; BYTE $0xd3             \ // VPADDQ  YMM2,YMM2,YMM3   /* v8 += v12, v9 += v13, v10 += v14, v11 += v15 */
 | ||||
|     BYTE $0xc5; BYTE $0xf5; BYTE $0xef; BYTE $0xca             \ // VPXOR   YMM1,YMM1,YMM2   /* v4 ^= v8, v5 ^= v9, v6 ^= v10, v7 ^= v11 */
 | ||||
|     BYTE $0xc4; BYTE $0xe2; BYTE $0x75; BYTE $0x00; BYTE $0xce   // VPSHUFB YMM1,YMM1,YMM6   /* v4 = v4<<(64-24) | v4>>24, ..., ..., v7 = v7<<(64-24) | v7>>24 */
 | ||||
| 
 | ||||
| #define G2 \ | ||||
|     BYTE $0xc5; BYTE $0xfd; BYTE $0xd4; BYTE $0xc5             \ // VPADDQ  YMM0,YMM0,YMM5   /* v0 += m[1], v1 += m[3], v2 += m[5], v3 += m[7] */
 | ||||
|     BYTE $0xc5; BYTE $0xfd; BYTE $0xd4; BYTE $0xc1             \ // VPADDQ  YMM0,YMM0,YMM1   /* v0 += v4, v1 += v5, v2 += v6, v3 += v7 */
 | ||||
|     BYTE $0xc5; BYTE $0xe5; BYTE $0xef; BYTE $0xd8             \ // VPXOR   YMM3,YMM3,YMM0   /* v12 ^= v0, v13 ^= v1, v14 ^= v2, v15 ^= v3 */
 | ||||
|     BYTE $0xc4; BYTE $0xe2; BYTE $0x65; BYTE $0x00; BYTE $0xdf \ // VPSHUFB YMM3,YMM3,YMM7   /* v12 = v12<<(64-16) | v12>>16, ..., ..., v15 = v15<<(64-16) | v15>>16 */
 | ||||
|     BYTE $0xc5; BYTE $0xed; BYTE $0xd4; BYTE $0xd3             \ // VPADDQ  YMM2,YMM2,YMM3   /* v8 += v12, v9 += v13, v10 += v14, v11 += v15 */
 | ||||
|     BYTE $0xc5; BYTE $0xf5; BYTE $0xef; BYTE $0xca             \ // VPXOR   YMM1,YMM1,YMM2   /* v4 ^= v8, v5 ^= v9, v6 ^= v10, v7 ^= v11 */
 | ||||
|     BYTE $0xc5; BYTE $0x75; BYTE $0xd4; BYTE $0xf9             \ // VPADDQ  YMM15,YMM1,YMM1  /* temp reg = reg*2   */
 | ||||
|     BYTE $0xc5; BYTE $0xf5; BYTE $0x73; BYTE $0xd1; BYTE $0x3f \ // VPSRLQ  YMM1,YMM1,0x3f   /*      reg = reg>>63 */
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x75; BYTE $0xef; BYTE $0xcf   // VPXOR   YMM1,YMM1,YMM15  /* ORed together: v4 = v4<<(64-63) | v4>>63, v5 = v5<<(64-63) | v5>>63 */
 | ||||
| 
 | ||||
| #define DIAGONALIZE \ | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xdb \ // VPERMQ YMM3, YMM3, 0x93
 | ||||
|                 BYTE $0x93                                     \ | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xd2 \ // VPERMQ YMM2, YMM2, 0x4e
 | ||||
|                 BYTE $0x4e                                     \ | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xc9 \ // VPERMQ YMM1, YMM1, 0x39
 | ||||
|                 BYTE $0x39                                     \ | ||||
|     // DO NOT DELETE -- macro delimiter (previous line extended) | ||||
| 
 | ||||
| #define UNDIAGONALIZE \ | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xdb \ // VPERMQ YMM3, YMM3, 0x39
 | ||||
|                 BYTE $0x39                                     \ | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xd2 \ // VPERMQ YMM2, YMM2, 0x4e
 | ||||
|                 BYTE $0x4e                                     \ | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xc9 \ // VPERMQ YMM1, YMM1, 0x93
 | ||||
|                 BYTE $0x93                                     \ | ||||
|     // DO NOT DELETE -- macro delimiter (previous line extended) | ||||
| 
 | ||||
| #define LOAD_SHUFFLE \ | ||||
|     MOVQ   shffle+120(FP), SI \ // SI: &shuffle | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x6f; BYTE $0x36             \ // VMOVDQU YMM6, [rsi]
 | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x6f; BYTE $0x7e; BYTE $0x20   // VMOVDQU YMM7, 32[rsi]
 | ||||
| 
 | ||||
| // func compressAVX2Loop(p []uint8, in, iv, t, f, shffle, out []uint64) | ||||
| TEXT ·compressAVX2Loop(SB), 7, $0 | ||||
| 
 | ||||
|     // REGISTER USE | ||||
|     //  Y0 - Y3: v0 - v15 | ||||
|     //  Y4 - Y5: m[0] - m[7] | ||||
|     //  Y6 - Y7: shuffle value | ||||
|     //  Y8 - Y9: temp registers | ||||
|     // Y10 -Y13: copy of full message | ||||
|     //      Y15: temp register | ||||
| 
 | ||||
|     // Load digest | ||||
|     MOVQ   in+24(FP),  SI     // SI: &in | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x6f; BYTE $0x06               // VMOVDQU YMM0, [rsi]
 | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x6f; BYTE $0x4e; BYTE $0x20   // VMOVDQU YMM1, 32[rsi]
 | ||||
| 
 | ||||
|     // Already store digest into &out (so we can reload it later generically) | ||||
|     MOVQ  out+144(FP), SI     // SI: &out | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x7f; BYTE $0x06               // VMOVDQU [rsi], YMM0
 | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x7f; BYTE $0x4e; BYTE $0x20   // VMOVDQU 32[rsi], YMM1
 | ||||
| 
 | ||||
|     // Initialize message pointer and loop counter | ||||
|     MOVQ   p_base+0(FP), DX   // DX: &p (message) | ||||
|     MOVQ   p_len+8(FP), R8    // R8: len(message) | ||||
|     SHRQ   $7, R8             // len(message) / 128 | ||||
|     CMPQ   R8, $0 | ||||
|     JEQ    complete | ||||
| 
 | ||||
| loop: | ||||
|     // Increment counter | ||||
|     MOVQ t+72(FP), SI         // SI: &t | ||||
|     MOVQ   0(SI), R9          // | ||||
|     ADDQ   $128, R9           //                       /* d.t[0] += BlockSize */ | ||||
|     MOVQ   R9, 0(SI)          // | ||||
|     CMPQ   R9, $128           //                       /* if d.t[0] < BlockSize { */ | ||||
|     JGE    noincr             // | ||||
|     MOVQ   8(SI), R9          // | ||||
|     ADDQ   $1, R9             //                       /*     d.t[1]++ */ | ||||
|     MOVQ   R9, 8(SI)          // | ||||
| noincr:                       //                       /* } */ | ||||
| 
 | ||||
|     // Load initialization vector | ||||
|     MOVQ iv+48(FP), SI        // SI: &iv | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x6f; BYTE $0x16               // VMOVDQU YMM2, [rsi]
 | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x6f; BYTE $0x5e; BYTE $0x20   // VMOVDQU YMM3, 32[rsi]
 | ||||
|     MOVQ t+72(FP), SI         // SI: &t | ||||
|     BYTE $0xc4; BYTE $0x63; BYTE $0x3d; BYTE $0x38; BYTE $0x06   // VINSERTI128 YMM8, YMM8, [rsi], 0  /* Y8 = t[0]+t[1] */
 | ||||
|                 BYTE $0x00 | ||||
|     MOVQ f+96(FP), SI         // SI: &f | ||||
|     BYTE $0xc4; BYTE $0x63; BYTE $0x3d; BYTE $0x38; BYTE $0x06   // VINSERTI128 YMM8, YMM8, [rsi], 1  /* Y8 = t[0]+t[1]+f[0]+f[1] */
 | ||||
|                 BYTE $0x01 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x65; BYTE $0xef; BYTE $0xd8   // VPXOR   YMM3,YMM3,YMM8            /* Y3 = Y3 ^ Y8 */
 | ||||
| 
 | ||||
|     BYTE $0xc5; BYTE $0x7e; BYTE $0x6f; BYTE $0x12               // VMOVDQU YMM10, [rdx]              /* Y10 =  m[0]+ m[1]+ m[2]+ m[3] */
 | ||||
|     BYTE $0xc5; BYTE $0x7e; BYTE $0x6f; BYTE $0x5a; BYTE $0x20   // VMOVDQU YMM11, 32[rdx]            /* Y11 =  m[4]+ m[5]+ m[6]+ m[7] */
 | ||||
|     BYTE $0xc5; BYTE $0x7e; BYTE $0x6f; BYTE $0x62; BYTE $0x40   // VMOVDQU YMM12, 64[rdx]            /* Y12 =  m[8]+ m[9]+m[10]+m[11] */
 | ||||
|     BYTE $0xc5; BYTE $0x7e; BYTE $0x6f; BYTE $0x6a; BYTE $0x60   // VMOVDQU YMM13, 96[rdx]            /* Y13 = m[12]+m[13]+m[14]+m[15] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   1 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x2d; BYTE $0x6c; BYTE $0xe3   // VPUNPCKLQDQ  YMM4, YMM10, YMM11 /* m[0], m[4], m[2], m[6] */
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x2d; BYTE $0x6d; BYTE $0xeb   // VPUNPCKHQDQ  YMM5, YMM10, YMM11 /* m[1], m[5], m[3], m[7] */
 | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xe4   // VPERMQ       YMM4, YMM4, 0xd8   /* 1101 1000 = 0xd8 */
 | ||||
|                 BYTE $0xd8 | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xed   // VPERMQ       YMM5, YMM5, 0xd8   /* 1101 1000 = 0xd8 */
 | ||||
|                 BYTE $0xd8 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x1d; BYTE $0x6c; BYTE $0xe5   // VPUNPCKLQDQ  YMM4, YMM12, YMM13 /* m[8], m[12], m[10], m[14] */
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x1d; BYTE $0x6d; BYTE $0xed   // VPUNPCKHQDQ  YMM5, YMM12, YMM13 /* m[9], m[13], m[11], m[15] */
 | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xe4   // VPERMQ       YMM4, YMM4, 0xd8   /* 1101 1000 = 0xd8 */
 | ||||
|                 BYTE $0xd8 | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xed   // VPERMQ       YMM5, YMM5, 0xd8   /* 1101 1000 = 0xd8 */
 | ||||
|                 BYTE $0xd8 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   2 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6c; BYTE $0xc5   // VPUNPCKLQDQ  YMM8, YMM11, YMM13        /*  m[4],  ____,  ____, m[14] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8,  YMM8, 0x03         /* m[14],  m[4],  ____,  ____ */ /* xxxx 0011 = 0x03 */
 | ||||
|                 BYTE $0x03 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6d; BYTE $0xcd   // VPUNPCKHQDQ  YMM9, YMM12, YMM13        /*  m[9], m[13],  ____,  ____ */
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4,  YMM8,  YMM9, 0x20  /* m[14],  m[4],  m[9], m[13] */ /* 0010 0000 = 0x20 */
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc4   // VPERMQ       YMM8,  YMM12, 0x02        /* m[10],  m[8],  ____,  ____ */ /* xxxx 0010 = 0x02 */
 | ||||
|                 BYTE $0x02 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcd   // VPERMQ       YMM9,  YMM13, 0x30        /*  ____,  ____, m[15],  ____ */ /* xx11 xxxx = 0x30 */
 | ||||
|                 BYTE $0x30 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x35; BYTE $0x6c; BYTE $0xcb   // VPUNPCKLQDQ  YMM9,   YMM9, YMM11       /*  ____,  ____, m[15],  m[6] */
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5,   YMM8, YMM9, 0x30  /* m[10],  m[8], m[15],  m[6] */ /* 0011 0000 = 0x30 */
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc2   // VPERMQ       YMM8, YMM10, 0x01         /*  m[1],  m[0],  ____,  ____ */ /* xxxx 0001 = 0x01 */
 | ||||
|                 BYTE $0x01 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xcc   // VPUNPCKHQDQ  YMM9, YMM11, YMM12        /*  m[5],  ____,  ____, m[11] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9,  YMM9, 0x03         /* m[11],  m[5],  ____,  ____ */ /* xxxx 0011 = 0x03 */
 | ||||
|                 BYTE $0x03 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4,  YMM8, YMM9, 0x20   /*  m[1],  m[0], m[11],  m[5] */ /* 0010 0000 = 0x20 */
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc5   // VPUNPCKLQDQ  YMM8, YMM10, YMM13        /*  ___,  m[12],  m[2],  ____ */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8,  YMM8, 0x09         /* m[12],  m[2],  ____,  ____ */ /* xxxx 1001 = 0x09 */
 | ||||
|                 BYTE $0x09 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xca   // VPUNPCKHQDQ  YMM9, YMM11, YMM10        /*  ____,  ____,  m[7],  m[3] */
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5,  YMM8, YMM9, 0x30   /* m[12],  m[2],  m[7],  m[3] */ /* 0011 0000 = 0x30 */
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   3 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc5   // VPERMQ       YMM8, YMM13, 0x00
 | ||||
|                 BYTE $0x00 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6d; BYTE $0xc0   // VPUNPCKHQDQ  YMM8, YMM12, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xcd   // VPUNPCKHQDQ  YMM9, YMM11, YMM13
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9, YMM9, 0x0c
 | ||||
|                 BYTE $0x0c | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x21
 | ||||
|                 BYTE $0x21 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6c; BYTE $0xc2   // VPUNPCKLQDQ  YMM8, YMM12, YMM10
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcd   // VPERMQ       YMM9, YMM13, 0x55
 | ||||
|                 BYTE $0x55 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc9   // VPUNPCKLQDQ  YMM9, YMM10, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x30
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc2   // VPERMQ       YMM8, YMM10, 0xff
 | ||||
|                 BYTE $0xff | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6c; BYTE $0xc0   // VPUNPCKLQDQ  YMM8, YMM12, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xcc   // VPUNPCKHQDQ  YMM9, YMM11, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9, YMM9, 0x60
 | ||||
|                 BYTE $0x60 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x31
 | ||||
|                 BYTE $0x31 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6c; BYTE $0xc3   // VPUNPCKLQDQ  YMM8, YMM13, YMM11
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcb   // VPERMQ       YMM9, YMM11, 0x00
 | ||||
|                 BYTE $0x00 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6d; BYTE $0xc9   // VPUNPCKHQDQ  YMM9, YMM10, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x21
 | ||||
|                 BYTE $0x21 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   4 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xc2   // VPUNPCKHQDQ  YMM8, YMM11, YMM10
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6d; BYTE $0xcc   // VPUNPCKHQDQ  YMM9, YMM13, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9, YMM9, 0x0c
 | ||||
|                 BYTE $0x0c | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x21
 | ||||
|                 BYTE $0x21 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6d; BYTE $0xc2   // VPUNPCKHQDQ  YMM8, YMM12, YMM10
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcd   // VPERMQ       YMM9, YMM13, 0x08
 | ||||
|                 BYTE $0x08 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x20
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc3   // VPERMQ       YMM8, YMM11, 0x55
 | ||||
|                 BYTE $0x55 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc0   // VPUNPCKLQDQ  YMM8, YMM10, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcd   // VPERMQ       YMM9, YMM13, 0xff
 | ||||
|                 BYTE $0xff | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6c; BYTE $0xc9   // VPUNPCKLQDQ  YMM9, YMM11, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x21
 | ||||
|                 BYTE $0x21 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6c; BYTE $0xc4   // VPUNPCKLQDQ  YMM8, YMM11, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xcc   // VPUNPCKLQDQ  YMM9, YMM10, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x21
 | ||||
|                 BYTE $0x21 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   5 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6d; BYTE $0xc3   // VPUNPCKHQDQ  YMM8, YMM12, YMM11
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xcc   // VPUNPCKLQDQ  YMM9, YMM10, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x30
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc3   // VPERMQ       YMM8, YMM11, 0xff
 | ||||
|                 BYTE $0xff | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc0   // VPUNPCKLQDQ  YMM8, YMM10, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcd   // VPERMQ       YMM9, YMM13, 0xff
 | ||||
|                 BYTE $0xff | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6c; BYTE $0xc9   // VPUNPCKLQDQ  YMM9, YMM11, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x20
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc4   // VPERMQ       YMM8, YMM12, 0xff
 | ||||
|                 BYTE $0xff | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6c; BYTE $0xc0   // VPUNPCKLQDQ  YMM8, YMM13, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xca   // VPERMQ       YMM9, YMM10, 0xff
 | ||||
|                 BYTE $0xff | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6c; BYTE $0xc9   // VPUNPCKLQDQ  YMM9, YMM11, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x31
 | ||||
|                 BYTE $0x31 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc5   // VPERMQ       YMM8, YMM13, 0x00
 | ||||
|                 BYTE $0x00 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6d; BYTE $0xc0   // VPUNPCKHQDQ  YMM8, YMM10, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcd   // VPERMQ       YMM9, YMM13, 0x55
 | ||||
|                 BYTE $0x55 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6c; BYTE $0xc9   // VPUNPCKLQDQ  YMM9, YMM12, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x20
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   6 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc3   // VPUNPCKLQDQ  YMM8, YMM10, YMM11
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xcc   // VPUNPCKLQDQ  YMM9, YMM10, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x21
 | ||||
|                 BYTE $0x21 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6c; BYTE $0xc4   // VPUNPCKLQDQ  YMM8, YMM13, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8, YMM8, 0x0c
 | ||||
|                 BYTE $0x0c | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6d; BYTE $0xca   // VPUNPCKHQDQ  YMM9, YMM12, YMM10
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x30
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc3   // VPERMQ       YMM8, YMM11, 0x0c
 | ||||
|                 BYTE $0x0c | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6d; BYTE $0xca   // VPUNPCKHQDQ  YMM9, YMM13, YMM10
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9, YMM9, 0x60
 | ||||
|                 BYTE $0x60 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x30
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6d; BYTE $0xc3   // VPUNPCKHQDQ  YMM8, YMM13, YMM11
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcc   // VPERMQ       YMM9, YMM12, 0x55
 | ||||
|                 BYTE $0x55 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6c; BYTE $0xc9   // VPUNPCKLQDQ  YMM9, YMM13, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x30
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   7 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc2   // VPERMQ       YMM8, YMM10, 0x55
 | ||||
|                 BYTE $0x55 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6c; BYTE $0xc0   // VPUNPCKLQDQ  YMM8, YMM13, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6c; BYTE $0xcb   // VPUNPCKLQDQ  YMM9, YMM13, YMM11
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9, YMM9, 0x60
 | ||||
|                 BYTE $0x60 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x30
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xc5   // VPUNPCKHQDQ  YMM8, YMM11, YMM13
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8, YMM8, 0x0c
 | ||||
|                 BYTE $0x0c | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcc   // VPERMQ       YMM9, YMM12, 0xaa
 | ||||
|                 BYTE $0xaa | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6d; BYTE $0xc9   // VPUNPCKHQDQ  YMM9, YMM13, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x20
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc3   // VPUNPCKLQDQ  YMM8, YMM10, YMM11
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8, YMM8, 0x0c
 | ||||
|                 BYTE $0x0c | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcc   // VPERMQ       YMM9, YMM12, 0x01
 | ||||
|                 BYTE $0x01 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x20
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xc2   // VPUNPCKHQDQ  YMM8, YMM11, YMM10
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcc   // VPERMQ       YMM9, YMM12, 0xff
 | ||||
|                 BYTE $0xff | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc9   // VPUNPCKLQDQ  YMM9, YMM10, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x31
 | ||||
|                 BYTE $0x31 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   8 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6d; BYTE $0xc3   // VPUNPCKHQDQ  YMM8, YMM13, YMM11
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8, YMM8, 0x0c
 | ||||
|                 BYTE $0x0c | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xca   // VPERMQ       YMM9, YMM10, 0xff
 | ||||
|                 BYTE $0xff | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6c; BYTE $0xc9   // VPUNPCKLQDQ  YMM9, YMM13, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x20
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc5   // VPERMQ       YMM8, YMM13, 0xaa
 | ||||
|                 BYTE $0xaa | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6d; BYTE $0xc0   // VPUNPCKHQDQ  YMM8, YMM12, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6d; BYTE $0xcc   // VPUNPCKHQDQ  YMM9, YMM10, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x21
 | ||||
|                 BYTE $0x21 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xc5   // VPUNPCKHQDQ  YMM8, YMM11, YMM13
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8, YMM8, 0x0c
 | ||||
|                 BYTE $0x0c | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6c; BYTE $0xca   // VPUNPCKLQDQ  YMM9, YMM12, YMM10
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9, YMM9, 0x0c
 | ||||
|                 BYTE $0x0c | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x20
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc3   // VPUNPCKLQDQ  YMM8, YMM10, YMM11
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6c; BYTE $0xcc   // VPUNPCKLQDQ  YMM9, YMM11, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x30
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   9 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6c; BYTE $0xc5   // VPUNPCKLQDQ  YMM8, YMM11, YMM13
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xca   // VPERMQ       YMM9, YMM10, 0x00
 | ||||
|                 BYTE $0x00 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6d; BYTE $0xc9   // VPUNPCKHQDQ  YMM9, YMM12, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x31
 | ||||
|                 BYTE $0x31 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6d; BYTE $0xc4   // VPUNPCKHQDQ  YMM8, YMM13, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8, YMM8, 0x60
 | ||||
|                 BYTE $0x60 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcc   // VPERMQ       YMM9, YMM12, 0x00
 | ||||
|                 BYTE $0x00 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6d; BYTE $0xc9   // VPUNPCKHQDQ  YMM9, YMM10, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x31
 | ||||
|                 BYTE $0x31 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcc   // VPERMQ       YMM9, YMM12, 0xaa
 | ||||
|                 BYTE $0xaa | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6d; BYTE $0xc9   // VPUNPCKHQDQ  YMM9, YMM10, YMM9
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x15; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM13, YMM9, 0x20
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc3   // VPERMQ       YMM8, YMM11, 0xff
 | ||||
|                 BYTE $0xff | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc0   // VPUNPCKLQDQ  YMM8, YMM10, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcb   // VPERMQ       YMM9, YMM11, 0x04
 | ||||
|                 BYTE $0x04 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x21
 | ||||
|                 BYTE $0x21 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   1 0 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc4   // VPERMQ       YMM8, YMM12, 0x20
 | ||||
|                 BYTE $0x20 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xca   // VPUNPCKHQDQ  YMM9, YMM11, YMM10
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9, YMM9, 0x60
 | ||||
|                 BYTE $0x60 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x31
 | ||||
|                 BYTE $0x31 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc3   // VPUNPCKLQDQ  YMM8, YMM10, YMM11
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8, YMM8, 0x60
 | ||||
|                 BYTE $0x60 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcb   // VPERMQ       YMM9, YMM11, 0x60
 | ||||
|                 BYTE $0x60 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x31
 | ||||
|                 BYTE $0x31 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6d; BYTE $0xc4   // VPUNPCKHQDQ  YMM8, YMM13, YMM12
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8, YMM8, 0x60
 | ||||
|                 BYTE $0x60 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6d; BYTE $0xcd   // VPUNPCKHQDQ  YMM9, YMM10, YMM13
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9, YMM9, 0x60
 | ||||
|                 BYTE $0x60 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4, YMM8, YMM9, 0x31
 | ||||
|                 BYTE $0x31 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc5   // VPERMQ       YMM8, YMM13, 0xaa
 | ||||
|                 BYTE $0xaa | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6d; BYTE $0xc0   // VPUNPCKHQDQ  YMM8, YMM12, YMM8
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x15; BYTE $0x6c; BYTE $0xca   // VPUNPCKLQDQ  YMM9, YMM13, YMM10
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5, YMM8, YMM9, 0x21
 | ||||
|                 BYTE $0x21 | ||||
|                                                                   | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   1 1 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x2d; BYTE $0x6c; BYTE $0xe3   // VPUNPCKLQDQ  YMM4, YMM10, YMM11 /* m[0], m[4], m[2], m[6] */
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x2d; BYTE $0x6d; BYTE $0xeb   // VPUNPCKHQDQ  YMM5, YMM10, YMM11 /* m[1], m[5], m[3], m[7] */
 | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xe4   // VPERMQ       YMM4, YMM4, 0xd8   /* 1101 1000 = 0xd8 */
 | ||||
|                 BYTE $0xd8 | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xed   // VPERMQ       YMM5, YMM5, 0xd8   /* 1101 1000 = 0xd8 */
 | ||||
|                 BYTE $0xd8 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x1d; BYTE $0x6c; BYTE $0xe5   // VPUNPCKLQDQ  YMM4, YMM12, YMM13 /* m[8], m[12], m[10], m[14] */
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x1d; BYTE $0x6d; BYTE $0xed   // VPUNPCKHQDQ  YMM5, YMM12, YMM13 /* m[9], m[13], m[11], m[15] */
 | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xe4   // VPERMQ       YMM4, YMM4, 0xd8   /* 1101 1000 = 0xd8 */
 | ||||
|                 BYTE $0xd8 | ||||
|     BYTE $0xc4; BYTE $0xe3; BYTE $0xfd; BYTE $0x00; BYTE $0xed   // VPERMQ       YMM5, YMM5, 0xd8   /* 1101 1000 = 0xd8 */
 | ||||
|                 BYTE $0xd8 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   1 2 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6c; BYTE $0xc5   // VPUNPCKLQDQ  YMM8, YMM11, YMM13        /*  m[4],  ____,  ____, m[14] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8,  YMM8, 0x03         /* m[14],  m[4],  ____,  ____ */ /* xxxx 0011 = 0x03 */
 | ||||
|                 BYTE $0x03 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x1d; BYTE $0x6d; BYTE $0xcd   // VPUNPCKHQDQ  YMM9, YMM12, YMM13        /*  m[9], m[13],  ____,  ____ */
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4,  YMM8,  YMM9, 0x20  /* m[14],  m[4],  m[9], m[13] */ /* 0010 0000 = 0x20 */
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc4   // VPERMQ       YMM8,  YMM12, 0x02        /* m[10],  m[8],  ____,  ____ */ /* xxxx 0010 = 0x02 */
 | ||||
|                 BYTE $0x02 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xcd   // VPERMQ       YMM9,  YMM13, 0x30        /*  ____,  ____, m[15],  ____ */ /* xx11 xxxx = 0x30 */
 | ||||
|                 BYTE $0x30 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x35; BYTE $0x6c; BYTE $0xcb   // VPUNPCKLQDQ  YMM9,   YMM9, YMM11       /*  ____,  ____, m[15],  m[6] */
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5,   YMM8, YMM9, 0x30  /* m[10],  m[8], m[15],  m[6] */ /* 0011 0000 = 0x30 */
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc2   // VPERMQ       YMM8, YMM10, 0x01         /*  m[1],  m[0],  ____,  ____ */ /* xxxx 0001 = 0x01 */
 | ||||
|                 BYTE $0x01 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xcc   // VPUNPCKHQDQ  YMM9, YMM11, YMM12        /*  m[5],  ____,  ____, m[11] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc9   // VPERMQ       YMM9,  YMM9, 0x03         /* m[11],  m[5],  ____,  ____ */ /* xxxx 0011 = 0x03 */
 | ||||
|                 BYTE $0x03 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe1   // VPERM2I128   YMM4,  YMM8, YMM9, 0x20   /*  m[1],  m[0], m[11],  m[5] */ /* 0010 0000 = 0x20 */
 | ||||
|                 BYTE $0x20 | ||||
| 
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x2d; BYTE $0x6c; BYTE $0xc5   // VPUNPCKLQDQ  YMM8, YMM10, YMM13        /*  ____, m[12],  m[2],  ____ */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0xfd; BYTE $0x00; BYTE $0xc0   // VPERMQ       YMM8,  YMM8, 0x09         /* m[12],  m[2],  ____,  ____ */ /* xxxx 1001 = 0x09 */
 | ||||
|                 BYTE $0x09 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x25; BYTE $0x6d; BYTE $0xca   // VPUNPCKHQDQ  YMM9, YMM11, YMM10        /*  ____,  ____,  m[7],  m[3] */
 | ||||
|     BYTE $0xc4; BYTE $0xc3; BYTE $0x3d; BYTE $0x46; BYTE $0xe9   // VPERM2I128   YMM5,  YMM8, YMM9, 0x30   /* m[12],  m[2],  m[7],  m[3] */ /* 0011 0000 = 0x30 */
 | ||||
|                 BYTE $0x30 | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     // Reload digest (most current value is stored in &out) | ||||
|     MOVQ  out+144(FP),  SI    // SI: &out | ||||
|     BYTE $0xc5; BYTE $0x7e; BYTE $0x6f; BYTE $0x26               // VMOVDQU YMM12, [rsi]
 | ||||
|     BYTE $0xc5; BYTE $0x7e; BYTE $0x6f; BYTE $0x6e; BYTE $0x20   // VMOVDQU YMM13, 32[rsi]
 | ||||
| 
 | ||||
|     BYTE $0xc5; BYTE $0xfd; BYTE $0xef; BYTE $0xc2               // VPXOR   YMM0,YMM0,YMM2   /* X0 = X0 ^ X4,  X1 = X1 ^ X5 */
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x7d; BYTE $0xef; BYTE $0xc4   // VPXOR   YMM0,YMM0,YMM12  /* X0 = X0 ^ X12, X1 = X1 ^ X13 */
 | ||||
|     BYTE $0xc5; BYTE $0xf5; BYTE $0xef; BYTE $0xcb               // VPXOR   YMM1,YMM1,YMM3   /* X2 = X2 ^ X6,  X3 = X3 ^ X7 */
 | ||||
|     BYTE $0xc4; BYTE $0xc1; BYTE $0x75; BYTE $0xef; BYTE $0xcd   // VPXOR   YMM1,YMM1,YMM13  /* X2 = X2 ^ X14, X3 = X3 ^ X15 */
 | ||||
| 
 | ||||
|     // Store digest into &out | ||||
|     MOVQ  out+144(FP), SI     // SI: &out | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x7f; BYTE $0x06               // VMOVDQU [rsi], YMM0
 | ||||
|     BYTE $0xc5; BYTE $0xfe; BYTE $0x7f; BYTE $0x4e; BYTE $0x20   // VMOVDQU 32[rsi], YMM1
 | ||||
| 
 | ||||
|     // Increment message pointer and check if there's more to do | ||||
|     ADDQ   $128, DX           // message += 128 | ||||
|     SUBQ   $1, R8 | ||||
|     JNZ    loop | ||||
| 
 | ||||
| complete: | ||||
|     BYTE $0xc5; BYTE $0xf8; BYTE $0x77                           // VZEROUPPER  /* Avoid AVX-to-SSE transition penalties */
 | ||||
|     RET | ||||
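The VPXOR sequence before the store folds the working state back into the digest. A minimal Go sketch of that step, assuming `v` holds the 16 working words after the last round and `h` is the digest reloaded from `out` (`feedForward` and both parameter names are hypothetical, not part of the package):

```go
package main

import "fmt"

// feedForward folds the 16-word working state v into the 8-word
// chaining value h: h[i] ^= v[i] ^ v[i+8]. This mirrors the four
// VPXOR instructions, which process four 64-bit words at a time.
func feedForward(h *[8]uint64, v *[16]uint64) {
	for i := range h {
		h[i] ^= v[i] ^ v[i+8]
	}
}

func main() {
	var h [8]uint64
	var v [16]uint64
	v[0], v[8] = 0xff, 0x0f
	feedForward(&h, &v)
	fmt.Printf("%#x\n", h[0])
}
```

The assembly performs the same computation per block inside the loop, so the value stored to `out` becomes the chaining value for the next 128-byte block.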
| 
 | ||||
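The message loads in each round are built from three lane operations: VPUNPCKLQDQ/VPUNPCKHQDQ, VPERMQ and VPERM2I128. The following Go sketch models the first two on four 64-bit lanes and reproduces the round 11 column load. It assumes, as the comments in the listing suggest, that YMM10 holds m[0..3] and YMM11 holds m[4..7]; the type and function names are illustrative only.

```go
package main

import "fmt"

// vec models a 256-bit YMM register as four 64-bit lanes, lane 0 lowest.
type vec [4]uint64

// unpackLo models VPUNPCKLQDQ: within each 128-bit half, take the
// low qword of a followed by the low qword of b.
func unpackLo(a, b vec) vec { return vec{a[0], b[0], a[2], b[2]} }

// unpackHi models VPUNPCKHQDQ: same as unpackLo, but with the high qwords.
func unpackHi(a, b vec) vec { return vec{a[1], b[1], a[3], b[3]} }

// permQ models VPERMQ: each 2-bit field of imm, starting at the low
// bits, selects the source lane for one destination lane.
func permQ(a vec, imm uint8) vec {
	var d vec
	for i := range d {
		d[i] = a[(imm>>(2*uint(i)))&3]
	}
	return d
}

// loadRound11Columns mirrors the first four instructions of round 11.
func loadRound11Columns(m [16]uint64) (vec, vec) {
	y10 := vec{m[0], m[1], m[2], m[3]} // assumed contents of YMM10
	y11 := vec{m[4], m[5], m[6], m[7]} // assumed contents of YMM11
	return permQ(unpackLo(y10, y11), 0xd8), permQ(unpackHi(y10, y11), 0xd8)
}

func main() {
	var m [16]uint64
	for i := range m {
		m[i] = uint64(i) // use the index as the value to make lanes visible
	}
	a, b := loadRound11Columns(m)
	fmt.Println(a, b)
}
```

With the immediate 0xd8 (binary 1101 1000) the two middle lanes swap, which turns m[0], m[4], m[2], m[6] into m[0], m[2], m[4], m[6], the order the column step of that round expects.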
							
								
								
									
								vendor/github.com/minio/blake2b-simd/compressAvx_amd64.go (40 lines added; generated, vendored, normal file)
									
								
							
							
						
						
									
									
								
							| @ -0,0 +1,40 @@ | ||||
| //+build !noasm | ||||
| //+build !appengine | ||||
| 
 | ||||
| /* | ||||
|  * Minio Cloud Storage, (C) 2016 Minio, Inc. | ||||
|  * | ||||
|  * Licensed under the Apache License, Version 2.0 (the "License"); | ||||
|  * you may not use this file except in compliance with the License. | ||||
|  * You may obtain a copy of the License at | ||||
|  * | ||||
|  *     http://www.apache.org/licenses/LICENSE-2.0 | ||||
|  * | ||||
|  * Unless required by applicable law or agreed to in writing, software | ||||
|  * distributed under the License is distributed on an "AS IS" BASIS, | ||||
|  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
|  * See the License for the specific language governing permissions and | ||||
|  * limitations under the License. | ||||
|  */ | ||||
| 
 | ||||
| package blake2b | ||||
| 
 | ||||
| //go:noescape | ||||
| func blockAVXLoop(p []uint8, in, iv, t, f, shffle, out []uint64) | ||||
| 
 | ||||
| func compressAVX(d *digest, p []uint8) { | ||||
| 
 | ||||
| 	in := make([]uint64, 8, 8) | ||||
| 	out := make([]uint64, 8, 8) | ||||
| 
 | ||||
| 	shffle := make([]uint64, 2, 2) | ||||
| 	// vector for PSHUFB instruction | ||||
| 	shffle[0] = 0x0201000706050403 | ||||
| 	shffle[1] = 0x0a09080f0e0d0c0b | ||||
| 
 | ||||
| 	in[0], in[1], in[2], in[3], in[4], in[5], in[6], in[7] = d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7] | ||||
| 
 | ||||
| 	blockAVXLoop(p, in, iv[:], d.t[:], d.f[:], shffle, out) | ||||
| 
 | ||||
| 	d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7] = out[0], out[1], out[2], out[3], out[4], out[5], out[6], out[7] | ||||
| } | ||||
| @ -119,10 +119,12 @@ | ||||
|     MOVQ   shffle+120(FP), SI \ // SI: &shuffle | ||||
|     MOVOU  0(SI), X12           // X12 = 03040506 07000102 0b0c0d0e 0f08090a | ||||
| 
 | ||||
| 
 | ||||
| // func blockAVX(p []uint8, in, iv, t, f, shffle, out []uint64) | ||||
| TEXT ·blockAVX(SB), 7, $0 | ||||
| // func blockAVXLoop(p []uint8, in, iv, t, f, shffle, out []uint64) | ||||
| TEXT ·blockAVXLoop(SB), 7, $0 | ||||
|     // REGISTER USE | ||||
|     //        R8: loop counter | ||||
|     //        DX: message pointer | ||||
|     //        SI: temp pointer for loading | ||||
|     //  X0 -  X7: v0 - v15 | ||||
|     //  X8 - X11: m[0] - m[7] | ||||
|     //       X12: shuffle value | ||||
| @ -135,16 +137,43 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     MOVOU  32(SI), X2         // X2 = in[4]+in[5]      /* row2l = LOAD( &S->h[4] ); */
 | ||||
|     MOVOU  48(SI), X3         // X3 = in[6]+in[7]      /* row2h = LOAD( &S->h[6] ); */
 | ||||
| 
 | ||||
|     // Load initialization vector | ||||
|     MOVQ iv+48(FP), DX        // DX: &iv | ||||
|     MOVOU   0(DX), X4         // X4 = iv[0]+iv[1]      /* row3l = LOAD( &blake2b_IV[0] ); */
 | ||||
|     MOVOU  16(DX), X5         // X5 = iv[2]+iv[3]      /* row3h = LOAD( &blake2b_IV[2] ); */
 | ||||
|     // Store the digest into &out up front (so it can be reloaded generically later) | ||||
|     MOVQ  out+144(FP), SI     // SI: &out | ||||
|     MOVOU  X0,  0(SI)         // out[0]+out[1] = X0 | ||||
|     MOVOU  X1, 16(SI)         // out[2]+out[3] = X1 | ||||
|     MOVOU  X2, 32(SI)         // out[4]+out[5] = X2 | ||||
|     MOVOU  X3, 48(SI)         // out[6]+out[7] = X3 | ||||
| 
 | ||||
|     // Initialize message pointer and loop counter | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVQ   message_len+8(FP), R8 // R8: len(message) | ||||
|     SHRQ   $7, R8             // len(message) / 128 | ||||
|     CMPQ   R8, $0 | ||||
|     JEQ    complete | ||||
| 
 | ||||
| loop: | ||||
|     // Increment counter | ||||
|     MOVQ t+72(FP), SI         // SI: &t | ||||
|     MOVOU  32(DX), X6         // X6 = iv[4]+iv[5]      /*                        LOAD( &blake2b_IV[4] )                      */ | ||||
|     MOVOU   0(SI), X7         // X7 = t[0]+t[1]        /*                                                LOAD( &S->t[0] )    */ | ||||
|     PXOR       X7, X6         // X6 = X6 ^ X7          /* row4l = _mm_xor_si128(                       ,                  ); */
 | ||||
|     MOVQ   0(SI), R9          // | ||||
|     ADDQ   $128, R9           //                       /* d.t[0] += BlockSize */ | ||||
|     MOVQ   R9, 0(SI)          // | ||||
|     CMPQ   R9, $128           //                       /* if d.t[0] < BlockSize { */ | ||||
|     JGE    noincr             // | ||||
|     MOVQ   8(SI), R9          // | ||||
|     ADDQ   $1, R9             //                       /*     d.t[1]++ */ | ||||
|     MOVQ   R9, 8(SI)          // | ||||
| noincr:                       //                       /* } */ | ||||
| 
 | ||||
|     // Load initialization vector | ||||
|     MOVQ iv+48(FP), SI        // SI: &iv | ||||
|     MOVOU   0(SI), X4         // X4 = iv[0]+iv[1]      /* row3l = LOAD( &blake2b_IV[0] ); */
 | ||||
|     MOVOU  16(SI), X5         // X5 = iv[2]+iv[3]      /* row3h = LOAD( &blake2b_IV[2] ); */
 | ||||
|     MOVOU  32(SI), X6         // X6 = iv[4]+iv[5]      /*                        LOAD( &blake2b_IV[4] )                      */ | ||||
|     MOVOU  48(SI), X7         // X7 = iv[6]+iv[7]      /*                        LOAD( &blake2b_IV[6] )                      */ | ||||
|     MOVQ t+72(FP), SI         // SI: &t | ||||
|     MOVOU   0(SI), X8         // X8 = t[0]+t[1]        /*                                                LOAD( &S->t[0] )    */ | ||||
|     PXOR       X8, X6         // X6 = X6 ^ X8          /* row4l = _mm_xor_si128(                       ,                  ); */
 | ||||
|     MOVQ f+96(FP), SI         // SI: &f | ||||
|     MOVOU  48(DX), X7         // X7 = iv[6]+iv[7]      /*                        LOAD( &blake2b_IV[6] )                      */ | ||||
|     MOVOU   0(SI), X8         // X8 = f[0]+f[1]        /*                                                LOAD( &S->f[0] )    */ | ||||
|     PXOR       X8, X7         // X7 = X7 ^ X8          /* row4h = _mm_xor_si128(                       ,                  ); */
 | ||||
| 
 | ||||
| @ -155,7 +184,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 = m[0]+m[1] | ||||
|     MOVOU  16(DX), X13        // X13 = m[2]+m[3] | ||||
|     MOVOU  32(DX), X14        // X14 = m[4]+m[5] | ||||
| @ -175,7 +203,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  64(DX), X12        // X12 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
| @ -199,7 +226,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU 112(DX), X12        // X12 = m[14]+m[15] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
| @ -209,7 +235,8 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     MOVOU  48(DX), X15        // X15 =  m[6]+ m[7] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x11; BYTE $0x6c; BYTE $0xd6   // VPUNPCKLQDQ XMM10, XMM13, XMM14  /* m[10],  m[8] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x01; BYTE $0x0f; BYTE $0xdc; BYTE $0x08// VPALIGNR    XMM11, XMM15, XMM12, 0x8  /* m[15],  m[6] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x01; BYTE $0x0f; BYTE $0xdc   // VPALIGNR    XMM11, XMM15, XMM12, 0x8  /* m[15],  m[6] */
 | ||||
|                 BYTE $0x08 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
| @ -221,11 +248,11 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x19; BYTE $0x0f; BYTE $0xc4; BYTE $0x08 // VPALIGNR     XMM8, XMM12, XMM12, 0x8  /*  m[1],  m[0] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x19; BYTE $0x0f; BYTE $0xc4   // VPALIGNR     XMM8, XMM12, XMM12, 0x8  /*  m[1],  m[0] */
 | ||||
|                 BYTE $0x08 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x09; BYTE $0x6d; BYTE $0xcd   // VPUNPCKHQDQ  XMM9, XMM14, XMM13  /* m[11], m[5] */
 | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
| @ -247,12 +274,12 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  32(DX), X12        // X12 =  m[4]+ m[5] | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xc5; BYTE $0x08// VPALIGNR     XMM8, XMM14, XMM13, 0x8  /* m[11],  m[12] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xc5   // VPALIGNR     XMM8, XMM14, XMM13, 0x8  /* m[11],  m[12] */
 | ||||
|                 BYTE $0x08 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x19; BYTE $0x6d; BYTE $0xcf   // VPUNPCKHQDQ  XMM9, XMM12, XMM15  /*  m[5], m[15] */
 | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  16(DX), X13        // X13 =  m[2]+ m[3] | ||||
| @ -271,7 +298,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
| @ -283,7 +309,8 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     MOVOU  32(DX), X14        // X14 =  m[4]+ m[5] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x01; BYTE $0x6c; BYTE $0xd5   // VPUNPCKLQDQ XMM10, XMM15, XMM13  /* m[14], m[6] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xdc; BYTE $0x08// VPALIGNR    XMM11, XMM14, XMM12, 0x8  /*  m[1],  m[4] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xdc   // VPALIGNR    XMM11, XMM14, XMM12, 0x8  /*  m[1],  m[4] */
 | ||||
|                 BYTE $0x08 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
| @ -299,7 +326,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
| @ -322,7 +348,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
| @ -351,7 +376,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
| @ -376,7 +400,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
| @ -388,7 +411,8 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  64(DX), X13        // X13 =  m[8]+ m[9] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xd4; BYTE $0x08// VPALIGNR    XMM10, XMM14, XMM12, 0x8  /*  m[1], m[12] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xd4   // VPALIGNR    XMM10, XMM14, XMM12, 0x8  /*  m[1], m[12] */
 | ||||
|                 BYTE $0x08 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x09; BYTE $0x6d; BYTE $0xde   // VPUNPCKHQDQ XMM11, XMM14, XMM14  /*   ___, m[13] */
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x11; BYTE $0x6c; BYTE $0xdb   // VPUNPCKLQDQ XMM11, XMM13, XMM11  /*  m[8],  ____ */
 | ||||
| 
 | ||||
| @ -406,7 +430,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  16(DX), X13        // X13 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X14        // X14 =  m[6]+ m[7] | ||||
| @ -428,7 +451,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  48(DX), X14        // X14 =  m[6]+ m[7] | ||||
| @ -456,7 +478,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
| @ -466,7 +487,8 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x01; BYTE $0x6c; BYTE $0xcd   // VPUNPCKLQDQ  XMM9, XMM15, XMM13  /* m[14],  m[4] */
 | ||||
|     MOVOU  80(DX), X12        // X12 = m[10]+m[11] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x11; BYTE $0x6d; BYTE $0xd7   // VPUNPCKHQDQ XMM10, XMM13, XMM15  /*  m[5], m[15] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x19; BYTE $0x0f; BYTE $0xde; BYTE $0x08// VPALIGNR    XMM11, XMM12, XMM14, 0x8  /* m[13], m[10] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x19; BYTE $0x0f; BYTE $0xde   // VPALIGNR    XMM11, XMM12, XMM14, 0x8  /* m[13], m[10] */
 | ||||
|                 BYTE $0x08 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
| @ -478,13 +500,13 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X15        // X15 = m[10]+m[11] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x19; BYTE $0x6c; BYTE $0xc5   // VPUNPCKLQDQ  XMM8, XMM12, XMM13  /*  m[0],  m[6] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xce; BYTE $0x08// VPALIGNR     XMM9, XMM14, XMM14, 0x8  /*  m[9],  m[8] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xce   // VPALIGNR     XMM9, XMM14, XMM14, 0x8  /*  m[9],  m[8] */
 | ||||
|                 BYTE $0x08 | ||||
|     MOVOU  16(DX), X14        // X14 =  m[2]+ m[3] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x11; BYTE $0x6d; BYTE $0xd6   // VPUNPCKHQDQ XMM10, XMM13, XMM14  /*  m[7],  m[3] */
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x01; BYTE $0x6d; BYTE $0xdf   // VPUNPCKHQDQ XMM11, XMM15, XMM15  /*   ___, m[11] */
 | ||||
| @ -504,7 +526,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
| @ -515,7 +536,8 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  64(DX), X13        // X13 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x01; BYTE $0x0f; BYTE $0xd6; BYTE $0x08// VPALIGNR    XMM10, XMM15, XMM14, 0x8  /* m[11], m[14] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x01; BYTE $0x0f; BYTE $0xd6   // VPALIGNR    XMM10, XMM15, XMM14, 0x8  /* m[11], m[14] */
 | ||||
|                 BYTE $0x08 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x19; BYTE $0x6d; BYTE $0xdd   // VPUNPCKHQDQ XMM11, XMM12, XMM13  /*  m[1],  m[9] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| @ -528,7 +550,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
| @ -555,17 +576,18 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x11; BYTE $0x6c; BYTE $0xc7   // VPUNPCKLQDQ  XMM8, XMM13, XMM15  /*  m[6], m[14] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x19; BYTE $0x0f; BYTE $0xce; BYTE $0x08// VPALIGNR     XMM9, XMM12, XMM14, 0x8  /* m[11],  m[0] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x19; BYTE $0x0f; BYTE $0xce   // VPALIGNR     XMM9, XMM12, XMM14, 0x8  /* m[11],  m[0] */
 | ||||
|                 BYTE $0x08 | ||||
|     MOVOU  16(DX), X13        // X13 =  m[2]+ m[3] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x01; BYTE $0x6d; BYTE $0xd6   // VPUNPCKHQDQ XMM10, XMM15, XMM14  /* m[15],  m[9] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xdd; BYTE $0x08// VPALIGNR    XMM11, XMM14, XMM13, 0x8  /*  m[3],  m[8] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xdd   // VPALIGNR    XMM11, XMM14, XMM13, 0x8  /*  m[3],  m[8] */
 | ||||
|                 BYTE $0x08 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
| @ -577,14 +599,14 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  16(DX), X13        // X13 =  m[2]+ m[3] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X15        // X15 = m[12]+m[13] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x01; BYTE $0x6d; BYTE $0xc7   // VPUNPCKHQDQ  XMM8, XMM15, XMM15  /*   ___, m[13] */
 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x01; BYTE $0x6c; BYTE $0xc0   // VPUNPCKLQDQ  XMM8, XMM15,  XMM8  /* m[12],  ____ */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xcc; BYTE $0x08// VPALIGNR     XMM9, XMM14, XMM12, 0x8  /*  m[1], m[10] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x09; BYTE $0x0f; BYTE $0xcc   // VPALIGNR     XMM9, XMM14, XMM12, 0x8  /*  m[1], m[10] */
 | ||||
|                 BYTE $0x08 | ||||
|     MOVOU  32(DX), X12        // X12 =  m[4]+ m[5] | ||||
|     MOVOU  48(DX), X15        // X15 =  m[6]+ m[7] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x01; BYTE $0x6d; BYTE $0xd7   // VPUNPCKHQDQ XMM10, XMM15, XMM15  /*   ___,  m[7] */
 | ||||
| @ -606,7 +628,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
| @ -629,7 +650,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  64(DX), X13        // X13 =  m[8]+ m[9] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
| @ -638,7 +658,8 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x19; BYTE $0x6d; BYTE $0xce   // VPUNPCKHQDQ  XMM9, XMM12, XMM14  /*  m[3], m[13] */
 | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x01; BYTE $0x0f; BYTE $0xd5; BYTE $0x08// VPALIGNR    XMM10, XMM15, XMM13, 0x8  /* m[11], m[14] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x01; BYTE $0x0f; BYTE $0xd5   // VPALIGNR    XMM10, XMM15, XMM13, 0x8  /* m[11], m[14] */
 | ||||
|                 BYTE $0x08 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x09; BYTE $0x6c; BYTE $0xdc   // VPUNPCKLQDQ XMM11, XMM14, XMM12  /* m[12],  m[0] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| @ -655,7 +676,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 = m[0]+m[1] | ||||
|     MOVOU  16(DX), X13        // X13 = m[2]+m[3] | ||||
|     MOVOU  32(DX), X14        // X14 = m[4]+m[5] | ||||
| @ -675,7 +695,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU  64(DX), X12        // X12 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
| @ -699,7 +718,6 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU 112(DX), X12        // X12 = m[14]+m[15] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
| @ -709,7 +727,8 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     MOVOU  48(DX), X15        // X15 =  m[6]+ m[7] | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x11; BYTE $0x6c; BYTE $0xd6   // VPUNPCKLQDQ XMM10, XMM13, XMM14  /* m[10],  m[8] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x01; BYTE $0x0f; BYTE $0xdc; BYTE $0x08// VPALIGNR    XMM11, XMM15, XMM12, 0x8  /* m[15],  m[6] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x01; BYTE $0x0f; BYTE $0xdc   // VPALIGNR    XMM11, XMM15, XMM12, 0x8  /* m[15],  m[6] */
 | ||||
|                 BYTE $0x08 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
| @ -721,11 +740,11 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVQ   message+0(FP), DX  // DX: &p (message) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x19; BYTE $0x0f; BYTE $0xc4; BYTE $0x08 // VPALIGNR     XMM8, XMM12, XMM12, 0x8  /*  m[1],  m[0] */
 | ||||
|     BYTE $0xc4; BYTE $0x43; BYTE $0x19; BYTE $0x0f; BYTE $0xc4   // VPALIGNR     XMM8, XMM12, XMM12, 0x8  /*  m[1],  m[0] */
 | ||||
|                 BYTE $0x08 | ||||
|     BYTE $0xc4; BYTE $0x41; BYTE $0x09; BYTE $0x6d; BYTE $0xcd   // VPUNPCKHQDQ  XMM9, XMM14, XMM13  /* m[11], m[5] */
 | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
| @ -740,8 +759,8 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     // Reload digest | ||||
|     MOVQ   in+24(FP),  SI     // SI: &in | ||||
|     // Reload digest (most current value stored in &out) | ||||
|     MOVQ   out+144(FP),  SI     // SI: &out | ||||
|     MOVOU   0(SI), X12        // X12 = in[0]+in[1]      /* row1l = LOAD( &S->h[0] ); */
 | ||||
|     MOVOU  16(SI), X13        // X13 = in[2]+in[3]      /* row1h = LOAD( &S->h[2] ); */
 | ||||
|     MOVOU  32(SI), X14        // X14 = in[4]+in[5]      /* row2l = LOAD( &S->h[4] ); */
 | ||||
| @ -757,12 +776,17 @@ TEXT ·blockAVX(SB), 7, $0 | ||||
|     PXOR   X14, X2           // X2 = X2 ^ X14         /*  STORE( &S->h[4], _mm_xor_si128( LOAD( &S->h[4] ), row2l ) ); */
 | ||||
|     PXOR   X15, X3           // X3 = X3 ^ X15         /*  STORE( &S->h[6], _mm_xor_si128( LOAD( &S->h[6] ), row2h ) ); */
 | ||||
| 
 | ||||
|     // Store digest | ||||
|     MOVQ  out+144(FP), DX     // DX: &out | ||||
|     MOVOU  X0,  0(DX)         // out[0]+out[1] = X0 | ||||
|     MOVOU  X1, 16(DX)         // out[2]+out[3] = X1 | ||||
|     MOVOU  X2, 32(DX)         // out[4]+out[5] = X2 | ||||
|     MOVOU  X3, 48(DX)         // out[6]+out[7] = X3 | ||||
|     // Store digest into &out | ||||
|     MOVQ  out+144(FP), SI     // SI: &out | ||||
|     MOVOU  X0,  0(SI)         // out[0]+out[1] = X0 | ||||
|     MOVOU  X1, 16(SI)         // out[2]+out[3] = X1 | ||||
|     MOVOU  X2, 32(SI)         // out[4]+out[5] = X2 | ||||
|     MOVOU  X3, 48(SI)         // out[6]+out[7] = X3 | ||||
| 
 | ||||
|     // Increment message pointer and check if there's more to do | ||||
|     ADDQ   $128, DX           // message += 128 | ||||
|     SUBQ   $1, R8 | ||||
|     JNZ    loop | ||||
| 
 | ||||
| complete: | ||||
|     RET | ||||
| 
 | ||||
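The `LOAD_MSG` sequences in the hunks above build each round's message operands from 128-bit registers that hold adjacent word pairs (`m[2i]`, `m[2i+1]`). The following is a minimal Go model of the three shuffles involved, assuming the usual three-operand semantics of `VPUNPCKLQDQ`, `VPUNPCKHQDQ` and `VPALIGNR` with an immediate of 8. The type `xmm` and the helper names are illustrative only and are not part of the package.

```go
package main

import "fmt"

// xmm models a 128-bit register as two 64-bit lanes; index 0 is the low lane.
// (Hypothetical helper type, used only for this illustration.)
type xmm [2]uint64

// unpackLo models VPUNPCKLQDQ dst, a, b: low lane of a, then low lane of b.
func unpackLo(a, b xmm) xmm { return xmm{a[0], b[0]} }

// unpackHi models VPUNPCKHQDQ dst, a, b: high lane of a, then high lane of b.
func unpackHi(a, b xmm) xmm { return xmm{a[1], b[1]} }

// alignr8 models VPALIGNR dst, a, b, 0x8: the 256-bit value a:b (a in the
// upper half) is shifted right by 8 bytes and the low 128 bits are kept.
func alignr8(a, b xmm) xmm { return xmm{b[1], a[0]} }

func main() {
	// Use the word index as the word value, so the results can be compared
	// directly with the /* m[..], m[..] */ comments in the assembly.
	pair := func(i uint64) xmm { return xmm{i, i + 1} }

	fmt.Println(unpackLo(pair(10), pair(8))) // lanes: m[10], m[8]
	fmt.Println(unpackHi(pair(10), pair(4))) // lanes: m[11], m[5]
	fmt.Println(alignr8(pair(6), pair(14)))  // lanes: m[15], m[6]
	fmt.Println(alignr8(pair(0), pair(0)))   // lanes: m[1],  m[0]
}
```

Passing the same register twice to `alignr8` swaps its two lanes, which is how the assembly produces pairs such as `m[1], m[0]` from a single load.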
							
								
								
									
										40
									
								
								vendor/github.com/minio/blake2b-simd/compressSse_amd64.go
									
									
									
										generated
									
									
										vendored
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @ -0,0 +1,40 @@ | ||||
| //+build !noasm | ||||
| //+build !appengine | ||||
| 
 | ||||
| /* | ||||
|  * Minio Cloud Storage, (C) 2016 Minio, Inc. | ||||
|  * | ||||
|  * Licensed under the Apache License, Version 2.0 (the "License"); | ||||
|  * you may not use this file except in compliance with the License. | ||||
|  * You may obtain a copy of the License at | ||||
|  * | ||||
|  *     http://www.apache.org/licenses/LICENSE-2.0 | ||||
|  * | ||||
|  * Unless required by applicable law or agreed to in writing, software | ||||
|  * distributed under the License is distributed on an "AS IS" BASIS, | ||||
|  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
|  * See the License for the specific language governing permissions and | ||||
|  * limitations under the License. | ||||
|  */ | ||||
| 
 | ||||
| package blake2b | ||||
| 
 | ||||
| //go:noescape | ||||
| func blockSSELoop(p []uint8, in, iv, t, f, shffle, out []uint64) | ||||
| 
 | ||||
| func compressSSE(d *digest, p []uint8) { | ||||
| 
 | ||||
| 	in := make([]uint64, 8, 8) | ||||
| 	out := make([]uint64, 8, 8) | ||||
| 
 | ||||
| 	shffle := make([]uint64, 2, 2) | ||||
| 	// vector for PSHUFB instruction | ||||
| 	shffle[0] = 0x0201000706050403 | ||||
| 	shffle[1] = 0x0a09080f0e0d0c0b | ||||
| 
 | ||||
| 	in[0], in[1], in[2], in[3], in[4], in[5], in[6], in[7] = d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7] | ||||
| 
 | ||||
| 	blockSSELoop(p, in, iv[:], d.t[:], d.f[:], shffle, out) | ||||
| 
 | ||||
| 	d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7] = out[0], out[1], out[2], out[3], out[4], out[5], out[6], out[7] | ||||
| } | ||||
							
								
								
									
										882
									
								
								vendor/github.com/minio/blake2b-simd/compressSse_amd64.s
									
									
									
										generated
									
									
										vendored
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @ -0,0 +1,882 @@ | ||||
| //+build !noasm !appengine | ||||
| 
 | ||||
| // | ||||
| // Minio Cloud Storage, (C) 2016 Minio, Inc. | ||||
| // | ||||
| // Licensed under the Apache License, Version 2.0 (the "License");
 | ||||
| // you may not use this file except in compliance with the License. | ||||
| // You may obtain a copy of the License at | ||||
| // | ||||
| //     http://www.apache.org/licenses/LICENSE-2.0 | ||||
| // | ||||
| // Unless required by applicable law or agreed to in writing, software | ||||
| // distributed under the License is distributed on an "AS IS" BASIS, | ||||
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
| // See the License for the specific language governing permissions and | ||||
| // limitations under the License. | ||||
| // | ||||
| 
 | ||||
| // | ||||
| // Based on SSE implementation from https://github.com/BLAKE2/BLAKE2/blob/master/sse/blake2b.c | ||||
| // | ||||
| // Use github.com/fwessels/asm2plan9s on this file to assemble instructions to their Plan9 equivalent | ||||
| // | ||||
| // Assembly code below essentially follows the ROUND macro (see blake2b-round.h) which is defined as: | ||||
| //   #define ROUND(r) \ | ||||
| //     LOAD_MSG_ ##r ##_1(b0, b1); \
 | ||||
| //     G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
 | ||||
| //     LOAD_MSG_ ##r ##_2(b0, b1); \
 | ||||
| //     G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
 | ||||
| //     DIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h); \
 | ||||
| //     LOAD_MSG_ ##r ##_3(b0, b1); \
 | ||||
| //     G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
 | ||||
| //     LOAD_MSG_ ##r ##_4(b0, b1); \
 | ||||
| //     G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
 | ||||
| //     UNDIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h);
 | ||||
| // | ||||
| // as well as the go equivalent in https://github.com/dchest/blake2b/blob/master/block.go | ||||
| // | ||||
| // As in the macro, G1/G2 in the 1st and 2nd half are identical (so literal copy of assembly) | ||||
| // | ||||
| // Rounds are also the same, except for the loading of the message (and rounds 1 & 11 and | ||||
| // rounds 2 & 12 are identical) | ||||
| // | ||||
| 
 | ||||
| #define G1 \ | ||||
|     \ // G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1);
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xd4; BYTE $0xc0 \ // PADDQ  XMM0,XMM8        /* v0 += m[0], v1 += m[2] */
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xd4; BYTE $0xc9 \ // PADDQ  XMM1,XMM9        /* v2 += m[4], v3 += m[6] */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xc2             \ // PADDQ  XMM0,XMM2        /* v0 += v4, v1 += v5 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xcb             \ // PADDQ  XMM1,XMM3        /* v2 += v6, v3 += v7 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xf0             \ // PXOR   XMM6,XMM0        /* v12 ^= v0, v13 ^= v1 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xf9             \ // PXOR   XMM7,XMM1        /* v14 ^= v2, v15 ^= v3 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0x70; BYTE $0xf6; BYTE $0xb1 \ // PSHUFD XMM6,XMM6,0xb1   /* v12 = v12<<(64-32) | v12>>32, v13 = v13<<(64-32) | v13>>32 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0x70; BYTE $0xff; BYTE $0xb1 \ // PSHUFD XMM7,XMM7,0xb1   /* v14 = v14<<(64-32) | v14>>32, v15 = v15<<(64-32) | v15>>32 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xe6             \ // PADDQ  XMM4,XMM6        /* v8 += v12, v9 += v13  */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xef             \ // PADDQ  XMM5,XMM7        /* v10 += v14, v11 += v15 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xd4             \ // PXOR   XMM2,XMM4        /* v4 ^= v8, v5 ^= v9 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xdd             \ // PXOR   XMM3,XMM5        /* v6 ^= v10, v7 ^= v11 */
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x38; BYTE $0x00 \ // PSHUFB XMM2,XMM12       /* v4 = v4<<(64-24) | v4>>24, v5 = v5<<(64-24) | v5>>24 */
 | ||||
|                 BYTE $0xd4                                     \ | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x38; BYTE $0x00 \ // PSHUFB XMM3,XMM12       /* v6 = v6<<(64-24) | v6>>24, v7 = v7<<(64-24) | v7>>24 */
 | ||||
|                 BYTE $0xdc                                     \ | ||||
|     // DO NOT DELETE -- macro delimiter (previous line extended) | ||||
| 
 | ||||
| #define G2 \ | ||||
|     \ // G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1);
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xd4; BYTE $0xc2 \ // PADDQ  XMM0,XMM10        /* v0 += m[1], v1 += m[3] */
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xd4; BYTE $0xcb \ // PADDQ  XMM1,XMM11        /* v2 += m[5], v3 += m[7] */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xc2             \ // PADDQ  XMM0,XMM2         /* v0 += v4, v1 += v5 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xcb             \ // PADDQ  XMM1,XMM3         /* v2 += v6, v3 += v7 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xf0             \ // PXOR   XMM6,XMM0         /* v12 ^= v0, v13 ^= v1 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xf9             \ // PXOR   XMM7,XMM1         /* v14 ^= v2, v15 ^= v3 */
 | ||||
|     BYTE $0xf2; BYTE $0x0f; BYTE $0x70; BYTE $0xf6; BYTE $0x39 \ // PSHUFLW XMM6,XMM6,0x39   /* combined with next ... */
 | ||||
|     BYTE $0xf3; BYTE $0x0f; BYTE $0x70; BYTE $0xf6; BYTE $0x39 \ // PSHUFHW XMM6,XMM6,0x39   /* v12 = v12<<(64-16) | v12>>16, v13 = v13<<(64-16) | v13>>16 */
 | ||||
|     BYTE $0xf2; BYTE $0x0f; BYTE $0x70; BYTE $0xff; BYTE $0x39 \ // PSHUFLW XMM7,XMM7,0x39   /* combined with next ... */
 | ||||
|     BYTE $0xf3; BYTE $0x0f; BYTE $0x70; BYTE $0xff; BYTE $0x39 \ // PSHUFHW XMM7,XMM7,0x39   /* v14 = v14<<(64-16) | v14>>16, v15 = v15<<(64-16) | v15>>16 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xe6             \ // PADDQ  XMM4,XMM6         /* v8 += v12, v9 += v13 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xef             \ // PADDQ  XMM5,XMM7         /* v10 += v14, v11 += v15 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xd4             \ // PXOR   XMM2,XMM4         /* v4 ^= v8, v5 ^= v9 */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xdd             \ // PXOR   XMM3,XMM5         /* v6 ^= v10, v7 ^= v11 */
 | ||||
|     MOVOU X2, X15 \ | ||||
|     BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0xd4; BYTE $0xfa \ // PADDQ  XMM15,XMM2        /* temp reg = reg*2   */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0x73; BYTE $0xd2; BYTE $0x3f \ // PSRLQ  XMM2,0x3f         /*      reg = reg>>63 */
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xef; BYTE $0xd7 \ // PXOR   XMM2,XMM15        /* ORed together: v4 = v4<<(64-63) | v4>>63, v5 = v5<<(64-63) | v5>>63 */
 | ||||
|     MOVOU X3, X15 \ | ||||
|     BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0xd4; BYTE $0xfb \ // PADDQ XMM15,XMM3         /* temp reg = reg*2   */
 | ||||
|     BYTE $0x66; BYTE $0x0f; BYTE $0x73; BYTE $0xd3; BYTE $0x3f \ // PSRLQ XMM3,0x3f          /*      reg = reg>>63 */
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xef; BYTE $0xdf   // PXOR  XMM3,XMM15         /* ORed together: v6 = v6<<(64-63) | v6>>63, v7 = v7<<(64-63) | v7>>63 */
 | ||||
| 
 | ||||
| #define DIAGONALIZE \ | ||||
|     \ // DIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h);
 | ||||
|     MOVOU  X6, X13 \                                                                                 /*  t0 = row4l;\                                                           */
 | ||||
|     MOVOU  X2, X14 \                                                                                 /*  t1 = row2l;\                                                           */
 | ||||
|     MOVOU  X4, X6  \                                                                                 /*  row4l = row3l;\                                                        */
 | ||||
|     MOVOU  X5, X4 \                                                                                  /*  row3l = row3h;\                                                        */
 | ||||
|     MOVOU  X6, X5 \                                                                                  /*  row3h = row4l;\                                                        */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xfd \ // PUNPCKLQDQ XMM15, XMM13          /*                                    _mm_unpacklo_epi64(t0, t0)           */
 | ||||
|     MOVOU  X7, X6 \ | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xf7 \ // PUNPCKHQDQ  XMM6, XMM15          /*  row4l = _mm_unpackhi_epi64(row4h,                           ); \       */
 | ||||
|     BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xff \ // PUNPCKLQDQ XMM15,  XMM7          /*                                 _mm_unpacklo_epi64(row4h, row4h)        */
 | ||||
|     MOVOU X13, X7 \ | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xff \ // PUNPCKHQDQ  XMM7, XMM15          /*  row4h = _mm_unpackhi_epi64(t0,                                 ); \    */
 | ||||
|     BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xfb \ // PUNPCKLQDQ XMM15,  XMM3          /*                                    _mm_unpacklo_epi64(row2h, row2h)     */
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xd7 \ // PUNPCKHQDQ  XMM2, XMM15          /*  row2l = _mm_unpackhi_epi64(row2l,                                 ); \ */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xfe \ // PUNPCKLQDQ XMM15, XMM14          /*                                    _mm_unpacklo_epi64(t1, t1)           */
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf   // PUNPCKHQDQ  XMM3, XMM15          /*  row2h = _mm_unpackhi_epi64(row2h,                           )          */
 | ||||
| 
 | ||||
| #define UNDIAGONALIZE \ | ||||
|     \ // UNDIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h);
 | ||||
|     MOVOU  X4, X13 \                                                                                 /* t0 = row3l;\                                                            */
 | ||||
|     MOVOU  X5, X4 \                                                                                  /* row3l = row3h;\                                                         */
 | ||||
|     MOVOU X13, X5 \                                                                                  /* row3h = t0;\                                                            */
 | ||||
|     MOVOU  X2, X13 \                                                                                 /* t0 = row2l;\                                                            */
 | ||||
|     MOVOU  X6, X14 \                                                                                 /* t1 = row4l;\                                                            */
 | ||||
|     BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xfa \ // PUNPCKLQDQ XMM15,  XMM2          /*                                    _mm_unpacklo_epi64(row2l, row2l)     */
 | ||||
|     MOVOU  X3, X2 \ | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xd7 \ // PUNPCKHQDQ  XMM2, XMM15          /*  row2l = _mm_unpackhi_epi64(row2h,                                 ); \ */
 | ||||
|     BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xfb \ // PUNPCKLQDQ XMM15,  XMM3          /*                                 _mm_unpacklo_epi64(row2h, row2h)        */
 | ||||
|     MOVOU X13, X3 \ | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf \ // PUNPCKHQDQ  XMM3, XMM15          /*  row2h = _mm_unpackhi_epi64(t0,                                 ); \    */
 | ||||
|     BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xff \ // PUNPCKLQDQ XMM15,  XMM7          /*                                    _mm_unpacklo_epi64(row4h, row4h)     */
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xf7 \ // PUNPCKHQDQ  XMM6, XMM15          /*  row4l = _mm_unpackhi_epi64(row4l,                                 ); \ */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xfe \ // PUNPCKLQDQ XMM15, XMM14          /*                                    _mm_unpacklo_epi64(t1, t1)           */
 | ||||
|     BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xff   // PUNPCKHQDQ  XMM7, XMM15          /*  row4h = _mm_unpackhi_epi64(row4h,                           )          */
 | ||||
| 
 | ||||
| #define LOAD_SHUFFLE \ | ||||
|     \ // Load shuffle value | ||||
|     MOVQ   shffle+120(FP), SI \ // SI: &shuffle | ||||
|     MOVOU  0(SI), X12           // X12 = 03040506 07000102 0b0c0d0e 0f08090a | ||||
| 
 | ||||
| // func blockSSELoop(p []uint8, in, iv, t, f, shffle, out []uint64) | ||||
| TEXT ·blockSSELoop(SB), 7, $0 | ||||
|     // REGISTER USE | ||||
|     //        R8: loop counter | ||||
|     //        DX: message pointer | ||||
|     //        SI: temp pointer for loading | ||||
|     //  X0 -  X7: v0 - v15 | ||||
|     //  X8 - X11: m[0] - m[7] | ||||
|     //       X12: shuffle value | ||||
|     // X13 - X15: temp registers | ||||
| 
 | ||||
|     // Load digest | ||||
|     MOVQ   in+24(FP),  SI     // SI: &in | ||||
|     MOVOU   0(SI), X0         // X0 = in[0]+in[1]      /* row1l = LOAD( &S->h[0] ); */
 | ||||
|     MOVOU  16(SI), X1         // X1 = in[2]+in[3]      /* row1h = LOAD( &S->h[2] ); */
 | ||||
|     MOVOU  32(SI), X2         // X2 = in[4]+in[5]      /* row2l = LOAD( &S->h[4] ); */
 | ||||
|     MOVOU  48(SI), X3         // X3 = in[6]+in[7]      /* row2h = LOAD( &S->h[6] ); */
 | ||||
| 
 | ||||
|     // Store the digest into &out up front (so it can be reloaded generically later) | ||||
|     MOVQ  out+144(FP), SI     // SI: &out | ||||
|     MOVOU  X0,  0(SI)         // out[0]+out[1] = X0 | ||||
|     MOVOU  X1, 16(SI)         // out[2]+out[3] = X1 | ||||
|     MOVOU  X2, 32(SI)         // out[4]+out[5] = X2 | ||||
|     MOVOU  X3, 48(SI)         // out[6]+out[7] = X3 | ||||
| 
 | ||||
|     // Initialize message pointer and loop counter | ||||
|     MOVQ   p+0(FP), DX        // DX: &p (message) | ||||
|     MOVQ   p_len+8(FP), R8    // R8: len(message) | ||||
|     SHRQ   $7, R8             // len(message) / 128 | ||||
|     CMPQ   R8, $0 | ||||
|     JEQ    complete | ||||
| 
 | ||||
| loop: | ||||
|     // Increment counter | ||||
|     MOVQ t+72(FP), SI         // SI: &t | ||||
|     MOVQ   0(SI), R9          // | ||||
|     ADDQ   $128, R9           //                       /* d.t[0] += BlockSize */ | ||||
|     MOVQ   R9, 0(SI)          // | ||||
|     CMPQ   R9, $128           //                       /* if d.t[0] < BlockSize { */ | ||||
|     JGE    noincr             // | ||||
|     MOVQ   8(SI), R9          // | ||||
|     ADDQ   $1, R9             //                       /*     d.t[1]++ */ | ||||
|     MOVQ   R9, 8(SI)          // | ||||
| noincr:                       //                       /* } */ | ||||
| 
 | ||||
|     // Load initialization vector | ||||
|     MOVQ iv+48(FP), SI        // SI: &iv | ||||
|     MOVOU   0(SI), X4         // X4 = iv[0]+iv[1]      /* row3l = LOAD( &blake2b_IV[0] ); */
 | ||||
|     MOVOU  16(SI), X5         // X5 = iv[2]+iv[3]      /* row3h = LOAD( &blake2b_IV[2] ); */
 | ||||
|     MOVOU  32(SI), X6         // X6 = iv[4]+iv[5]      /*                        LOAD( &blake2b_IV[4] )                      */ | ||||
|     MOVOU  48(SI), X7         // X7 = iv[6]+iv[7]      /*                        LOAD( &blake2b_IV[6] )                      */ | ||||
|     MOVQ t+72(FP), SI         // SI: &t | ||||
|     MOVOU   0(SI), X8         // X8 = t[0]+t[1]        /*                                                LOAD( &S->t[0] )    */ | ||||
|     PXOR       X8, X6         // X6 = X6 ^ X8          /* row4l = _mm_xor_si128(                       ,                  ); */
 | ||||
|     MOVQ f+96(FP), SI         // SI: &f | ||||
|     MOVOU   0(SI), X8         // X8 = f[0]+f[1]        /*                                                LOAD( &S->f[0] )    */ | ||||
|     PXOR       X8, X7         // X7 = X7 ^ X8          /* row4h = _mm_xor_si128(                       ,                  ); */
 | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   1 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 = m[0]+m[1] | ||||
|     MOVOU  16(DX), X13        // X13 = m[2]+m[3] | ||||
|     MOVOU  32(DX), X14        // X14 = m[4]+m[5] | ||||
|     MOVOU  48(DX), X15        // X15 = m[6]+m[7] | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5   // PUNPCKLQDQ  XMM8, XMM13  /* m[0], m[2] */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf   // PUNPCKLQDQ  XMM9, XMM15  /* m[4], m[6] */
 | ||||
|     MOVOU     X12, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5   // PUNPCKHQDQ XMM10, XMM13  /* m[1], m[3] */
 | ||||
|     MOVOU     X14, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf   // PUNPCKHQDQ XMM11, XMM15  /* m[5], m[7] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  64(DX), X12        // X12 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5   // PUNPCKLQDQ  XMM8, XMM13  /*  m[8],m[10] */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf   // PUNPCKLQDQ  XMM9, XMM15  /* m[12],m[14] */
 | ||||
|     MOVOU     X12, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5   // PUNPCKHQDQ XMM10, XMM13  /*  m[9],m[11] */
 | ||||
|     MOVOU     X14, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf   // PUNPCKHQDQ XMM11, XMM15  /* m[13],m[15] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   2 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU 112(DX), X12        // X12 = m[14]+m[15] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
|     MOVOU  96(DX), X15        // X15 = m[12]+m[13] | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5   // PUNPCKLQDQ  XMM8, XMM13  /* m[14],  m[4] */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcf   // PUNPCKHQDQ  XMM9, XMM15  /*  m[9], m[13] */
 | ||||
|     MOVOU  80(DX), X10        // X10 = m[10]+m[11] | ||||
|     MOVOU  48(DX), X11        // X11 =  m[6]+ m[7] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6   // PUNPCKLQDQ XMM10, XMM14  /* m[10],  m[8] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR    XMM11, XMM12, 0x8  /* m[15],  m[6] */
 | ||||
|                 BYTE $0xdc; BYTE $0x08
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR     XMM8, XMM12, 0x8  /*  m[1],  m[0] */
 | ||||
|                 BYTE $0xc4; BYTE $0x08
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcd   // PUNPCKHQDQ  XMM9, XMM13  /* m[11], m[5] */
 | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X11        // X11 =  m[6]+ m[7] | ||||
|     MOVOU  96(DX), X10        // X10 = m[12]+m[13] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd4   // PUNPCKLQDQ XMM10, XMM12  /* m[12], m[2] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdc   // PUNPCKHQDQ XMM11, XMM12  /*  m[7], m[3] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   3 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  32(DX), X12        // X12 =  m[4]+ m[5] | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X14, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR     XMM8, XMM13, 0x8  /* m[11],  m[12] */
 | ||||
|                 BYTE $0xc5; BYTE $0x08
 | ||||
|     MOVOU     X12, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcf   // PUNPCKHQDQ  XMM9, XMM15  /*  m[5], m[15] */
 | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  16(DX), X13        // X13 =  m[2]+ m[3] | ||||
|     MOVOU  64(DX), X10        // X10 =  m[8]+ m[9] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd4   // PUNPCKLQDQ XMM10, XMM12  /*  m[8],  m[0] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xf6   // PUNPCKHQDQ XMM14, XMM14  /*   ___, m[13] */
 | ||||
|     MOVOU     X13, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xde   // PUNPCKLQDQ XMM11, XMM14  /*  m[2],   ___ */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X15        // X15 = m[10]+m[11] | ||||
|     MOVOU     X12, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcc   // PUNPCKHQDQ  XMM9, XMM12  /*   ___, m[3] */
 | ||||
|     MOVOU     X15, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1   // PUNPCKLQDQ  XMM8,  XMM9  /* m[10],  ___ */
 | ||||
|     MOVOU     X13, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xce   // PUNPCKHQDQ  XMM9, XMM14  /*  m[7], m[9] */
 | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  32(DX), X11        // X11 =  m[4]+ m[5] | ||||
|     MOVOU 112(DX), X10        // X10 = m[14]+m[15] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd5   // PUNPCKLQDQ XMM10, XMM13  /* m[14], m[6] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR    XMM11, XMM12, 0x8  /*  m[1],  m[4] */
 | ||||
|                 BYTE $0xdc; BYTE $0x08
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   4 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X15        // X15 = m[12]+m[13] | ||||
|     MOVOU     X13, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc4   // PUNPCKHQDQ  XMM8, XMM12  /*  m[7],  m[3] */
 | ||||
|     MOVOU     X15, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xce   // PUNPCKHQDQ  XMM9, XMM14  /* m[13], m[11] */
 | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  64(DX), X10        // X10 =  m[8]+ m[9] | ||||
|     MOVOU 112(DX), X14        // X14 = m[14]+m[15] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd4   // PUNPCKHQDQ XMM10, XMM12  /*  m[9],  m[1] */
 | ||||
|     MOVOU     X15, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xde   // PUNPCKLQDQ XMM11, XMM14  /* m[12], m[14] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X13, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcd   // PUNPCKHQDQ  XMM9, XMM13  /*   ___,  m[5] */
 | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1   // PUNPCKLQDQ  XMM8,  XMM9  /*  m[2],  ____ */
 | ||||
|     MOVOU     X15, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd7   // PUNPCKHQDQ XMM10, XMM15  /*   ___, m[15] */
 | ||||
|     MOVOU     X13, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xca   // PUNPCKLQDQ  XMM9, XMM10  /*  m[4],  ____ */
 | ||||
|     MOVOU   0(DX), X11        // X11 =  m[0]+ m[1] | ||||
|     MOVOU  48(DX), X10        // X10 =  m[6]+ m[7] | ||||
|     MOVOU  64(DX), X15        // X15 =  m[8]+ m[9] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6   // PUNPCKLQDQ XMM10, XMM14  /*  m[6], m[10] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf   // PUNPCKLQDQ XMM11, XMM15  /*  m[0],  m[8] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   5 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X15        // X15 = m[10]+m[11] | ||||
|     MOVOU     X14, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc5   // PUNPCKHQDQ  XMM8, XMM13  /*  m[9],  m[5] */
 | ||||
|     MOVOU     X12, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf   // PUNPCKLQDQ  XMM9, XMM15  /*  m[2], m[10] */
 | ||||
|     MOVOU   0(DX), X10        // X10 =  m[0]+ m[1] | ||||
|     MOVOU  48(DX), X14        // X14 =  m[6]+ m[7] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xf6   // PUNPCKHQDQ XMM14, XMM14  /*   ___,  m[7] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6   // PUNPCKLQDQ XMM10, XMM14  /*  m[0],  ____ */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xff   // PUNPCKHQDQ XMM15, XMM15  /*   ___, m[15] */
 | ||||
|     MOVOU     X13, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf   // PUNPCKLQDQ XMM11, XMM15  /*  m[4],  ____ */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xf6   // PUNPCKHQDQ  XMM14, XMM14  /*   ___, m[11] */
 | ||||
|     MOVOU     X15, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc6   // PUNPCKLQDQ  XMM8,  XMM14  /* m[14],  ____ */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xe4   // PUNPCKHQDQ  XMM12, XMM12  /*   ___,  m[3] */
 | ||||
|     MOVOU     X13, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcc   // PUNPCKLQDQ  XMM9,  XMM12  /*  m[6],  ____ */
 | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  64(DX), X11        // X11 =  m[8]+ m[9] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
|     MOVOU     X14, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR    XMM10, XMM12, 0x8  /*  m[1], m[12] */
 | ||||
|                 BYTE $0xd4; BYTE $0x08
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xf6   // PUNPCKHQDQ XMM14, XMM14  /*   ___, m[13] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xde   // PUNPCKLQDQ XMM11, XMM14  /*  m[8],  ____ */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   6 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  16(DX), X13        // X13 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X14        // X14 =  m[6]+ m[7] | ||||
|     MOVOU  64(DX), X15        // X15 =  m[8]+ m[9] | ||||
|     MOVOU     X13, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc6   // PUNPCKLQDQ  XMM8, XMM14  /*  m[2],  m[6] */
 | ||||
|     MOVOU     X12, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf   // PUNPCKLQDQ  XMM9, XMM15  /*  m[0],  m[8] */
 | ||||
|     MOVOU  80(DX), X12        // X12 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X10        // X10 = m[12]+m[13] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd4   // PUNPCKLQDQ XMM10, XMM12  /* m[12], m[10] */
 | ||||
|     MOVOU     X12, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdd   // PUNPCKHQDQ XMM11, XMM13  /* m[11],  m[3] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  48(DX), X14        // X14 =  m[6]+ m[7] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xce   // PUNPCKHQDQ  XMM9, XMM14  /*   ___,  m[7] */
 | ||||
|     MOVOU     X13, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1   // PUNPCKLQDQ  XMM8,  XMM9  /*  m[4],  ____ */
 | ||||
|     MOVOU     X15, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcc   // PUNPCKHQDQ  XMM9, XMM12  /* m[15],  m[1] */
 | ||||
|     MOVOU  64(DX), X12        // X12 =  m[8]+ m[9] | ||||
|     MOVOU  96(DX), X10        // X10 = m[12]+m[13] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5   // PUNPCKHQDQ XMM10, XMM13  /* m[13],  m[5] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xe4   // PUNPCKHQDQ XMM12, XMM12  /*   ___,  m[9] */
 | ||||
|     MOVOU     X15, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdc   // PUNPCKLQDQ XMM11, XMM12  /* m[14],  ____ */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   7 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X12, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcc   // PUNPCKHQDQ  XMM9, XMM12  /*   ___,  m[1] */
 | ||||
|     MOVOU     X14, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1   // PUNPCKLQDQ  XMM8,  XMM9  /* m[12],  ____ */
 | ||||
|     MOVOU     X15, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcd   // PUNPCKLQDQ  XMM9, XMM13  /* m[14],  m[4] */
 | ||||
|     MOVOU  80(DX), X11        // X11 = m[10]+m[11] | ||||
|     MOVOU     X13, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd7   // PUNPCKHQDQ XMM10, XMM15  /*  m[5], m[15] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR    XMM11, XMM14, 0x8  /* m[13], m[10] */
 | ||||
|                 BYTE $0xde; BYTE $0x08
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X15        // X15 = m[10]+m[11] | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5   // PUNPCKLQDQ  XMM8, XMM13  /*  m[0],  m[6] */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR     XMM9, XMM14, 0x8  /*  m[9],  m[8] */
 | ||||
|                 BYTE $0xce; BYTE $0x08
 | ||||
|     MOVOU  16(DX), X11        // X11 =  m[2]+ m[3] | ||||
|     MOVOU     X13, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd3   // PUNPCKHQDQ XMM10, XMM11  /*  m[7],  m[3] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xff   // PUNPCKHQDQ XMM15, XMM15  /*   ___, m[11] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf   // PUNPCKLQDQ XMM11, XMM15  /*  m[2],  ____ */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   8 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X14, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc5   // PUNPCKHQDQ  XMM8, XMM13  /* m[13],  m[7] */
 | ||||
|     MOVOU     X12, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd4   // PUNPCKHQDQ XMM10, XMM12  /*   ___,  m[3] */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xca   // PUNPCKLQDQ  XMM9,  XMM10  /* m[12],  ____ */
 | ||||
|     MOVOU   0(DX), X11        // X11 =  m[0]+ m[1] | ||||
|     MOVOU  64(DX), X13        // X13 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU     X15, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR    XMM10, XMM14, 0x8  /* m[11], m[14] */
 | ||||
|                 BYTE $0xd6; BYTE $0x08
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdd   // PUNPCKHQDQ XMM11, XMM13  /*  m[1],  m[9] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X13, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc7   // PUNPCKHQDQ  XMM8, XMM15  /*  m[5], m[15] */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcc   // PUNPCKLQDQ  XMM9, XMM12  /*  m[8],  m[2] */
 | ||||
|     MOVOU   0(DX), X10        // X10 =  m[0]+ m[1] | ||||
|     MOVOU  48(DX), X11        // X11 =  m[6]+ m[7] | ||||
|     MOVOU  80(DX), X15        // X15 = m[10]+m[11] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd5   // PUNPCKLQDQ XMM10, XMM13  /*  m[0],  m[4] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf   // PUNPCKLQDQ XMM11, XMM15  /*  m[6], m[10] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   9 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X13, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc7   // PUNPCKLQDQ  XMM8, XMM15  /*  m[6], m[14] */
 | ||||
|     MOVOU     X12, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR     XMM9, XMM14, 0x8  /* m[11],  m[0] */
 | ||||
|                 BYTE $0xce; BYTE $0x08
 | ||||
|     MOVOU  16(DX), X13        // X13 =  m[2]+ m[3] | ||||
|     MOVOU  64(DX), X11        // X11 =  m[8]+ m[9] | ||||
|     MOVOU     X15, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd3   // PUNPCKHQDQ XMM10, XMM11  /* m[15],  m[9] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR    XMM11, XMM13, 0x8  /*  m[3],  m[8] */
 | ||||
|                 BYTE $0xdd; BYTE $0x08
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  16(DX), X13        // X13 =  m[2]+ m[3] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X15        // X15 = m[12]+m[13] | ||||
|     MOVOU     X15, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcf   // PUNPCKHQDQ  XMM9, XMM15  /*   ___, m[13] */
 | ||||
|     MOVOU     X15, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1   // PUNPCKLQDQ  XMM8,  XMM9  /* m[12],  ____ */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR     XMM9, XMM12, 0x8  /*  m[1], m[10] */
 | ||||
|                 BYTE $0xcc; BYTE $0x08
 | ||||
|     MOVOU  32(DX), X12        // X12 =  m[4]+ m[5] | ||||
|     MOVOU  48(DX), X15        // X15 =  m[6]+ m[7] | ||||
|     MOVOU     X15, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf   // PUNPCKHQDQ XMM11, XMM15  /*   ___,  m[7] */
 | ||||
|     MOVOU     X13, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd3   // PUNPCKLQDQ XMM10, XMM11  /*  m[2],  ____ */
 | ||||
|     MOVOU     X12, X15 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xfc   // PUNPCKHQDQ XMM15, XMM12  /*   ___,  m[5] */
 | ||||
|     MOVOU     X12, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf   // PUNPCKLQDQ XMM11, XMM15  /*  m[4],  ____ */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   1 0 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  48(DX), X13        // X13 =  m[6]+ m[7] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X15        // X15 = m[10]+m[11] | ||||
|     MOVOU     X15, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc6   // PUNPCKLQDQ  XMM8, XMM14  /* m[10],  m[8] */
 | ||||
|     MOVOU     X13, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcc   // PUNPCKHQDQ  XMM9, XMM12  /*  m[7],  m[1] */
 | ||||
|     MOVOU  16(DX), X10        // X10 =  m[2]+ m[3] | ||||
|     MOVOU  32(DX), X14        // X14 =  m[4]+ m[5] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6   // PUNPCKLQDQ XMM10, XMM14  /*  m[2],  m[4] */
 | ||||
|     MOVOU     X14, X15 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xfe   // PUNPCKHQDQ XMM15, XMM14  /*   ___,  m[5] */
 | ||||
|     MOVOU     X13, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf   // PUNPCKLQDQ XMM11, XMM15  /*  m[6],  ____ */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  64(DX), X13        // X13 =  m[8]+ m[9] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X15, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc5   // PUNPCKHQDQ  XMM8, XMM13  /* m[15],  m[9] */
 | ||||
|     MOVOU     X12, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xce   // PUNPCKHQDQ  XMM9, XMM14  /*  m[3], m[13] */
 | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     MOVOU     X15, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR    XMM10, XMM13, 0x8  /* m[11], m[14] */
 | ||||
|                 BYTE $0xd5; BYTE $0x08
 | ||||
|     MOVOU     X14, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdc   // PUNPCKLQDQ XMM11, XMM12  /* m[12],  m[0] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   1 1 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 = m[0]+m[1] | ||||
|     MOVOU  16(DX), X13        // X13 = m[2]+m[3] | ||||
|     MOVOU  32(DX), X14        // X14 = m[4]+m[5] | ||||
|     MOVOU  48(DX), X15        // X15 = m[6]+m[7] | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5   // PUNPCKLQDQ  XMM8, XMM13  /* m[0], m[2] */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf   // PUNPCKLQDQ  XMM9, XMM15  /* m[4], m[6] */
 | ||||
|     MOVOU     X12, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5   // PUNPCKHQDQ XMM10, XMM13  /* m[1], m[3] */
 | ||||
|     MOVOU     X14, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf   // PUNPCKHQDQ XMM11, XMM15  /* m[5], m[7] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU  64(DX), X12        // X12 =  m[8]+ m[9] | ||||
|     MOVOU  80(DX), X13        // X13 = m[10]+m[11] | ||||
|     MOVOU  96(DX), X14        // X14 = m[12]+m[13] | ||||
|     MOVOU 112(DX), X15        // X15 = m[14]+m[15] | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5   // PUNPCKLQDQ  XMM8, XMM13  /*  m[8],m[10] */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf   // PUNPCKLQDQ  XMM9, XMM15  /* m[12],m[14] */
 | ||||
|     MOVOU     X12, X10 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5   // PUNPCKHQDQ XMM10, XMM13  /*  m[9],m[11] */
 | ||||
|     MOVOU     X14, X11 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf   // PUNPCKHQDQ XMM11, XMM15  /* m[13],m[15] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
|     // R O U N D   1 2 | ||||
|     /////////////////////////////////////////////////////////////////////////// | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_1(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_2(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU 112(DX), X12        // X12 = m[14]+m[15] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  64(DX), X14        // X14 =  m[8]+ m[9] | ||||
|     MOVOU  96(DX), X15        // X15 = m[12]+m[13] | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5   // PUNPCKLQDQ  XMM8, XMM13  /* m[14],  m[4] */
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcf   // PUNPCKHQDQ  XMM9, XMM15  /*  m[9], m[13] */
 | ||||
|     MOVOU  80(DX), X10        // X10 = m[10]+m[11] | ||||
|     MOVOU  48(DX), X11        // X11 =  m[6]+ m[7] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6   // PUNPCKLQDQ XMM10, XMM14  /* m[10],  m[8] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR    XMM11, XMM12, 0x8  /* m[15],  m[6] */
 | ||||
|                 BYTE $0xdc; BYTE $0x08
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     DIAGONALIZE | ||||
| 
 | ||||
|     // LOAD_MSG_ ##r ##_3(b0, b1);
 | ||||
|     // LOAD_MSG_ ##r ##_4(b0, b1);
 | ||||
|     //   (X12 used as additional temp register) | ||||
|     MOVOU   0(DX), X12        // X12 =  m[0]+ m[1] | ||||
|     MOVOU  32(DX), X13        // X13 =  m[4]+ m[5] | ||||
|     MOVOU  80(DX), X14        // X14 = m[10]+m[11] | ||||
|     MOVOU     X12, X8 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f   // PALIGNR     XMM8, XMM12, 0x8  /*  m[1],  m[0] */
 | ||||
|                 BYTE $0xc4; BYTE $0x08
 | ||||
|     MOVOU     X14, X9 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcd   // PUNPCKHQDQ  XMM9, XMM13  /* m[11], m[5] */
 | ||||
|     MOVOU  16(DX), X12        // X12 =  m[2]+ m[3] | ||||
|     MOVOU  48(DX), X11        // X11 =  m[6]+ m[7] | ||||
|     MOVOU  96(DX), X10        // X10 = m[12]+m[13] | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd4   // PUNPCKLQDQ XMM10, XMM12  /* m[12], m[2] */
 | ||||
|     BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdc   // PUNPCKHQDQ XMM11, XMM12  /*  m[7], m[3] */
 | ||||
| 
 | ||||
|     LOAD_SHUFFLE | ||||
| 
 | ||||
|     G1 | ||||
|     G2 | ||||
| 
 | ||||
|     UNDIAGONALIZE | ||||
| 
 | ||||
|     // Reload digest (most current value is stored in &out) | ||||
|     MOVQ   out+144(FP),  SI     // SI: &out | ||||
|     MOVOU   0(SI), X12        // X12 = out[0]+out[1]    /* row1l = LOAD( &S->h[0] ); */
 | ||||
|     MOVOU  16(SI), X13        // X13 = out[2]+out[3]    /* row1h = LOAD( &S->h[2] ); */
 | ||||
|     MOVOU  32(SI), X14        // X14 = out[4]+out[5]    /* row2l = LOAD( &S->h[4] ); */
 | ||||
|     MOVOU  48(SI), X15        // X15 = out[6]+out[7]    /* row2h = LOAD( &S->h[6] ); */
 | ||||
| 
 | ||||
|     // Final computations and prepare for storing | ||||
|     PXOR   X4,  X0           // X0 = X0 ^ X4          /* row1l = _mm_xor_si128( row3l, row1l ); */
 | ||||
|     PXOR   X5,  X1           // X1 = X1 ^ X5          /* row1h = _mm_xor_si128( row3h, row1h ); */
 | ||||
|     PXOR   X12, X0           // X0 = X0 ^ X12         /*  STORE( &S->h[0], _mm_xor_si128( LOAD( &S->h[0] ), row1l ) ); */
 | ||||
|     PXOR   X13, X1           // X1 = X1 ^ X13         /*  STORE( &S->h[2], _mm_xor_si128( LOAD( &S->h[2] ), row1h ) ); */
 | ||||
|     PXOR   X6,  X2           // X2 = X2 ^ X6          /*  row2l = _mm_xor_si128( row4l, row2l ); */
 | ||||
|     PXOR   X7,  X3           // X3 = X3 ^ X7          /*  row2h = _mm_xor_si128( row4h, row2h ); */
 | ||||
|     PXOR   X14, X2           // X2 = X2 ^ X14         /*  STORE( &S->h[4], _mm_xor_si128( LOAD( &S->h[4] ), row2l ) ); */
 | ||||
|     PXOR   X15, X3           // X3 = X3 ^ X15         /*  STORE( &S->h[6], _mm_xor_si128( LOAD( &S->h[6] ), row2h ) ); */
 | ||||
| 
 | ||||
|     // Store digest into &out | ||||
|     MOVQ  out+144(FP), SI     // SI: &out | ||||
|     MOVOU  X0,  0(SI)         // out[0]+out[1] = X0 | ||||
|     MOVOU  X1, 16(SI)         // out[2]+out[3] = X1 | ||||
|     MOVOU  X2, 32(SI)         // out[4]+out[5] = X2 | ||||
|     MOVOU  X3, 48(SI)         // out[6]+out[7] = X3 | ||||
| 
 | ||||
|     // Increment message pointer and check if there's more to do | ||||
|     ADDQ   $128, DX           // message += 128 | ||||
|     SUBQ   $1, R8 | ||||
|     JNZ    loop | ||||
| 
 | ||||
| complete: | ||||
|     RET | ||||
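
The byte-encoded shuffles in the rounds above are easier to follow with a small model. The following is a minimal Go sketch, not part of the vendored package: `vec`, `load`, `round12FirstHalf` and `feedForward` are hypothetical names introduced only for illustration. It treats each XMM register as two 64-bit lanes, reproduces the first message load of round 12 as commented in the assembly, and shows the final feed-forward XOR that the `PXOR` sequence appears to implement.

```go
package main

import "fmt"

// vec models one 128-bit XMM register as two 64-bit lanes:
// index 0 is the low quadword, index 1 is the high quadword.
type vec [2]uint64

// punpcklqdq models PUNPCKLQDQ dst, src: result = (dst.low, src.low).
func punpcklqdq(dst, src vec) vec { return vec{dst[0], src[0]} }

// punpckhqdq models PUNPCKHQDQ dst, src: result = (dst.high, src.high).
func punpckhqdq(dst, src vec) vec { return vec{dst[1], src[1]} }

// palignr8 models PALIGNR dst, src, 0x8: result = (src.high, dst.low).
func palignr8(dst, src vec) vec { return vec{src[1], dst[0]} }

// load models a 16-byte MOVOU from the message, starting at word i.
// (Hypothetical helper; the assembly uses byte offsets from DX.)
func load(m *[16]uint64, i int) vec { return vec{m[i], m[i+1]} }

// round12FirstHalf mirrors the first message load of round 12 and
// returns the values that end up in X8, X9, X10 and X11.
func round12FirstHalf(m *[16]uint64) [4]vec {
	x12 := load(m, 14) // MOVOU 112(DX), X12
	x13 := load(m, 4)  // MOVOU  32(DX), X13
	x14 := load(m, 8)  // MOVOU  64(DX), X14
	x15 := load(m, 12) // MOVOU  96(DX), X15
	x8 := punpcklqdq(x12, x13)
	x9 := punpckhqdq(x14, x15)
	x10 := punpcklqdq(load(m, 10), x14)
	x11 := palignr8(load(m, 6), x12)
	return [4]vec{x8, x9, x10, x11}
}

// feedForward models the closing PXOR sequence: h[i] ^= v[i] ^ v[i+8].
func feedForward(h *[8]uint64, v *[16]uint64) {
	for i := range h {
		h[i] ^= v[i] ^ v[i+8]
	}
}

func main() {
	var m [16]uint64
	for i := range m {
		m[i] = uint64(i) // m[i] = i, so each lane prints as its message index
	}
	fmt.Println(round12FirstHalf(&m)) // [[14 4] [9 13] [10 8] [15 6]]
}
```

With `m[i] = i`, the printed lanes match the index pairs in the assembly comments (`m[14], m[4]`, `m[9], m[13]`, `m[10], m[8]`, `m[15], m[6]`), which is a quick way to cross-check a hand-encoded `BYTE` sequence against its comment.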
							
								
								
									
										50
									
								
								vendor/github.com/minio/blake2b-simd/compress_amd64.go
									
									
									
										generated
									
									
										vendored
									
									
								
							
							
						
						
									
							| @ -1,6 +1,3 @@ | ||||
| //+build !noasm | ||||
| //+build !appengine | ||||
| 
 | ||||
| /* | ||||
|  * Minio Cloud Storage, (C) 2016 Minio, Inc. | ||||
|  * | ||||
| @ -19,44 +16,15 @@ | ||||
| 
 | ||||
| package blake2b | ||||
| 
 | ||||
| //go:noescape | ||||
| func blockAVX(p []uint8, in, iv, t, f, shffle, out []uint64) | ||||
| 
 | ||||
| func compressAVX(d *digest, p []uint8) { | ||||
| 	h0, h1, h2, h3, h4, h5, h6, h7 := d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7] | ||||
| 
 | ||||
| 	in := make([]uint64, 8, 8) | ||||
| 	out := make([]uint64, 8, 8) | ||||
| 
 | ||||
| 	shffle := make([]uint64, 2, 2) | ||||
| 	// vector for PSHUFB instruction | ||||
| 	shffle[0] = 0x0201000706050403 | ||||
| 	shffle[1] = 0x0a09080f0e0d0c0b | ||||
| 
 | ||||
| 	for len(p) >= BlockSize { | ||||
| 		// Increment counter. | ||||
| 		d.t[0] += BlockSize | ||||
| 		if d.t[0] < BlockSize { | ||||
| 			d.t[1]++ | ||||
| 		} | ||||
| 
 | ||||
| 		in[0], in[1], in[2], in[3], in[4], in[5], in[6], in[7] = h0, h1, h2, h3, h4, h5, h6, h7 | ||||
| 
 | ||||
| 		blockAVX(p, in, iv[:], d.t[:], d.f[:], shffle, out) | ||||
| 
 | ||||
| 		h0, h1, h2, h3, h4, h5, h6, h7 = out[0], out[1], out[2], out[3], out[4], out[5], out[6], out[7] | ||||
| 
 | ||||
| 		p = p[BlockSize:] | ||||
| 	} | ||||
| 
 | ||||
| 	d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7] = h0, h1, h2, h3, h4, h5, h6, h7 | ||||
| } | ||||
| 
 | ||||
| func compress(d *digest, p []uint8) { | ||||
| 	// Verifies if AVX is available, use optimized code path. | ||||
| 	if avx { | ||||
| 	// Verifies if AVX2 or AVX is available, use optimized code path. | ||||
| 	if avx2 { | ||||
| 		compressAVX2(d, p) | ||||
| 	} else if avx { | ||||
| 		compressAVX(d, p) | ||||
| 		return | ||||
| 	} // else { fallback to generic approach. | ||||
| 	compressGeneric(d, p) | ||||
| 	} else if ssse3 { | ||||
| 		compressSSE(d, p) | ||||
| 	} else { | ||||
| 		compressGeneric(d, p) | ||||
| 	} | ||||
| } | ||||
|  | ||||
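
Because the rendered hunk above interleaves removed and added lines without `+`/`-` markers, the resulting dispatch order can be hard to read. The sketch below shows the selection order the new `compress` appears to use (AVX2, then AVX, then SSSE3, then the generic fallback). It is an illustration only: `features` and `pickCompress` are hypothetical names, and the function returns the name of the chosen routine instead of calling it.

```go
package main

import "fmt"

// features is a hypothetical stand-in for the package-level
// avx2, avx and ssse3 variables set from CPUID at start-up.
type features struct {
	avx2, avx, ssse3 bool
}

// pickCompress returns the name of the compression routine that would be
// selected, preferring the widest instruction set that is available.
func pickCompress(f features) string {
	switch {
	case f.avx2:
		return "compressAVX2"
	case f.avx:
		return "compressAVX"
	case f.ssse3:
		return "compressSSE"
	default:
		return "compressGeneric"
	}
}

func main() {
	fmt.Println(pickCompress(features{avx: true, ssse3: true})) // compressAVX
	fmt.Println(pickCompress(features{}))                       // compressGeneric
}
```

Note that, read this way, the old early `return` after `compressAVX` is no longer needed, since every branch is now part of one `if`/`else` chain.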
							
								
								
									
										21
									
								
								vendor/github.com/minio/blake2b-simd/compress_noasm.go
									
									
									
										generated
									
									
										vendored
									
									
								
							
							
						
						
									
							| @ -1,11 +1,20 @@ | ||||
| //+build !amd64 noasm appengine | ||||
| 
 | ||||
| // Written in 2012 by Dmitry Chestnykh. | ||||
| // | ||||
| // To the extent possible under law, the author have dedicated all copyright | ||||
| // and related and neighboring rights to this software to the public domain | ||||
| // worldwide. This software is distributed without any warranty. | ||||
| // http://creativecommons.org/publicdomain/zero/1.0/ | ||||
| /* | ||||
|  * Minio Cloud Storage, (C) 2016 Minio, Inc. | ||||
|  * | ||||
|  * Licensed under the Apache License, Version 2.0 (the "License"); | ||||
|  * you may not use this file except in compliance with the License. | ||||
|  * You may obtain a copy of the License at | ||||
|  * | ||||
|  *     http://www.apache.org/licenses/LICENSE-2.0 | ||||
|  * | ||||
|  * Unless required by applicable law or agreed to in writing, software | ||||
|  * distributed under the License is distributed on an "AS IS" BASIS, | ||||
|  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
|  * See the License for the specific language governing permissions and | ||||
|  * limitations under the License. | ||||
|  */ | ||||
| 
 | ||||
| package blake2b | ||||
| 
 | ||||
|  | ||||
							
								
								
									
										25
									
								
								vendor/github.com/minio/blake2b-simd/cpuid.go
									
									
									
										generated
									
									
										vendored
									
									
								
							
							
						
						
									
							| @ -18,12 +18,15 @@ | ||||
| package blake2b | ||||
| 
 | ||||
| func cpuid(op uint32) (eax, ebx, ecx, edx uint32) | ||||
| func cpuidex(op, op2 uint32) (eax, ebx, ecx, edx uint32) | ||||
| func xgetbv(index uint32) (eax, edx uint32) | ||||
| 
 | ||||
| // True when SIMD instructions are available. | ||||
| var avx2 = haveAVX2() | ||||
| var avx = haveAVX() | ||||
| var ssse3 = haveSSSE3() | ||||
| 
 | ||||
| // haveSSE returns true if we have streaming SIMD instructions. | ||||
| // haveAVX returns true when there is AVX support | ||||
| func haveAVX() bool { | ||||
| 	_, _, c, _ := cpuid(1) | ||||
| 
 | ||||
| @ -35,3 +38,23 @@ func haveAVX() bool { | ||||
| 	} | ||||
| 	return false | ||||
| } | ||||
| 
 | ||||
| // haveAVX2 returns true when there is AVX2 support | ||||
| func haveAVX2() bool { | ||||
| 	mfi, _, _, _ := cpuid(0) | ||||
| 
 | ||||
| 	// Check AVX2, AVX2 requires OS support, but BMI1/2 don't. | ||||
| 	if mfi >= 7 && haveAVX() { | ||||
| 		_, ebx, _, _ := cpuidex(7, 0) | ||||
| 		return (ebx & 0x00000020) != 0 | ||||
| 	} | ||||
| 	return false | ||||
| } | ||||
| 
 | ||||
| // haveSSSE3 returns true when there is SSSE3 support | ||||
| func haveSSSE3() bool { | ||||
| 
 | ||||
| 	_, _, c, _ := cpuid(1) | ||||
| 
 | ||||
| 	return (c & 0x00000200) != 0 | ||||
| } | ||||
|  | ||||
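
The two masks used above correspond to documented CPUID feature bits: `0x00000200` is bit 9 of ECX for leaf 1 (SSSE3) and `0x00000020` is bit 5 of EBX for leaf 7, sub-leaf 0 (AVX2). The following sketch decodes them from register values passed in as plain data, so it runs without assembly. `cpuidRegs`, `hasSSSE3` and `hasAVX2` are hypothetical names, and the AVX check (whose body is elided in this diff) is represented by a boolean parameter rather than reproduced.

```go
package main

import "fmt"

// cpuidRegs holds the four registers returned by one CPUID invocation.
// (Hypothetical type; the package returns them as four separate values.)
type cpuidRegs struct {
	eax, ebx, ecx, edx uint32
}

const (
	ssse3Bit = 1 << 9 // CPUID leaf 1, ECX bit 9 (mask 0x00000200)
	avx2Bit  = 1 << 5 // CPUID leaf 7 sub-leaf 0, EBX bit 5 (mask 0x00000020)
)

// hasSSSE3 reports whether the leaf-1 result advertises SSSE3.
func hasSSSE3(leaf1 cpuidRegs) bool {
	return leaf1.ecx&ssse3Bit != 0
}

// hasAVX2 reports whether AVX2 is usable. maxLeaf is EAX from CPUID leaf 0;
// avxOK stands in for the result of the package's AVX check.
func hasAVX2(maxLeaf uint32, avxOK bool, leaf7 cpuidRegs) bool {
	if maxLeaf < 7 || !avxOK {
		return false
	}
	return leaf7.ebx&avx2Bit != 0
}

func main() {
	fmt.Println(hasSSSE3(cpuidRegs{ecx: 0x00000200}))         // true
	fmt.Println(hasAVX2(13, true, cpuidRegs{ebx: 0x00000020})) // true
	fmt.Println(hasAVX2(6, true, cpuidRegs{ebx: 0x00000020}))  // false
}
```

The last call shows why the leaf-0 check matters: a CPU whose highest supported leaf is below 7 cannot be queried for AVX2 reliably, so the sketch reports `false` regardless of the EBX value supplied.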
							
								
								
									
										21
									
								
								vendor/github.com/minio/blake2b-simd/cpuid_386.s
									
									
									
										generated
									
									
										vendored
									
									
								
							
							
						
						
									
							| @ -13,10 +13,21 @@ TEXT ·cpuid(SB), 7, $0 | ||||
|         MOVL DX, edx+16(FP) | ||||
|         RET | ||||
| 
 | ||||
| // func cpuidex(op, op2 uint32) (eax, ebx, ecx, edx uint32) | ||||
| TEXT ·cpuidex(SB), 7, $0 | ||||
|         MOVL op+0(FP), AX | ||||
|         MOVL op2+4(FP), CX | ||||
|         CPUID | ||||
|         MOVL AX, eax+8(FP) | ||||
|         MOVL BX, ebx+12(FP) | ||||
|         MOVL CX, ecx+16(FP) | ||||
|         MOVL DX, edx+20(FP) | ||||
|         RET | ||||
| 
 | ||||
| // func xgetbv(index uint32) (eax, edx uint32) | ||||
| TEXT ·xgetbv(SB), 7, $0 | ||||
| 	MOVL index+0(FP), CX | ||||
| 	BYTE $0x0f; BYTE $0x01; BYTE $0xd0 // XGETBV
 | ||||
| 	MOVL AX, eax+4(FP) | ||||
| 	MOVL DX, edx+8(FP) | ||||
| 	RET | ||||
|         MOVL index+0(FP), CX | ||||
|         BYTE $0x0f; BYTE $0x01; BYTE $0xd0 // XGETBV
 | ||||
|         MOVL AX, eax+4(FP) | ||||
|         MOVL DX, edx+8(FP) | ||||
|         RET | ||||
|  | ||||
							
								
								
									
										22
									
								
								vendor/github.com/minio/blake2b-simd/cpuid_amd64.s
									
									
									
										generated
									
									
										vendored
									
									
								
							
							
						
						
									
							| @ -13,10 +13,22 @@ TEXT ·cpuid(SB), 7, $0 | ||||
|         MOVL DX, edx+20(FP) | ||||
|         RET | ||||
| 
 | ||||
| 
 | ||||
| // func cpuidex(op, op2 uint32) (eax, ebx, ecx, edx uint32) | ||||
| TEXT ·cpuidex(SB), 7, $0 | ||||
|         MOVL op+0(FP), AX | ||||
|         MOVL op2+4(FP), CX | ||||
|         CPUID | ||||
|         MOVL AX, eax+8(FP) | ||||
|         MOVL BX, ebx+12(FP) | ||||
|         MOVL CX, ecx+16(FP) | ||||
|         MOVL DX, edx+20(FP) | ||||
|         RET | ||||
| 
 | ||||
| // func xgetbv(index uint32) (eax, edx uint32) | ||||
| TEXT ·xgetbv(SB), 7, $0 | ||||
| 	MOVL index+0(FP), CX | ||||
| 	BYTE $0x0f; BYTE $0x01; BYTE $0xd0 // XGETBV
 | ||||
| 	MOVL AX, eax+8(FP) | ||||
| 	MOVL DX, edx+12(FP) | ||||
| 	RET | ||||
|         MOVL index+0(FP), CX | ||||
|         BYTE $0x0f; BYTE $0x01; BYTE $0xd0 // XGETBV
 | ||||
|         MOVL AX, eax+8(FP) | ||||
|         MOVL DX, edx+12(FP) | ||||
|         RET | ||||
|  | ||||
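
The `FP` offsets differ between the 386 and amd64 stubs because, in Go's stack-based assembly calling convention, the result area starts at an offset rounded up to the machine word size. The sketch below is a simplified model of that rule for functions whose arguments and results are all `uint32`; `resultOffsets` is a hypothetical helper, and the model is only claimed to be consistent with the offsets visible in this diff.

```go
package main

import "fmt"

// resultOffsets returns the FP offsets of nResults uint32 results for a
// function taking nArgs uint32 arguments, assuming the result area starts
// at the argument size rounded up to wordSize (4 on 386, 8 on amd64).
func resultOffsets(nArgs, nResults, wordSize int) []int {
	off := nArgs * 4
	off = (off + wordSize - 1) / wordSize * wordSize // round up to word size
	out := make([]int, nResults)
	for i := range out {
		out[i] = off + 4*i
	}
	return out
}

func main() {
	fmt.Println(resultOffsets(1, 2, 4)) // xgetbv on 386:    [4 8]
	fmt.Println(resultOffsets(1, 2, 8)) // xgetbv on amd64:  [8 12]
	fmt.Println(resultOffsets(2, 4, 4)) // cpuidex on 386:   [8 12 16 20]
	fmt.Println(resultOffsets(2, 4, 8)) // cpuidex on amd64: [8 12 16 20]
}
```

This matches the stubs above: `cpuidex` takes two `uint32` arguments, so its results begin at offset 8 on both architectures, while `xgetbv` takes one and therefore differs (`eax+4(FP)` on 386, `eax+8(FP)` on amd64).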
							
								
								
									
										4
									
								
								vendor/vendor.json
									
									
									
									
										vendored
									
									
								
							
							
						
						
									
							| @ -84,8 +84,8 @@ | ||||
| 		}, | ||||
| 		{ | ||||
| 			"path": "github.com/minio/blake2b-simd", | ||||
| 			"revision": "0b3e695ecc77a334fafe30ee36de504c41ec4d6a", | ||||
| 			"revisionTime": "2016-06-28T02:55:56-07:00" | ||||
| 			"revision": "25efc542f2c5064cf312cdca043790a7af861c4c", | ||||
| 			"revisionTime": "2016-07-06T10:29:24+02:00" | ||||
| 		}, | ||||
| 		{ | ||||
| 			"path": "github.com/minio/cli", | ||||
|  | ||||