@iota/pearl-diver-react-native
Version:
Transaction nonce searcher for IOTA apps built with react-native.
278 lines (215 loc) • 15.8 kB
Markdown
# PCurl
Fast reference implementation of vectorized IOTA Curl transform function; tests and benchmarks.
# Design overview
PCurl runs `n` Curl instances at once following SIMD idea. A trit is represented with 2 bits `(low,high)`; there are `4! = 24` such representations.
`n` trits thus can be packed into two `word`s: one containing `n` `low` bits and the other -- `n` `high` bits, provided `n` bits fit `word`.
A pair of `word`s is called `dword`. PCurl `state` consists of 743 `dwords`. The building block for a Sponge construction is `transform : state -> state` mapping which
applies simpler transform `pcurl_sbox : state -> state` `r` times. The following property can be stated regarding `i`-th output state trit of `pcurl_sbox`:
```
pcurl_sbox(s)[i] == pcurl_s2(s[a(i)], s[b(i)]),
```
where `pcurl_s2 : dword x dword -> dword` is a building block of PCurl that mixes two input trits into one output trit. `a, b : Z_743 -> Z_743` are just fixed
permutations that determine dependencies between input and output trits. The circuit evaluating `pcurl_s2` depends on trit representation and on availability
of corresponding processor instructions.
# Optimizations
There are a number of optimization parameters that can be selected at compile time. They are compared against the "basic" PCurl implementation in iotaledger/iota_common.
## `PTRIT_CVT`
Trit representation.
- `PTRIT_CVT_ANDN`: `-1 -> (1,0)`, `0 -> (1,1)`, `+1 -> (0,1)`; optimizes `pcurl_s2` for `andn` instruction; used in the "basic" PCurl impl.
- `PTRIT_CVT_ORN`: `-1 -> (0,0)`, `0 -> (0,1)`, `+1 -> (1,1)`; optimizes `pcurl_s2` for `orn` instruction.
## `PTRIT_PLATFORM`
A straight-forward idea is to SIMD intrinsics to extend capacity of `word` and use available instructions for circuit evaluation.
- `PTRIT_NEON`: 128-bit NEON intrinsics on ARMv7-a targets; enforces the use of `PTRIT_CVT_ORN`.
- `PTRIT_AVX512`: 512-bit AVX512F intrinsics on Intel/AMD targets; enforces the use of `PTRIT_CVT_ANDN`.
- `PTRIT_AVX2`: 256-bit AVX2 intrinsics on Intel/AMD targets; enforces the use of `PTRIT_CVT_ANDN`.
- `PTRIT_SSE2`: 128-bit SSE2 intrinsics on Intel/AMD targets; enforces the use of `PTRIT_CVT_ANDN`.
- `PTRIT_64`: `word` is just 64-bit `uint64_t` type and standard C syntax is used to express `pcurl_s2` circuit.
## `PCURL_S2_CIRCUIT`
Enables the use of certain `pcurl_s2` circuit.
- `PCURL_S2_CIRCUIT4`: optimized 4-instruction circuits found with `circuit-solver`.
With `PTRIT_CVT_ANDN` -- `(Xor AH (Andn BL AL),Xor AL (And BH (Xor AH (Andn BL AL))))`.
With `PTRIT_CVT_OR` -- `(Xor AH (Orn BL AL),Xor AL (Orn BH (Xor AH (Orn BL AL))))`.
- `PCURL_S2_CIRCUIT5`: 5-instruction circuit used in the "basic" PCurl impl.
This option is not available and is listed here for the sake of completeness. `PCURL_S2_CIRCUIT4` is permanently enabled.
## `PCURL_S2_ARGS`
Determines the way arguments `s[a(i)]` and `s[b(i)]` to `pcurl_s2` are selected, ie. the way `a` and `b` permutations are implemented.
- `PCURL_S2_ARGS_PTR`: through pointer arithmetic.
- `PCURL_S2_ARGS_LUT`: through look-up table as in the "basic" PCurl impl.
This option is not available and is listed here for the sake of completeness. `PCURL_S2_ARGS_PTR` is permanently enabled.
## `PCURL_SBOX_UNWIND`
Unwind factor for `pcurl_sbox` internal loop, ie. the number of call to `pcurl_s2` per loop. Not all unwind factors are allowed with certain configurations.
- `PCURL_SBOX_UNWIND_1`: 1, ie. do not unwind.
- `PCURL_SBOX_UNWIND_2`: 2.
- `PCURL_SBOX_UNWIND_4`: 4.
- `PCURL_SBOX_UNWIND_8`: 8.
## `PCURL_STATE`
`pcurl_sbox` requires additional memory as output trits are highly interconnected with input.
- `PCURL_STATE_SHORT`: use roughly half of state as additional. It is achieved through different `pcurl_sbox`es as they access memory in different fashion.
- `PCURL_STATE_DOUBLE`: trivially use a full state as additional as in the "basic" PCurl impl.
## TODO: Fuse rounds of `pcurl_sbox`
Make a fused `pcurl_sboxN` for `N=2,3,4`, find an optimized `pcurl_s2xN` circuit with `2^N` input arguments, determine pointer schedule for `pcurl_sbox`.
# Build and Run
See [bench_arm.sh](bench_arm.sh), [bench_x64_msvc.bat](bench_x64_msvc.bat), and [bench_x64_gcc.bat](bench_x64_gcc.bat) for examples. You might need to adjust your `PATH` environment variable.
# Benchmark results
`Time` is measured using `clock` function, benchmark results have quite significant variance due to platform specifics (mobile PC, noisy environment).
Time presented is "average" of several benchmark runs. `Speed` is measured in transactions per second. Transaction size is 8019 trits, or 33 `rate`s,
ie. it takes 33 transforms to hash a transaction. SIMD vectorization factor is to the left in the `Distance` column.
Curl-81 was used to hash transactions in the benchmarks.
Short configuration has the following format:
```
<compiler>_<platform>_<cvt>_<circuit>_<s2_args>_<unwind>_<state>
```
where
- `<compiler>`: `gcc`, `clang`, or `msvc`; version omitted.
- `<platform>`: `neon`=`PTRIT_NEON`, `avx512`=`PTRIT_AVX512`, `avx2`=`PTRIT_AVX2`, `sse2`=`PTRIT_SSE2`, `64`=`PTRIT_64`.
- `<cvt>`: `andn`=`PTRIT_CVT_ANDN`, `orn`=`PTRIT_CVT_ORN`.
- `<circuit>`: `c4`=`PCURL_S2_CIRCUIT4`, `c5`=`PCURL_S2_CIRCUIT5`.
- `<s2_args>`: `ptr`=`PCURL_S2_ARGS_PTR`, `lut`=`PCURL_S2_ARGS_LUT`.
- `<unwind>`: `uX`=`PCURL_SBOX_UNWIND_X`.
- `<state>`: `ss`=`PCURL_STATE_SHORT`, `sd`=`PCURL_STATE_DOUBLE`.
## MacBook Pro
Environment: 2,6 GHz Intel Core i7, 2400 MHz DDR4, Darwin 18.7.0, Apple LLVM 10.0.1.
*Speedup*: 3.56x.
| Configuration | Speed, tx/s | Distance, tx | Time, ms |
| ------------------------------ | ----------: | ------------: | -------: |
| `gcc_avx2_andn_c4_ptr_u8_sd` | 108598 tx/s | 256 x 2500 tx | 5893 ms |
| | 108864 tx/s | 256 x 2500 tx | 5878 ms |
| | 109836 tx/s | 256 x 2500 tx | 5826 ms |
| `gcc_avx2_andn_c4_ptr_u4_ss` | 107704 tx/s | 256 x 2500 tx | 5942 ms |
| | 107957 tx/s | 256 x 2500 tx | 5928 ms |
| | 107222 tx/s | 256 x 2500 tx | 5968 ms |
| `gcc_avx2_andn_c4_ptr_u4_sd` | 105338 tx/s | 256 x 2500 tx | 6075 ms |
| | 106084 tx/s | 256 x 2500 tx | 6032 ms |
| | 106209 tx/s | 256 x 2500 tx | 6025 ms |
| `gcc_avx2_andn_c4_ptr_u2_sd` | 102943 tx/s | 256 x 2500 tx | 6217 ms |
| | 102129 tx/s | 256 x 2500 tx | 6266 ms |
| | 102387 tx/s | 256 x 2500 tx | 6250 ms |
| `gcc_sse2_andn_c4_ptr_u4_ss` | 74331 tx/s | 128 x 5000 tx | 8610 ms |
| | 74476 tx/s | 128 x 5000 tx | 8593 ms |
| | 75133 tx/s | 128 x 5000 tx | 8518 ms |
| `gcc_64_andn_c4_ptr_u4_ss` | 37366 tx/s | 64 x 10000 tx | 17127 ms |
| | 37691 tx/s | 64 x 10000 tx | 16980 ms |
| | 37681 tx/s | 64 x 10000 tx | 16984 ms |
| `gcc_64_andn_c5_lut_u1_sd` | 30482 tx/s | 64 x 10000 tx | 20995 ms |
| | 30497 tx/s | 64 x 10000 tx | 20985 ms |
| | 29735 tx/s | 64 x 10000 tx | 21523 ms |
## Laptop ThinkPad T-490s
Environment: Intel Core i7-8565U, Ubuntu 18.04, gcc-7.4.0.
*Speedup*: 3.75x.
| Configuration | Speed, tx/s | Distance, tx | Time, ms |
| ------------------------------ | ---------: | ------------: | -------: |
| `gcc_avx2_andn_c4_ptr_u8_sd` | 95345 tx/s | 256 x 2500 tx | 6712 ms |
| | 91675 tx/s | 256 x 2500 tx | 6981 ms |
| | 88421 tx/s | 256 x 2500 tx | 7238 ms |
| `gcc_avx2_andn_c4_ptr_u4_ss` | 89393 tx/s | 256 x 2500 tx | 7159 ms |
| | 88847 tx/s | 256 x 2500 tx | 7203 ms |
| | 90907 tx/s | 256 x 2500 tx | 7040 ms |
| `gcc_avx2_andn_c4_ptr_u2_sd` | 82903 tx/s | 256 x 2500 tx | 7719 ms |
| | 84771 tx/s | 256 x 2500 tx | 7549 ms |
| | 83058 tx/s | 256 x 2500 tx | 7705 ms |
| `gcc_avx2_andn_c4_ptr_u4_sd` | 88255 tx/s | 256 x 2500 tx | 7251 ms |
| | 85882 tx/s | 256 x 2500 tx | 7452 ms |
| | 83931 tx/s | 256 x 2500 tx | 7625 ms |
| `gcc_sse2_andn_c4_ptr_u4_ss` | 70663 tx/s | 128 x 5000 tx | 9057 ms |
| | 71107 tx/s | 128 x 5000 tx | 9000 ms |
| | 72600 tx/s | 128 x 5000 tx | 8815 ms |
| `gcc_64_andn_c4_ptr_u4_ss` | 42553 tx/s | 64 x 10000 tx | 15039 ms |
| | 40221 tx/s | 64 x 10000 tx | 15912 ms |
| | 39609 tx/s | 64 x 10000 tx | 16157 ms |
| `gcc_64_andn_c5_lut_u1_sd` | 24019 tx/s | 64 x 10000 tx | 26645 ms |
| | 24765 tx/s | 64 x 10000 tx | 25842 ms |
| | 24478 tx/s | 64 x 10000 tx | 26145 ms |
## Laptop HP ProBook
Environment: Intel Core i7 4510U, Windows 8.1, msvc-14.1, gcc-8.0.1.
AVX512 implementation was tested with Intel sde-8.35.0.
*Speedup*: 2.71x.
| Configuration | Speed, tx/s | Distance, tx | Time, ms |
| ------------------------------ | ---------: | ------------: | -------: |
| `gcc_avx512_andn_c4_ptr_u4_ss` | 575 tx/s | 512 x 10 tx | 8897 ms |
| | 579 tx/s | 512 x 10 tx | 8836 ms |
| | 572 tx/s | 512 x 10 tx | 8940 ms |
| `gcc_sse2_andn_c4_ptr_u4_ss` | 55541 tx/s | 128 x 5000 tx | 11523 ms |
| | 48724 tx/s | 128 x 5000 tx | 13135 ms |
| | 46299 tx/s | 128 x 5000 tx | 13823 ms |
| `msvc_sse2_andn_c4_ptr_u4_ss` | 48799 tx/s | 128 x 5000 tx | 13115 ms |
| | 48167 tx/s | 128 x 5000 tx | 13287 ms |
| | 46879 tx/s | 128 x 5000 tx | 13652 ms |
| `msvc_avx2_andn_c4_ptr_u4_ss` | 47672 tx/s | 256 x 2500 tx | 13425 ms |
| | 47693 tx/s | 256 x 2500 tx | 13419 ms |
| | 47038 tx/s | 256 x 2500 tx | 13606 ms |
| `gcc_avx2_andn_c4_ptr_u4_ss` | 47107 tx/s | 256 x 2500 tx | 13586 ms |
| | 46931 tx/s | 256 x 2500 tx | 13637 ms |
| | 47368 tx/s | 256 x 2500 tx | 13511 ms |
| `msvc_avx2_andn_c4_ptr_u4_sd` | 45133 tx/s | 256 x 2500 tx | 14180 ms |
| | 45045 tx/s | 256 x 2500 tx | 14208 ms |
| | 45169 tx/s | 256 x 2500 tx | 14169 ms |
| `msvc_avx2_andn_c4_ptr_u2_sd` | 43354 tx/s | 256 x 2500 tx | 14762 ms |
| | 43950 tx/s | 256 x 2500 tx | 14562 ms |
| | 43281 tx/s | 256 x 2500 tx | 14787 ms |
| `gcc_avx2_andn_c4_ptr_u2_sd` | 42780 tx/s | 256 x 2500 tx | 14960 ms |
| | 42108 tx/s | 256 x 2500 tx | 15199 ms |
| | 37941 tx/s | 256 x 2500 tx | 16868 ms |
| `gcc_avx2_andn_c4_ptr_u8_sd` | 41674 tx/s | 256 x 2500 tx | 15357 ms |
| | 40145 tx/s | 256 x 2500 tx | 15942 ms |
| | 36527 tx/s | 256 x 2500 tx | 17521 ms |
| `gcc_avx2_andn_c4_ptr_u4_sd` | 39438 tx/s | 256 x 2500 tx | 16228 ms |
| | 39885 tx/s | 256 x 2500 tx | 16046 ms |
| | 40231 tx/s | 256 x 2500 tx | 15908 ms |
| `msvc_64_andn_c4_ptr_u4_ss` | 24781 tx/s | 64 x 10000 tx | 25826 ms |
| | 29723 tx/s | 64 x 10000 tx | 21532 ms |
| | 29684 tx/s | 64 x 10000 tx | 21560 ms |
| `gcc_64_andn_c4_ptr_u4_ss` | 28567 tx/s | 64 x 10000 tx | 22403 ms |
| | 28731 tx/s | 64 x 10000 tx | 22275 ms |
| | 28519 tx/s | 64 x 10000 tx | 22441 ms |
| `gcc_64_andn_c5_lut_u1_sd` | 18489 tx/s | 64 x 10000 tx | 34615 ms |
| | 18003 tx/s | 64 x 10000 tx | 35549 ms |
| | 16341 tx/s | 64 x 10000 tx | 39164 ms |
| `msvc_64_andn_c5_lut_u1_sd` | 16291 tx/s | 64 x 10000 tx | 39284 ms |
| | 16296 tx/s | 64 x 10000 tx | 39272 ms |
| | 16044 tx/s | 64 x 10000 tx | 39890 ms |
## Raspberry Pi 3
Environment: ARMv7l, raspbian (4.4.13-v7+), gcc-6.1.0.
*Speedup*: 1.90x.
| Configuration | Speed, tx/s | Distance, tx | Time, ms |
| ------------------------------ | ---------- | ------------- | -------- |
| `gcc_64_andn_c4_ptr_u8_sd` | 3494 tx/s | 64 x 1000 tx | 18316 ms |
| | 3492 tx/s | 64 x 1000 tx | 18323 ms |
| | 3487 tx/s | 64 x 1000 tx | 18348 ms |
| `gcc_64_andn_c4_ptr_u4_ss` | 3320 tx/s | 64 x 1000 tx | 19273 ms |
| | 3329 tx/s | 64 x 1000 tx | 19219 ms |
| | 3331 tx/s | 64 x 1000 tx | 19210 ms |
| `gcc_neon_orn_c4_ptr_u4_ss` | 3303 tx/s | 128 x 500 tx | 19371 ms |
| | 3306 tx/s | 128 x 500 tx | 19355 ms |
| | 3305 tx/s | 128 x 500 tx | 19359 ms |
| `gcc_neon_orn_c4_ptr_u8_sd` | 3270 tx/s | 128 x 500 tx | 19571 ms |
| | 3271 tx/s | 128 x 500 tx | 19561 ms |
| | 3271 tx/s | 128 x 500 tx | 19560 ms |
| `gcc_64_orn_c4_ptr_u4_ss` | 2738 tx/s | 64 x 1000 tx | 23367 ms |
| | 2723 tx/s | 64 x 1000 tx | 23501 ms |
| | 2732 tx/s | 64 x 1000 tx | 23419 ms |
| `gcc_64_andn_c5_lut_u1_sd` | 1832 tx/s | 64 x 1000 tx | 34924 ms |
| | 1832 tx/s | 64 x 1000 tx | 34923 ms |
| | 1832 tx/s | 64 x 1000 tx | 34925 ms |
## Samsung Grand Prime
Environment: ARMv8, Android 5.1.1, clang-8.0.0.
*Speedup*: 2.37x.
| Configuration | Speed, tx/s | Distance, tx | Time, ms |
| ------------------------------ | ---------- | ------------- | -------- |
| `clang_neon_orn_c4_ptr_u4_ss` | 3345 tx/s | 128 x 500 tx | 19132 ms |
| | 3350 tx/s | 128 x 500 tx | 19099 ms |
| | 3350 tx/s | 128 x 500 tx | 19099 ms |
| `clang_neon_orn_c4_ptr_u8_sd` | 3086 tx/s | 128 x 500 tx | 20733 ms |
| | 3083 tx/s | 128 x 500 tx | 20757 ms |
| | 3069 tx/s | 128 x 500 tx | 20850 ms |
| `clang_64_andn_c4_ptr_u4_ss` | 1791 tx/s | 64 x 1000 tx | 35732 ms |
| | 1791 tx/s | 64 x 1000 tx | 35725 ms |
| | 1790 tx/s | 64 x 1000 tx | 35750 ms |
| `clang_64_orn_c4_ptr_u4_ss` | 1561 tx/s | 64 x 1000 tx | 40991 ms |
| | 1560 tx/s | 64 x 1000 tx | 41018 ms |
| | 1561 tx/s | 64 x 1000 tx | 40981 ms |
| `clang_64_andn_c4_ptr_u8_sd` | 1490 tx/s | 64 x 1000 tx | 42928 ms |
| | 1491 tx/s | 64 x 1000 tx | 42916 ms |
| | 1490 tx/s | 64 x 1000 tx | 42946 ms |
| `clang_64_andn_c5_lut_u1_sd` | 1413 tx/s | 64 x 1000 tx | 45287 ms |
| | 1413 tx/s | 64 x 1000 tx | 45274 ms |
| | 1413 tx/s | 64 x 1000 tx | 45281 ms |