UNPKG

@iota/pearl-diver-react-native

Version:

Transaction nonce searcher for IOTA apps built with react-native.

278 lines (215 loc) 15.8 kB
# PCurl Fast reference implementation of vectorized IOTA Curl transform function; tests and benchmarks. # Design overview PCurl runs `n` Curl instances at once following SIMD idea. A trit is represented with 2 bits `(low,high)`; there are `4! = 24` such representations. `n` trits thus can be packed into two `word`s: one containing `n` `low` bits and the other -- `n` `high` bits, provided `n` bits fit `word`. A pair of `word`s is called `dword`. PCurl `state` consists of 743 `dwords`. The building block for a Sponge construction is `transform : state -> state` mapping which applies simpler transform `pcurl_sbox : state -> state` `r` times. The following property can be stated regarding `i`-th output state trit of `pcurl_sbox`: ``` pcurl_sbox(s)[i] == pcurl_s2(s[a(i)], s[b(i)]), ``` where `pcurl_s2 : dword x dword -> dword` is a building block of PCurl that mixes two input trits into one output trit. `a, b : Z_743 -> Z_743` are just fixed permutations that determine dependencies between input and output trits. The circuit evaluating `pcurl_s2` depends on trit representation and on availability of corresponding processor instructions. # Optimizations There are a number of optimization parameters that can be selected at compile time. They are compared against the "basic" PCurl implementation in iotaledger/iota_common. ## `PTRIT_CVT` Trit representation. - `PTRIT_CVT_ANDN`: `-1 -> (1,0)`, `0 -> (1,1)`, `+1 -> (0,1)`; optimizes `pcurl_s2` for `andn` instruction; used in the "basic" PCurl impl. - `PTRIT_CVT_ORN`: `-1 -> (0,0)`, `0 -> (0,1)`, `+1 -> (1,1)`; optimizes `pcurl_s2` for `orn` instruction. ## `PTRIT_PLATFORM` A straight-forward idea is to SIMD intrinsics to extend capacity of `word` and use available instructions for circuit evaluation. - `PTRIT_NEON`: 128-bit NEON intrinsics on ARMv7-a targets; enforces the use of `PTRIT_CVT_ORN`. - `PTRIT_AVX512`: 512-bit AVX512F intrinsics on Intel/AMD targets; enforces the use of `PTRIT_CVT_ANDN`. - `PTRIT_AVX2`: 256-bit AVX2 intrinsics on Intel/AMD targets; enforces the use of `PTRIT_CVT_ANDN`. - `PTRIT_SSE2`: 128-bit SSE2 intrinsics on Intel/AMD targets; enforces the use of `PTRIT_CVT_ANDN`. - `PTRIT_64`: `word` is just 64-bit `uint64_t` type and standard C syntax is used to express `pcurl_s2` circuit. ## `PCURL_S2_CIRCUIT` Enables the use of certain `pcurl_s2` circuit. - `PCURL_S2_CIRCUIT4`: optimized 4-instruction circuits found with `circuit-solver`. With `PTRIT_CVT_ANDN` -- `(Xor AH (Andn BL AL),Xor AL (And BH (Xor AH (Andn BL AL))))`. With `PTRIT_CVT_OR` -- `(Xor AH (Orn BL AL),Xor AL (Orn BH (Xor AH (Orn BL AL))))`. - `PCURL_S2_CIRCUIT5`: 5-instruction circuit used in the "basic" PCurl impl. This option is not available and is listed here for the sake of completeness. `PCURL_S2_CIRCUIT4` is permanently enabled. ## `PCURL_S2_ARGS` Determines the way arguments `s[a(i)]` and `s[b(i)]` to `pcurl_s2` are selected, ie. the way `a` and `b` permutations are implemented. - `PCURL_S2_ARGS_PTR`: through pointer arithmetic. - `PCURL_S2_ARGS_LUT`: through look-up table as in the "basic" PCurl impl. This option is not available and is listed here for the sake of completeness. `PCURL_S2_ARGS_PTR` is permanently enabled. ## `PCURL_SBOX_UNWIND` Unwind factor for `pcurl_sbox` internal loop, ie. the number of call to `pcurl_s2` per loop. Not all unwind factors are allowed with certain configurations. - `PCURL_SBOX_UNWIND_1`: 1, ie. do not unwind. - `PCURL_SBOX_UNWIND_2`: 2. - `PCURL_SBOX_UNWIND_4`: 4. - `PCURL_SBOX_UNWIND_8`: 8. ## `PCURL_STATE` `pcurl_sbox` requires additional memory as output trits are highly interconnected with input. - `PCURL_STATE_SHORT`: use roughly half of state as additional. It is achieved through different `pcurl_sbox`es as they access memory in different fashion. - `PCURL_STATE_DOUBLE`: trivially use a full state as additional as in the "basic" PCurl impl. ## TODO: Fuse rounds of `pcurl_sbox` Make a fused `pcurl_sboxN` for `N=2,3,4`, find an optimized `pcurl_s2xN` circuit with `2^N` input arguments, determine pointer schedule for `pcurl_sbox`. # Build and Run See [bench_arm.sh](bench_arm.sh), [bench_x64_msvc.bat](bench_x64_msvc.bat), and [bench_x64_gcc.bat](bench_x64_gcc.bat) for examples. You might need to adjust your `PATH` environment variable. # Benchmark results `Time` is measured using `clock` function, benchmark results have quite significant variance due to platform specifics (mobile PC, noisy environment). Time presented is "average" of several benchmark runs. `Speed` is measured in transactions per second. Transaction size is 8019 trits, or 33 `rate`s, ie. it takes 33 transforms to hash a transaction. SIMD vectorization factor is to the left in the `Distance` column. Curl-81 was used to hash transactions in the benchmarks. Short configuration has the following format: ``` <compiler>_<platform>_<cvt>_<circuit>_<s2_args>_<unwind>_<state> ``` where - `<compiler>`: `gcc`, `clang`, or `msvc`; version omitted. - `<platform>`: `neon`=`PTRIT_NEON`, `avx512`=`PTRIT_AVX512`, `avx2`=`PTRIT_AVX2`, `sse2`=`PTRIT_SSE2`, `64`=`PTRIT_64`. - `<cvt>`: `andn`=`PTRIT_CVT_ANDN`, `orn`=`PTRIT_CVT_ORN`. - `<circuit>`: `c4`=`PCURL_S2_CIRCUIT4`, `c5`=`PCURL_S2_CIRCUIT5`. - `<s2_args>`: `ptr`=`PCURL_S2_ARGS_PTR`, `lut`=`PCURL_S2_ARGS_LUT`. - `<unwind>`: `uX`=`PCURL_SBOX_UNWIND_X`. - `<state>`: `ss`=`PCURL_STATE_SHORT`, `sd`=`PCURL_STATE_DOUBLE`. ## MacBook Pro Environment: 2,6 GHz Intel Core i7, 2400 MHz DDR4, Darwin 18.7.0, Apple LLVM 10.0.1. *Speedup*: 3.56x. | Configuration | Speed, tx/s | Distance, tx | Time, ms | | ------------------------------ | ----------: | ------------: | -------: | | `gcc_avx2_andn_c4_ptr_u8_sd` | 108598 tx/s | 256 x 2500 tx | 5893 ms | | | 108864 tx/s | 256 x 2500 tx | 5878 ms | | | 109836 tx/s | 256 x 2500 tx | 5826 ms | | `gcc_avx2_andn_c4_ptr_u4_ss` | 107704 tx/s | 256 x 2500 tx | 5942 ms | | | 107957 tx/s | 256 x 2500 tx | 5928 ms | | | 107222 tx/s | 256 x 2500 tx | 5968 ms | | `gcc_avx2_andn_c4_ptr_u4_sd` | 105338 tx/s | 256 x 2500 tx | 6075 ms | | | 106084 tx/s | 256 x 2500 tx | 6032 ms | | | 106209 tx/s | 256 x 2500 tx | 6025 ms | | `gcc_avx2_andn_c4_ptr_u2_sd` | 102943 tx/s | 256 x 2500 tx | 6217 ms | | | 102129 tx/s | 256 x 2500 tx | 6266 ms | | | 102387 tx/s | 256 x 2500 tx | 6250 ms | | `gcc_sse2_andn_c4_ptr_u4_ss` | 74331 tx/s | 128 x 5000 tx | 8610 ms | | | 74476 tx/s | 128 x 5000 tx | 8593 ms | | | 75133 tx/s | 128 x 5000 tx | 8518 ms | | `gcc_64_andn_c4_ptr_u4_ss` | 37366 tx/s | 64 x 10000 tx | 17127 ms | | | 37691 tx/s | 64 x 10000 tx | 16980 ms | | | 37681 tx/s | 64 x 10000 tx | 16984 ms | | `gcc_64_andn_c5_lut_u1_sd` | 30482 tx/s | 64 x 10000 tx | 20995 ms | | | 30497 tx/s | 64 x 10000 tx | 20985 ms | | | 29735 tx/s | 64 x 10000 tx | 21523 ms | ## Laptop ThinkPad T-490s Environment: Intel Core i7-8565U, Ubuntu 18.04, gcc-7.4.0. *Speedup*: 3.75x. | Configuration | Speed, tx/s | Distance, tx | Time, ms | | ------------------------------ | ---------: | ------------: | -------: | | `gcc_avx2_andn_c4_ptr_u8_sd` | 95345 tx/s | 256 x 2500 tx | 6712 ms | | | 91675 tx/s | 256 x 2500 tx | 6981 ms | | | 88421 tx/s | 256 x 2500 tx | 7238 ms | | `gcc_avx2_andn_c4_ptr_u4_ss` | 89393 tx/s | 256 x 2500 tx | 7159 ms | | | 88847 tx/s | 256 x 2500 tx | 7203 ms | | | 90907 tx/s | 256 x 2500 tx | 7040 ms | | `gcc_avx2_andn_c4_ptr_u2_sd` | 82903 tx/s | 256 x 2500 tx | 7719 ms | | | 84771 tx/s | 256 x 2500 tx | 7549 ms | | | 83058 tx/s | 256 x 2500 tx | 7705 ms | | `gcc_avx2_andn_c4_ptr_u4_sd` | 88255 tx/s | 256 x 2500 tx | 7251 ms | | | 85882 tx/s | 256 x 2500 tx | 7452 ms | | | 83931 tx/s | 256 x 2500 tx | 7625 ms | | `gcc_sse2_andn_c4_ptr_u4_ss` | 70663 tx/s | 128 x 5000 tx | 9057 ms | | | 71107 tx/s | 128 x 5000 tx | 9000 ms | | | 72600 tx/s | 128 x 5000 tx | 8815 ms | | `gcc_64_andn_c4_ptr_u4_ss` | 42553 tx/s | 64 x 10000 tx | 15039 ms | | | 40221 tx/s | 64 x 10000 tx | 15912 ms | | | 39609 tx/s | 64 x 10000 tx | 16157 ms | | `gcc_64_andn_c5_lut_u1_sd` | 24019 tx/s | 64 x 10000 tx | 26645 ms | | | 24765 tx/s | 64 x 10000 tx | 25842 ms | | | 24478 tx/s | 64 x 10000 tx | 26145 ms | ## Laptop HP ProBook Environment: Intel Core i7 4510U, Windows 8.1, msvc-14.1, gcc-8.0.1. AVX512 implementation was tested with Intel sde-8.35.0. *Speedup*: 2.71x. | Configuration | Speed, tx/s | Distance, tx | Time, ms | | ------------------------------ | ---------: | ------------: | -------: | | `gcc_avx512_andn_c4_ptr_u4_ss` | 575 tx/s | 512 x 10 tx | 8897 ms | | | 579 tx/s | 512 x 10 tx | 8836 ms | | | 572 tx/s | 512 x 10 tx | 8940 ms | | `gcc_sse2_andn_c4_ptr_u4_ss` | 55541 tx/s | 128 x 5000 tx | 11523 ms | | | 48724 tx/s | 128 x 5000 tx | 13135 ms | | | 46299 tx/s | 128 x 5000 tx | 13823 ms | | `msvc_sse2_andn_c4_ptr_u4_ss` | 48799 tx/s | 128 x 5000 tx | 13115 ms | | | 48167 tx/s | 128 x 5000 tx | 13287 ms | | | 46879 tx/s | 128 x 5000 tx | 13652 ms | | `msvc_avx2_andn_c4_ptr_u4_ss` | 47672 tx/s | 256 x 2500 tx | 13425 ms | | | 47693 tx/s | 256 x 2500 tx | 13419 ms | | | 47038 tx/s | 256 x 2500 tx | 13606 ms | | `gcc_avx2_andn_c4_ptr_u4_ss` | 47107 tx/s | 256 x 2500 tx | 13586 ms | | | 46931 tx/s | 256 x 2500 tx | 13637 ms | | | 47368 tx/s | 256 x 2500 tx | 13511 ms | | `msvc_avx2_andn_c4_ptr_u4_sd` | 45133 tx/s | 256 x 2500 tx | 14180 ms | | | 45045 tx/s | 256 x 2500 tx | 14208 ms | | | 45169 tx/s | 256 x 2500 tx | 14169 ms | | `msvc_avx2_andn_c4_ptr_u2_sd` | 43354 tx/s | 256 x 2500 tx | 14762 ms | | | 43950 tx/s | 256 x 2500 tx | 14562 ms | | | 43281 tx/s | 256 x 2500 tx | 14787 ms | | `gcc_avx2_andn_c4_ptr_u2_sd` | 42780 tx/s | 256 x 2500 tx | 14960 ms | | | 42108 tx/s | 256 x 2500 tx | 15199 ms | | | 37941 tx/s | 256 x 2500 tx | 16868 ms | | `gcc_avx2_andn_c4_ptr_u8_sd` | 41674 tx/s | 256 x 2500 tx | 15357 ms | | | 40145 tx/s | 256 x 2500 tx | 15942 ms | | | 36527 tx/s | 256 x 2500 tx | 17521 ms | | `gcc_avx2_andn_c4_ptr_u4_sd` | 39438 tx/s | 256 x 2500 tx | 16228 ms | | | 39885 tx/s | 256 x 2500 tx | 16046 ms | | | 40231 tx/s | 256 x 2500 tx | 15908 ms | | `msvc_64_andn_c4_ptr_u4_ss` | 24781 tx/s | 64 x 10000 tx | 25826 ms | | | 29723 tx/s | 64 x 10000 tx | 21532 ms | | | 29684 tx/s | 64 x 10000 tx | 21560 ms | | `gcc_64_andn_c4_ptr_u4_ss` | 28567 tx/s | 64 x 10000 tx | 22403 ms | | | 28731 tx/s | 64 x 10000 tx | 22275 ms | | | 28519 tx/s | 64 x 10000 tx | 22441 ms | | `gcc_64_andn_c5_lut_u1_sd` | 18489 tx/s | 64 x 10000 tx | 34615 ms | | | 18003 tx/s | 64 x 10000 tx | 35549 ms | | | 16341 tx/s | 64 x 10000 tx | 39164 ms | | `msvc_64_andn_c5_lut_u1_sd` | 16291 tx/s | 64 x 10000 tx | 39284 ms | | | 16296 tx/s | 64 x 10000 tx | 39272 ms | | | 16044 tx/s | 64 x 10000 tx | 39890 ms | ## Raspberry Pi 3 Environment: ARMv7l, raspbian (4.4.13-v7+), gcc-6.1.0. *Speedup*: 1.90x. | Configuration | Speed, tx/s | Distance, tx | Time, ms | | ------------------------------ | ---------- | ------------- | -------- | | `gcc_64_andn_c4_ptr_u8_sd` | 3494 tx/s | 64 x 1000 tx | 18316 ms | | | 3492 tx/s | 64 x 1000 tx | 18323 ms | | | 3487 tx/s | 64 x 1000 tx | 18348 ms | | `gcc_64_andn_c4_ptr_u4_ss` | 3320 tx/s | 64 x 1000 tx | 19273 ms | | | 3329 tx/s | 64 x 1000 tx | 19219 ms | | | 3331 tx/s | 64 x 1000 tx | 19210 ms | | `gcc_neon_orn_c4_ptr_u4_ss` | 3303 tx/s | 128 x 500 tx | 19371 ms | | | 3306 tx/s | 128 x 500 tx | 19355 ms | | | 3305 tx/s | 128 x 500 tx | 19359 ms | | `gcc_neon_orn_c4_ptr_u8_sd` | 3270 tx/s | 128 x 500 tx | 19571 ms | | | 3271 tx/s | 128 x 500 tx | 19561 ms | | | 3271 tx/s | 128 x 500 tx | 19560 ms | | `gcc_64_orn_c4_ptr_u4_ss` | 2738 tx/s | 64 x 1000 tx | 23367 ms | | | 2723 tx/s | 64 x 1000 tx | 23501 ms | | | 2732 tx/s | 64 x 1000 tx | 23419 ms | | `gcc_64_andn_c5_lut_u1_sd` | 1832 tx/s | 64 x 1000 tx | 34924 ms | | | 1832 tx/s | 64 x 1000 tx | 34923 ms | | | 1832 tx/s | 64 x 1000 tx | 34925 ms | ## Samsung Grand Prime Environment: ARMv8, Android 5.1.1, clang-8.0.0. *Speedup*: 2.37x. | Configuration | Speed, tx/s | Distance, tx | Time, ms | | ------------------------------ | ---------- | ------------- | -------- | | `clang_neon_orn_c4_ptr_u4_ss` | 3345 tx/s | 128 x 500 tx | 19132 ms | | | 3350 tx/s | 128 x 500 tx | 19099 ms | | | 3350 tx/s | 128 x 500 tx | 19099 ms | | `clang_neon_orn_c4_ptr_u8_sd` | 3086 tx/s | 128 x 500 tx | 20733 ms | | | 3083 tx/s | 128 x 500 tx | 20757 ms | | | 3069 tx/s | 128 x 500 tx | 20850 ms | | `clang_64_andn_c4_ptr_u4_ss` | 1791 tx/s | 64 x 1000 tx | 35732 ms | | | 1791 tx/s | 64 x 1000 tx | 35725 ms | | | 1790 tx/s | 64 x 1000 tx | 35750 ms | | `clang_64_orn_c4_ptr_u4_ss` | 1561 tx/s | 64 x 1000 tx | 40991 ms | | | 1560 tx/s | 64 x 1000 tx | 41018 ms | | | 1561 tx/s | 64 x 1000 tx | 40981 ms | | `clang_64_andn_c4_ptr_u8_sd` | 1490 tx/s | 64 x 1000 tx | 42928 ms | | | 1491 tx/s | 64 x 1000 tx | 42916 ms | | | 1490 tx/s | 64 x 1000 tx | 42946 ms | | `clang_64_andn_c5_lut_u1_sd` | 1413 tx/s | 64 x 1000 tx | 45287 ms | | | 1413 tx/s | 64 x 1000 tx | 45274 ms | | | 1413 tx/s | 64 x 1000 tx | 45281 ms |