ems-typed

Version:

Persistent Shared Memory and Parallel Programming Model

github.com/homfen/ems.git

homfen/ems

101 lines (80 loc) • 3.68 kB

Markdown

View Raw

100

101

# EMS.js STREAMS Bandwidth Benchmark The original [STREAMS](https://www.cs.virginia.edu/stream/) benchmark measures the _Speed Not To Exceed_ bandwidth of a CPU to it's attached RAM by performing large numbers of simple operations. This version for EMS<nolink>.js measures the rate of different atomic operations on shared integer data and is implemented using two different parallelization schemes: Bulk Synchronous Parallel (BSP), and Fork-Join. ## Parallel Execution Models The EMS.js STREAMS benchmark is implemented twice, once using Bulk Synchronous Parallelism and again using Fork-Join parallelism. BSP goes parallel at program startup, decomposes loop iterations across processes, and performs a barrier at the end of each loop. FJ starts serial and goes parallel for each loop, joining back to serial execution before the next loop. <table> <tr> <td width="50%"> <center> <img src="../../Docs/ParallelContextsBSP.svg" type="image/svg+xml" height="500px"> </center> </td> <td width="50%"> <center> <img src="../../Docs/ParallelContextsFJ.svg" type="image/svg+xml" height="500px"> </td> </tr> </table> ## Running the Benchmark To run either version of STREAMS benchmark, specify the number of processes as an argument to the program: ```bash node streams_bulk_sync_parallel.js 8 ``` Running the BSP version of the STREAMS benchmark reports the rate at which each loop is performed, each loop exercising a single EMS read and write primitive. A second set of "Native" loops performs similar operations on ordinary volatile private memory, taking advantage of all possible native optimizations and just-in-time compilation. These results show a performance difference of 20-50x accessing EMS memory versus native memory, which is still very favorable compared to the 1000-5000x difference accessing a remote key-value store, or worse accessing magnetic media. ```bash dunlin> time node streams_bulk_sync_parallel.js 3 EMStask 0: 1,000,000 write 6,896,551 ops/sec EMStask 0: 1,000,000 writeXE 7,246,376 ops/sec EMStask 0: 1,000,000 writeXF 7,246,376 ops/sec EMStask 0: 1,000,000 read 7,042,253 ops/sec EMStask 0: 1,000,000 reread 6,896,551 ops/sec EMStask 0: 1,000,000 readFF 7,299,270 ops/sec EMStask 0: 1,000,000 readFE 7,194,244 ops/sec EMStask 0: 1,000,000 writeEF 7,299,270 ops/sec EMStask 0: 1,000,000 copy 3,521,126 ops/sec EMStask 0: 1,000,000 recopy 3,484,320 ops/sec EMStask 0: 1,000,000 c=a*b 2,364,066 ops/sec EMStask 0: 1,000,000 c+=a*b 1,715,265 ops/sec EMStask 0: 1,000,000 checksum Infinity ops/sec ------------------------ START NATIVE ARRAY EMStask 0: 1,000,000 write 333,333,333 ops/sec EMStask 0: 1,000,000 rd/write 250,000,000 ops/sec EMStask 0: 1,000,000 read 125,000,000 ops/sec EMStask 0: 1,000,000 reread 250,000,000 ops/sec EMStask 0: 1,000,000 copy 200,000,000 ops/sec EMStask 0: 1,000,000 rmwcopy 250,000,000 ops/sec EMStask 0: 1,000,000 c=a*b 100,000,000 ops/sec EMStask 0: 1,000,000 c+=a*b 142,857,142 ops/sec EMStask 0: 1,000,000 sum Infinity ops/sec real 0m2.938s user 0m8.190s sys 0m0.172s ``` __NOTE:__ Adding timing instrumentation to the Fork-Join version is left as an exercise to the reader. ## Benchmark Results Benchmark results for the BSP implementation on an AWS `c4.8xlarge (132 ECUs, 36 vCPUs, 2.9 GHz, Intel Xeon E5-2666v3, 60 GiB memory`. <center><img src="../../Docs/streams.svg" type="image/svg+xml" height="300px"> </center>