PandA2024.02

* * * * * * * * * * * * * * * * PolyBench/C 4.2.1 (beta) * * * * * * * * * * * * * * * * Copyright (c) 20112016 the Ohio State University. Contact: LouisNoel Pouchet <pouchet@cse.ohiostate.edu> Tomofumi Yuki <tomofumi.yuki@inria.fr> PolyBench is a benchmark suite of 30 numerical computations with static control flow, extracted from operations in various application domains (linear algebra computations, image processing, physics simulation, dynamic programming, statistics, etc.). PolyBench features include:  A single file, tunable at compiletime, used for the kernel instrumentation. It performs extra operations such as cache flushing before the kernel execution, and can set realtime scheduling to prevent OS interference.  Nonnull data initialization, and liveout data dump.  Syntactic constructs to prevent any dead code elimination on the kernel.  Parametric loop bounds in the kernels, for generalpurpose implementation.  Clear kernel marking, using pragmabased delimiters. PolyBench is currently available in C and in Fortran:  See PolyBench/C 4.2.1 for the C version  See PolyBench/Fortran 1.0 for the Fortran version (based on PolyBench/C 3.2) Available benchmarks (PolyBench/C 4.2.1) Benchmark Description 2mm 2 Matrix Multiplications (alpha * A * B * C + beta * D) 3mm 3 Matrix Multiplications ((A*B)*(C*D)) adi Alternating Direction Implicit solver atax Matrix Transpose and Vector Multiplication bicg BiCG Sub Kernel of BiCGStab Linear Solver cholesky Cholesky Decomposition correlation Correlation Computation covariance Covariance Computation deriche Edge detection filter doitgen Multiresolution analysis kernel (MADNESS) durbin Toeplitz system solver fdtd2d 2D Finite Different Time Domain Kernel gemm Matrixmultiply C=alpha.A.B+beta.C gemver Vector Multiplication and Matrix Addition gesummv Scalar, Vector and Matrix Multiplication gramschmidt GramSchmidt decomposition head3d Heat equation over 3D data domain jacobi1D 1D Jacobi stencil computation jacobi2D 2D Jacobi stencil computation lu LU decomposition ludcmp LU decomposition followed by Forward Substitution mvt Matrix Vector Product and Transpose nussinov Dynamic programming algorithm for sequence alignment seidel 2D Seidel stencil computation symm Symmetric matrixmultiply syr2k Symmetric rank2k update syrk Symmetric rankk update trisolv Triangular solver trmm Triangular matrixmultiply See the end of the README for mailing lists, instructions to use PolyBench, etc.  * New in 4.2.1beta:   Fix a bug in PAPI support, introduced in 4.2  Support PAPI 5.4.x  * New in 4.2:   Fixed a bug in syr2k.  Changed the data initialization function of several benchmarks.  Minor updates in the documentation and PolyBench API.  * New in 4.1:   Added LICENSE.txt  Fixed minor issues with cholesky both in documentation and implementation. (Reported by François Gindraud)  Simplified the macros for switching between data types. Now users may specify DATA_TYPE_IS_XXX where XXX is one of FLOAT/DOUBLE/INT to change all macros associated with data types.  * New in 4.0a:   Fixed a bug in jacobi1d (Reported by Sven Verdoolaege)  * New in 4.0:  This update includes many changes. Please see CHANGELOG for detailed list of changes. Most of the benchmarks have been edited/modified by Tomofumi Yuki, thanks to the feedback we have received by PolyBench users for the past few years.  Three benchmarks are out: dynprog, regdetect, fdtdapml.  Three benchmarks are in: nussinov, deriche, heat3d.  Jacobi1D and Jacobi2D perform two time steps in one time loop iteration alternating the source and target fields, to avoid the field copy statement.  Almost all benchmarks have been edited to ensure the computation result matches the mathematical specification of the operation.  A major effort on documentation and harmonization of problem sizes and data allocations schemes. * Important Note:  PolyBench/C 3.2 kernels had numerous implementation errors making their outputs to not match what is expected from the mathematical specification of the operation. Many of them did not influence the program behavior (e.g., the number and type of operations, data dependences, and overall controlflow was similar to the corrected implementation), however, some had nonnegligible impact. These are described below.  adi: There was an offbyone error, which made back substitution part of a pass in ADI to not depend on the forward pass, making the program fully tilable.  syrk: A typo on the loop bounds made the iteration space rectangular instead of triangular. This has led to additional dependences and two times more operations than intended.  trmm: A typo on the loop bounds led to the wrong half of the matrix being used in the computation. This led to additional dependences, making it harder to parallelize this kernel.  lu: An innermost loop was missing for the operation to be valid on general matrices. This cause the kernel to perform about half the work compared to a general implementation of LU decomposition. The new implementation is the generic LU decomposition. In addition, some of the kernels used "highfootprint" memory allocation for easier parallelization, where variables used in accumulation were fully expanded. These variables were changed to only use a scalar.  * New in 3.2:   Rename the package to PolyBench/C, to prepare for the upcoming PolyBench/Fortran and PolyBench/GPU.  Fixed a typo in polybench.h, causing compilation problems for 5D arrays.  Fixed minor typos in correlation, atax, cholesky, fdtd2d.  Added an option to build the test suite with constant loop bounds (default is parametric loop bounds)  * New in 3.1:   Fixed a typo in polybench.h, causing compilation problems for 3D arrays.  Set by default heap arrays, stack arrays are now optional.  * New in 3.0:   Multiple dataset sizes are predefined. Each file comes now with a .h header file defining the dataset.  Support of heapallocated arrays. It uses a single malloc for the entire array region, the data allocated is cast into a C99 multidimensional array.  One benchmark is out: gauss_filter  One benchmark is in: floydwarshall  PAPI support has been greatly improved; it also can report the counters on a specific core to be set by the user.  * Mailing lists:  ** polybenchannounces@lists.sourceforge.net:  Announces about releases of PolyBench. ** polybenchdiscussion@lists.sourceforge.net:  General discussions reg. PolyBench.  * Available benchmarks:  See utilities/benchmark_list for paths to each files. See doc/polybench.pdf for detailed description of the algorithms.  * Sample compilation commands:  ** To compile a benchmark without any monitoring:  $> gcc I utilities I linearalgebra/kernels/atax utilities/polybench.c linearalgebra/kernels/atax/atax.c o atax_base ** To compile a benchmark with execution time reporting:  $> gcc O3 I utilities I linearalgebra/kernels/atax utilities/polybench.c linearalgebra/kernels/atax/atax.c DPOLYBENCH_TIME o atax_time ** To generate the reference output of a benchmark:  $> gcc O0 I utilities I linearalgebra/kernels/atax utilities/polybench.c linearalgebra/kernels/atax/atax.c DPOLYBENCH_DUMP_ARRAYS o atax_ref $> ./atax_ref 2>atax_ref.out  * Some available options:  They are all passed as macro definitions during compilation time (e.g, Dname_of_the_option). ** Typical options:   POLYBENCH_TIME: output execution time (gettimeofday) [default: off]  MINI_DATASET, SMALL_DATASET, MEDIUM_DATASET, LARGE_DATASET, EXTRALARGE_DATASET: set the dataset size to be used [default: STANDARD_DATASET]  POLYBENCH_DUMP_ARRAYS: dump all liveout arrays on stderr [default: off]  POLYBENCH_STACK_ARRAYS: use stack allocation instead of malloc [default: off] ** Options that may lead to better performance:   POLYBENCH_USE_RESTRICT: Use restrict keyword to allow compilers to assume absence of aliasing. [default: off]  POLYBENCH_USE_SCALAR_LB: Use scalar loop bounds instead of parametric ones. [default: off]  POLYBENCH_PADDING_FACTOR: Pad all dimensions of all arrays by this value [default: 0]  POLYBENCH_INTER_ARRAY_PADDING_FACTOR: Offset the starting address of polybench arrays allocated on the heap (default) by a multiple of this value [default: 0]  POLYBENCH_USE_C99_PROTO: Use standard C99 prototype for the functions. [default: off] ** Timing/profiling options:   POLYBENCH_PAPI: turn on papi timing (see below).  POLYBENCH_CACHE_SIZE_KB: cache size to flush, in kB [default: 33MB]  POLYBENCH_NO_FLUSH_CACHE: don't flush the cache before calling the timer [default: flush the cache]  POLYBENCH_CYCLE_ACCURATE_TIMER: Use Time Stamp Counter to monitor the execution time of the kernel [default: off]  POLYBENCH_LINUX_FIFO_SCHEDULER: use FIFO realtime scheduler for the kernel execution, the program must be run as root, under linux only, and compiled with lc [default: off]  * PAPI support:  ** To compile a benchmark with PAPI support:  $> gcc O3 I utilities I linearalgebra/kernels/atax utilities/polybench.c linearalgebra/kernels/atax/atax.c DPOLYBENCH_PAPI lpapi o atax_papi ** To specify which counter(s) to monitor:  Edit utilities/papi_counters.list, and add 1 line per event to monitor. Each line (including the last one) must finish with a ',' and both native and standard events are supported. The whole kernel is run one time per counter (no multiplexing) and there is no sampling being used for the counter value.  * Accurate performance timing:  With kernels that have an execution time in the orders of a few tens of milliseconds, it is critical to validate any performance number by repeating several times the experiment. A companion script is available to perform reasonable performance measurement of a PolyBench. $> gcc O3 I utilities I linearalgebra/kernels/atax utilities/polybench.c linearalgebra/kernels/atax/atax.c DPOLYBENCH_TIME o atax_time $> ./utilities/time_benchmark.sh ./atax_time This script will run five times the benchmark (that must be a PolyBench compiled with DPOLYBENCH_TIME), eliminate the two extremal times, and check that the deviation of the three remaining does not exceed a given threshold, set to 5%. It is also possible to use POLYBENCH_CYCLE_ACCURATE_TIMER to use the Time Stamp Counter instead of gettimeofday() to monitor the number of elapsed cycles.  * Generating macrofree benchmark suite:  (from the root of the archive:) $> PARGS="I utilities DPOLYBENCH_TIME"; $> for i in `cat utilities/benchmark_list`; do perl utilities/create_cpped_version.pl $i "$PARGS"; done This create for each benchmark file 'xxx.c' a new file 'xxx.preproc.c'. The PARGS variable in the above example can be set to the desired configuration, for instance to create a full C99 version (parametric arrays): $> PARGS="I utilities DPOLYBENCH_USE_C99_PROTO"; $> for i in `cat utilities/benchmark_list`; do perl utilities/create_cpped_version.pl $i "$PARGS"; done  * Utility scripts:  create_cpped_version.pl: Used in the above for generating macro free version. makefilegen.pl: generates make files in each directory. Options are globally configurable through config.mk at polybench root. headergen.pl: refers to 'polybench.spec' file and generates header in each directory. Allows default problem sizes and datatype to be configured without going into each header file. runall.pl: compiles and runs each kernel. clean.pl: runs make clean in each directory and then removes Makefile.