The CSC Benchmarks (Spring 1993) by Sami Saarinen Center for Scientific Computing P.O. Box 405 Tietotie 6 FIN-02101 Espoo Finland e-mail: sbs@csc.fi April 9, 1993 Table of contents 1. Introduction ................................................. 2. Contents of the benchmarks ................................... 2.1 A subset of the SPEC-benchmarks ............................. 2.1.1 Reporting the SPEC-results ................................ 2.1.2 Accessing the SPEC-benchmarks ............................. 2.2 The CSC Benchmark Suite, version 2 .......................... 2.2.1 Benchmark program characteristics ......................... 2.2.2 Run rules ................................................. 2.2.3 Reporting the CSC Benchmark Suite results ................. 2.3 Parallel benchmarks ......................................... 2.4 Miscellaneous benchmarks .................................... 3. Performance metrics .......................................... 3.1 Wall clock time ............................................. 3.1.2 Sum of running times ...................................... 3.2 Ratio of wall clock times ................................... 3.3 Nominal flop count .......................................... 3.4 Application performance (nominal Mflop/s) ................... 3.5 Benchmark performance ....................................... 3.6 Benchmark averages .......................................... 3.6.1 Geometric mean ............................................ 3.6.2 Arithmetic mean ........................................... 3.6.3 Harmonic mean ............................................. 3.7 Benchmark instability ....................................... 3.8 Parallel program measures ................................... 3.8.1 Parallel speedup .......................................... 3.8.2 Parallel efficiency ....................................... 3.9 Measuring communication overhead ............................ 3.9.1 Computation to communication ratio ........................ 3.9.2 Modeling communication time ............................... 4. Execution instructions ....................................... 4.1 Reading the benchmark tape .................................. 4.2 Directories, files and utility functions .................... 4.2.1 CSCSUITE_2 ................................................ 4.2.2 PARALLEL .................................................. 4.2.3 MISC ...................................................... 4.2.4 Timer functions ........................................... 4.2.5 Controlling the number of processors ...................... 4.3 Compiling and linking the applications ...................... 4.3.1 The CSC Benchmark Suite ................................... 4.3.2 Parallel benchmarks ....................................... 4.3.3 Miscellaneous benchmarks .................................. 4.4 Running the benchmarks ...................................... 4.4.1 The CSC Benchmark Suite ................................... 4.4.2 Parallel benchmarks ....................................... 4.4.3 Miscellaneous benchmarks .................................. 5. Contact information .......................................... Appendices: A: Configuration of the CSC reference computer system ........... B: A short description of the SPEC-benchmarks ................... C: Guidelines to report system configuration .................... D: Description of the CSC Benchmark Suite, version 2 ............ 
E: PARMACS macro calls ..........................................

1. Introduction

This document describes the CSC Benchmarks to be run by selected vendors in Spring 1993 during a computer evaluation project. First, I will go through the contents of the benchmarks. After that I will discuss some benchmark metrics that we are interested in. Finally, detailed benchmark instructions follow.

This benchmark has been planned very carefully and should be easy to port to any UNIX-based system with proper Fortran-77 and C compilers. I have included 'run script' and 'makefile' generators to facilitate installation, building the executables and running the benchmarks. Detailed information about the programs' characteristics is also included: a description of each individual program, the reference results on our reference computer system, the nominal megaflop counts collected with the help of the Cray Hardware Performance Monitor (hpm), and some additional information that categorizes the benchmark programs. Our reference computer system is a Silicon Graphics Indigo R4000. More detailed information about the system is found in appendix A.

2. Contents of the benchmarks

The benchmark is divided into four parts:

(1) A subset of the Standard Performance Evaluation Corporation (SPEC) benchmarks
(2) The CSC Benchmark Suite, version 2
(3) Parallel benchmarks
(4) Miscellaneous benchmarks

Most of the benchmarks apply to all vendors, unless otherwise stated. The following subsections explain these benchmarks. Where appropriate, the appendices list more detailed information.

2.1 A subset of the SPEC-benchmarks

The aim of this benchmark is to report the per-processor performance of the proposed system by following the run rules specified by the Standard Performance Evaluation Corporation (SPEC). Because most computer manufacturers are currently members of the SPEC organization (or have access to its benchmarks), this benchmark does not (on average) require any explicit running of the benchmarks, since the results are reported quarterly in SPEC's newsletter. We are specifically interested in the SPEC Floating Point Suite 92 (SPECfp92 or CFP92) and SPEC Integer Suite 92 (SPECint92 or CINT92) breakdowns, i.e. the per-program results.

The SPEC CFP92 benchmark suite consists of CPU intensive benchmarks that are intended to be meaningful samples of applications which perform floating point logic and computations in a technical computing environment. The SPEC CINT92 benchmark suite consists of CPU intensive benchmarks that are intended to be meaningful samples of applications which perform non-floating point logic and computations in a technical computing environment. Note that neither CFP92 nor CINT92 assesses the ability of a system under test to handle disk, graphics or any form of networking or communication. Many of the SPEC benchmarks have been derived from publicly available application programs, and they are intended to be portable to as many current and future hardware platforms as possible. Appendix B contains a brief description of each SPEC-benchmark program.

2.1.1 Reporting the SPEC-results

We are interested in the individual turn-around times for each program. Note, however, that unlike SPEC, we will compare these times to our reference computer. The vendors should report the SPEC-results in the following form (the wall clock times in the tables indicate reference computer results reported by Silicon Graphics Inc.
in the SPEC-newsletter Vol.4, No.3, September 1992, pages 28 and 18):

SPECfp92:
 ____________________________________
| Program        | Wall clock time   |
|                | (seconds)         |
|________________|___________________|
| 013.spice2g6   |       541.5       |
| 015.doduc      |        36.3       |
| 034.mdljdp2    |        88.0       |
| 039.wave5      |        89.1       |
| 047.tomcatv    |        40.8       |
| 048.ora        |       101.6       |
| 052.alvinn     |       101.5       |
| 056.ear        |       272.7       |
| 077.mdljsp2    |        76.8       |
| 078.swm256     |       359.2       |
| 089.su2cor     |       176.7       |
| 090.hydro2d    |       182.0       |
| 093.nasa7      |       230.8       |
| 094.fpppp      |       171.5       |
|________________|___________________|

SPECint92:
 ____________________________________
| Program        | Wall clock time   |
|                | (seconds)         |
|________________|___________________|
| 008.espresso   |        41.1       |
| 022.li         |        95.0       |
| 023.eqntott    |        14.4       |
| 026.compress   |        66.9       |
| 072.sc         |        61.9       |
| 085.gcc        |       124.8       |
|________________|___________________|

In addition to this, the vendor should report the configuration of the benchmarked system as specified in appendix C.

2.1.2 Accessing the SPEC-benchmarks

In case some vendor(s) currently have no access to the benchmark source codes, they can be purchased from the following address (attn. Dianne Dean):

SPEC [Standard Performance Evaluation Corporation]
c/o NCGA [National Computer Graphics Association]
2722 Merrilee Drive
Suite 200
Fairfax, VA 22031
USA
Phone: +1-703-698-9600 Ext. 318
FAX:   +1-703-560-2752
E-Mail: spec-ncga@cup.portal.com

The prices for the CINT92 and CFP92 release 1.1 QIC-24 tapes are USD 425 and USD 575, respectively.

2.2 The CSC Benchmark Suite, version 2

This benchmark comprises a set of programs that represent the average load of the current computer systems at CSC. The benchmark consists of 14 floating point intensive programs. All except one are written in Fortran-77; the remaining one is written in C. A description of each program is found in appendix D.

2.2.1 Benchmark program characteristics

Each program in this benchmark set has been carefully analyzed with Cray's Hardware Performance Monitor (hpm) and with some general tools available under the Unix operating system. In addition, reference timings are provided for the reference system, a Silicon Graphics Indigo R4000. The following table provides some static information about the programs:

 __________________________________________________________________________
| Program       Prec.   Source |  Text   Data+Bss   TotalSize | DiskSpace  |
|               (bits)  lines  |  (KB) +   (KB)   =   (KB)    |   (KB)     |
|______________________________|______________________________|____________|
| ARCTWOD         64     3759  |   344     2788       3132    |     942    |
| CASTEP          64    13337  |   548     3680       4228    |    2642    |
| FREQUENCY       32      276  |   212    25777      25989    |     628    |
| GRSOS           32      316  |   196    19613      19809    |     457    |
| INVPOW93        64      486  |   220     2930       3150    |    6408    |
| MOPAC           64    22093  |   676    26824      27500    |    1294    |
| NASKER          64     1101  |   200     2877       3077    |     351    |
| NBODYOPT        32     1459  |   152     1224       1376    |     299    |
| RIEMANN (*)     64      439  |    48       74        122    |      86    |
| SIMCZO          64     2069  |   252    35960      36212    |     837    |
| WHY12M          64      996  |   220    23742      23962    |   18201    |
| MD1             32     1129  |   168     9230       9398    |     327    |
| PDE1            64      207  |   144    37041      37185    |     287    |
| QCD1            64     2641  |   252     7763       8015    |     448    |
|______________________________|______________________________|____________|
(*) Written in the C-language.

All programs should conform to the precision shown above. Note that GRSOS contains some REAL*8 (DOUBLE PRECISION) operations that should not be removed, but the data arrays are still INTEGERs and REALs. Note also that all 64-bit codes except CASTEP contain an explicit DOUBLE PRECISION definition (either via an IMPLICIT statement or variable by variable).
This means that the vendor has to activate DOUBLE PRECISION via an appropriate compiler flag or, if no such flag exists, manually include IMPLICIT DOUBLE PRECISION statements at the beginning of each routine in CASTEP. Furthermore, CASTEP seems to treat all variables that begin with the letter 'C' as COMPLEX. CASTEP may also require some intrinsic functions (CEXP, AIMAG, AMOD) to be changed (to EXP, IMAG, MOD). This change can be made by hand, or by running the CASTEP sources through '/lib/cpp -P' with appropriate definitions (-DCEXP=EXP -DAIMAG=IMAG -DAMOD=MOD). If the C-preprocessor is invoked by the Fortran compiler by default, include these definitions in 'FFLAGS' (see chapter 4.2).

Some of the codes also require automatic SAVE-statements to be activated. I rely on the fact that all vendors have such a compiler flag available. I recommend using "auto-SAVE" for all programs, but especially for CASTEP, MOPAC, SIMCZO, MD1, PDE1 and QCD1.

Some of the programs are suitable for fine grain parallelism. We have found that, for example, ARCTWOD, FREQUENCY, GRSOS, INVPOW93 and NASKER may benefit from parallel processors.

The programs can also be divided by application area. The following table gives each program a typical application area (or areas):

 __________________________________________________________
| Program      Application area(s)                         |
|__________________________________________________________|
| ARCTWOD      Fluid Dynamics, Engineering                 |
| CASTEP       Chemistry, Physics                          |
| FREQUENCY    Engineering, Mathematics                    |
| GRSOS        Physics                                     |
| INVPOW93     Engineering, Mathematics, Eigenvalues       |
| MOPAC        Chemistry                                   |
| NASKER       Fluid Dynamics, Engineering, Mathematics    |
| NBODYOPT     Astrophysics, Mathematics                   |
| RIEMANN      Mathematics                                 |
| SIMCZO       Structural Mechanics, Fluid Dynamics        |
| WHY12M       Mathematics, Sparse Matrices                |
| MD1          Chemistry                                   |
| PDE1         Mathematics, Partial Differential Equations |
| QCD1         Quantum Mechanics                           |
|__________________________________________________________|

The next table gives approximate nominal megaflop counts derived from Cray 'hpm' Group-0 and Group-3 executions, together with the reference timings for the reference computer system. Nominal megaflop count and performance are described in more detail in the 'Performance metrics' chapter later on. The table also provides additional information gathered from 'hpm':

o Vectorization percentage, which is calculated as the ratio of vector floating point operations to the total number of floating point operations (vector & scalar) on the Cray X-MPEA.
o Average vector length (Avg.VL), which is expressed modulo 64, since the Cray X-MPEA's vector register length is 64.
o Memory references, which tell how many accesses to 64-bit precision words occurred during the execution of the program.

The application performance for the reference system is calculated by dividing the nominal megaflop count obtained on the Cray X-MPEA by the wall clock time on the reference system. This is NOT the REAL Mflop/s, only a fairly good approximation in most cases. However, very high values (reaching or exceeding the theoretical peak) may indicate that the compiler and/or the preprocessor has done a good job, or that the program takes advantage of 32-bit arithmetic that is not available on the Cray X-MPEA.

 ____________________________________________________________________________
| Program       Floating point               Memory | Wall clock | Application |
|             Nom.Ops.  Vector  Avg.VL       Refs.  | time (sec) | performance |
|____________________________________________________|____________|_____________|
| ARCTWOD       555M     100%     53          692M  |     81.2   |     6.8     |
| CASTEP       3211M      69%     43         3880M  |    450.7   |     7.1     |
| FREQUENCY     517M     100%     63         1405M  |     84.2   |     6.1     |
| GRSOS        9162M      97%     59         3192M  |    446.3   |    20.5     |
| INVPOW93     1014M     100%     57         1521M  |    143.8   |     7.0     |
| MOPAC        1256M      58%     11         1573M  |    162.1   |     7.7     |
| NASKER       2149M     100%     53         2169M  |    294.5   |     7.3     |
| NBODYOPT      824M      86%     10          547M  |     64.2   |    12.8     |
| RIEMANN       248M       0%      0         2527M  |     72.2   |     3.4     |
| SIMCZO       1356M      94%     32         2311M  |    659.4   |     2.1     |
| WHY12M        164M       3%      0         1017M  |    124.7   |     1.3     |
| MD1           837M      22%     63          576M  |     92.4   |     9.1     |
| PDE1          494M     100%     60          522M  |     92.3   |     5.4     |
| QCD1         1314M      94%     64         1792M  |    268.7   |     4.9     |
|____________________________________________________|____________|_____________|
| Total       23101M      66%     41        23724M  |   3036.7   |     7.6     |
|____________________________________________________|____________|_____________|

From these reference results the following additional information is derived (the performance figures are in nominal Mflop/s; the benchmark instability is a dimensionless ratio):

 ______________________________ ____________________
| Statistic                    | Reference system   |
|                              | value              |
|______________________________|____________________|
| Benchmark performance        |        7.6         |
| Geometric mean performance   |        5.9         |
| Arithmetic mean performance  |        7.2         |
| Harmonic mean performance    |        4.6         |
| Benchmark instability        |       15.8         |
|______________________________|____________________|

Please refer to chapter 3 for an explanation of these metrics.

2.2.2 Run rules

Performance results should be given for two versions of the codes:

o Baseline
o Optimized

CSC provides the Baseline source codes. It is up to the vendor to optimize the Baseline source codes in order to produce the Optimized versions. We will consider this a credit to the vendor. In addition, varying numbers of processors must be used to accomplish the tasks. For the Baseline and Optimized runs the following numbers of processors must be used:

o single-processor results
o P-processor results
o P/2-processor results [P/2+1 if P is odd]
o P_opt-processor results

where P = maximum number of proposed processors, and P_opt = number of processors that gives optimal performance for each individual program. This means that in principle a maximum of 2x4 executions of each benchmark program is required in order to run the complete benchmark suite. However, it is left to the vendor to decide whether the optimal number of processors differs from the P-processor case.

When running any of the benchmark programs, the following general run rules must be followed:

o All times reported must be for runs that produce correct results.
o All information necessary for replication of the results should be disclosed and available on request.
o Single-user mode is allowed, but must be reported.
o Use of benchmark-specific software (preprocessors etc.) is not allowed. Note that this does NOT prevent the use of regular preprocessors provided by vendors, nor the use of KAP, VAST etc. However, all performance improvers must be included in the proposed system, too.

To obtain the Baseline results, the vendor must obey the following additional rules:

o Source code may not be modified unless it is required for portability. This includes manual insertion of compiler directives. All changes must be reported.
o No use of scientific libraries that would replace the original code is allowed, unless this is done automatically by the compiler and does not increase the compilation/linking time dramatically.
Thus, the general rule for obtaining the Baseline results is to give the compiler and preprocessors full freedom to optimize as much as possible, without the need to make any source code modifications by hand.

To obtain the Optimized results, the vendor CAN do the following things:

o Modify the source by hand as required, but still solving the same problem and providing the same output (final and intermediate) using the same input files.
o Insert compiler directives.
o Insert calls to scientific libraries.

2.2.3 Reporting the CSC Benchmark Suite results

Before reporting the actual run times, the vendor should report the compilation and linking times (wall clock time) for each application and case. We will consider it a credit if the compiler and linker do not spend too much time in optimizing the codes. For the Baseline and the Optimized results, the vendor should report the CSC Benchmark Suite results in the following form:

 ___________________________________ ___________________ ___________________
|               |  # of CPUs = 1    |  # of CPUs = P/2  |  # of CPUs = P    |
|   Program     |___________________|___________________|___________________|
|               |  Wall clock time  |  Wall clock time  |  Wall clock time  |
|               |  (seconds)        |  (seconds)        |  (seconds)        |
|_______________|___________________|___________________|___________________|
| ARCTWOD       |                   |                   |                   |
| CASTEP        |                   |                   |                   |
| FREQUENCY     |                   |                   |                   |
| GRSOS         |                   |                   |                   |
| INVPOW93      |                   |                   |                   |
| MOPAC         |                   |                   |                   |
| NASKER        |                   |                   |                   |
| NBODYOPT      |                   |                   |                   |
| RIEMANN       |                   |                   |                   |
| SIMCZO        |                   |                   |                   |
| WHY12M        |                   |                   |                   |
| MD1           |                   |                   |                   |
| PDE1          |                   |                   |                   |
| QCD1          |                   |                   |                   |
|_______________|___________________|___________________|___________________|

To report the "P_opt-processor" results, create a table similar to the one above, but also specify the number of processors used to run each of the programs. This means that the actual value of "P_opt" may vary from program to program.

2.3 Parallel benchmarks

2.3.1 CSCSUITE_2

This item was already covered in section 2.2.3.

2.3.2 DM

DM, the distributed memory benchmarks, consists of 4 programs, of which one (COMMS1) belongs to the miscellaneous tests. The remaining three are DM-versions of MD1, PDE1 and QCD1. They are coded in Fortran, but contain system dependent PARMACS message passing macros (see Appendix E) to facilitate process to process communication.

At least the following cases should be run (on 2, 4, 8 and 16 processors):

o MD1  - NC=9 i.e. 4*9^3 (2916) & NC=11 i.e. 4*11^3 (5324) atoms
o PDE1 - NN=6 & NN=7
o QCD1 - 16*4^3 and 8*8^3 systems

Brief instructions for these tests are found in chapter 4.

2.4 Miscellaneous benchmarks

Brief instructions for these tests are found in chapter 4.

3. Performance metrics

In this chapter some crucial performance metrics used in this report are described.

3.1 Wall clock time

This is also known as real elapsed time, turn-around time, time-to-solution or running time. It is measured from the beginning to the end of the application. It is the only relevant measure when comparing parallel applications with each other. In many benchmark contexts CPU-times are often mentioned; during this benchmark we are not very interested in these figures.

3.1.2 Sum of running times

Given a set of wall clock times for different benchmark applications, the sum of running times is simply the sum of the wall clock times. During this benchmark it has no special meaning, but it is included for the sake of completeness.

3.2 Ratio of wall clock times

Assume that wall clock times have been measured for a particular application.
The ratio of wall clock times is the application's wall clock time divided by the reference computer system's wall clock time.

3.3 Nominal flop count

The nominal flop count is gathered from Cray's Hardware Performance Monitor (hpm) by counting the Cray hardware multiplies, adds and reciprocals for a particular application and converting them to a nominal megaflop count using the following formula:

   Nominal flop count = Multiplies + Adds - 2 * Reciprocals

The formula stems from the fact that one divide on Cray hardware requires 1 reciprocal and 3 multiplies (or 2 multiplies and one add). Although the REAL flop count varies from system to system, this value is probably the best value that we can reliably get, and it stays constant once obtained.

3.4 Application performance (nominal Mflop/s)

Dividing the program's nominal flop count (normally expressed in megaflops) by the wall clock time on a particular machine, we get a fairly good measure of application performance that resembles the famous Mflop/s metric:

   Application performance = (Cray's nominal flop count) / (application wall clock time on the machine in question)

This can also be misleading if interpreted incorrectly: a smart compiler may optimize away a lot of floating point operations that the current Cray compiler was not able to, and 32-bit codes may perform "unexpectedly" fast. We recommend calling it "application performance", NOT Mflop/s.

3.5 Benchmark performance

Benchmark performance is defined as the ratio of the sum of all applications' nominal flop counts to the sum of all applications' wall clock times:

   Benchmark performance = (sum of applications' nominal flop counts) / (sum of applications' wall clock times)

3.6 Benchmark averages

Various summaries can be drawn from the results of several individual program executions. These summaries are sometimes characterized by different averages. Although we do not rely on averages as much as on individual program results, they are still worth mentioning. One of the most popular and least misleading averages is the geometric mean; it is not very sensitive to large variations in the data values. The arithmetic and harmonic means can sometimes be very misleading: the former will give an unexpectedly good average if even one data value is high, and the latter works the other way around - if there is one bad result (a low value), it will destroy the average of the whole data set.

3.6.1 Geometric mean

The geometric mean is defined as the Nth root of the product of N data values:

   Geometric mean = ( Product {data_i, i=1..N} ) ** (1/N)

3.6.2 Arithmetic mean

The arithmetic mean is defined as the average of N data values:

   Arithmetic mean = ( Sum {data_i, i=1..N} ) / N

3.6.3 Harmonic mean

The harmonic mean is defined as N divided by the sum of the reciprocals of the N data values:

   Harmonic mean = N / ( Sum {1 / data_i, i=1..N} )

3.7 Benchmark instability

The benchmark instability of a tested machine is defined as the ratio of the maximum attained application performance to the minimum attained application performance over the programs included in the benchmark set:

   Benchmark instability = (maximum attained application performance) / (minimum attained application performance)

3.8 Parallel program measures

Parallel program measures stem from the fact that the run time of any application is essentially composed of two parts: the sequential and the parallel portion's time, T_s and T_par, respectively.
Thus, the one-processor wall clock time is expressed in the form:

   T(1) = T_s + T_par

Running the same application in parallel with P processors will result in a run time of:

   T(P) = T_s + T_par / P

This is, however, only an approximation, since increasing the number of processors also brings in synchronization and communication. Thus, in reality the run time of a parallel application is higher:

   T(P) = T_s + T_par / P + T_sc(P)

where T_sc is the time spent in synchronization and communication between processors and is typically a monotonically increasing function of the number of processors P. The latter formula is the reason why we want the vendor to run certain applications with "P_opt processors". In reality, if we exclude so-called embarrassingly parallel applications, no continuous performance increase is gained by increasing the number of processors. This is because a typical parallel application suffers from synchronization and communication bottlenecks, and beyond a certain number of processors this overhead becomes larger, making the application run slower even though more processors are put to work together.

In the distributed memory benchmarks we would also like to see the effect of the problem size on the solution time. If the problem size is N (defined appropriately), then the solution time varies according to the following formula:

   T(P,N) = T_s(N) + T_par(N) / P + T_sc(P,N)

3.8.1 Parallel speedup

Regarding parallel speedup, it is currently recommended by the benchmark authorities that the speedup itself should not be used to compare results between different architectures. However, using it to study application speedup within a single architecture is not a bad idea. Parallel speedup is defined as:

   Parallel speedup = (one-processor wall clock time) / (P-processor wall clock time)

Note that if the code has been optimized, then comparing the speedups of the Baseline results to those of the Optimized results is not valid. This is because the one-processor version now performs much better than in the Baseline case, resulting in lower parallel speedups for the Optimized results. A general rule is: the better the attained single-processor performance, the lower the parallel speedup. Note also that the parallel speedup may exceed the actual number of processors if the problem does not fit properly into the memory of the one-processor system. This results in a longer one-processor execution time than would be expected if there were enough memory available.

3.8.2 Parallel efficiency

This is defined by the following formula and expressed as a percentage:

   Parallel efficiency = (one-processor wall clock time) / ((P-processor wall clock time) * P) x 100%

3.9 Measuring communication overhead

In distributed memory applications, which mainly use message passing to exchange data between processors, communication becomes an important aspect. During data exchange the processes are normally busy with sending or receiving the data, which can create serious bottlenecks in parallel code.

3.9.1 Computation to communication ratio

One way to measure the goodness of a message passing application is to keep track of the time spent in communication. The computation to communication ratio is then defined as:

   Comp./Comm. ratio = (wall clock time - communication time) / (communication time)

The larger the ratio, the better the performance and the smaller the share of communication.
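To make these definitions concrete, the following minimal Fortran-77 sketch evaluates the parallel speedup, the parallel efficiency and the computation to communication ratio from measured times. The routine and variable names are illustrative only; no such routine exists in the benchmark sources:

      SUBROUTINE PARMET(T1, TP, NP, TCOMM, SPEEDU, EFFIC, CCRAT)
C     T1     : one-processor wall clock time (seconds)
C     TP     : P-processor wall clock time (seconds)
C     NP     : number of processors P
C     TCOMM  : time spent in communication (seconds)
C     SPEEDU : parallel speedup                   = T1 / TP
C     EFFIC  : parallel efficiency (in per cent)  = T1 / (TP * P) * 100
C     CCRAT  : computation to communication ratio = (TP - TCOMM) / TCOMM
      INTEGER NP
      DOUBLE PRECISION T1, TP, TCOMM, SPEEDU, EFFIC, CCRAT
      SPEEDU = T1 / TP
      EFFIC  = SPEEDU / DBLE(NP) * 100.0D0
      CCRAT  = (TP - TCOMM) / TCOMM
      RETURN
      END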
3.9.2 Modeling communication time

Communication itself can be modeled by the following linearized formula:

   Communication time = Latency_time + Number_of_Bytes / Transfer_Speed

The Latency_time is essentially the time spent when sending a zero-length message. The Transfer_Speed is the speed of the communication network; it is usually also a weak function of the number of bytes transferred.

4. Execution instructions

4.1 Reading the benchmark tape

The benchmark streamer tape (cartridge) has been written on a Sun Sparcstation's tape drive using the 'tar' command. Before reading it onto disk, specify a benchmark root directory. I will refer to it here through the environment variable BENCH:

% setenv BENCH /my/benchmark/root/directory
% mkdir $BENCH
% cd $BENCH

To unload the tape, type:

% tar xv

In case of an alternate tape drive, use a command like this:

% tar xvf /dev/other_tape_drive

Some systems may not recognize the tape format. In such a case you may have to use the 'dd' command with byte swapping:

% dd if=/dev/tape_drive ibs=obs conv=swab | tar xvf -

After this, run the following command to install the benchmark files properly:

% install

This will do the rest; for example it creates additional directories, uncompresses a few compressed tar-files found on the tape and creates some useful symbolic links. In case of serious problems in unloading the tape, I will also put the tar-file onto our anonymous ftp server nic.funet.fi (128.214.6.100), under the directory pub/csc/benchmark. Use binary transfer mode to retrieve the benchmark tar-file named 'spring93.tar'.

4.2 Directories, files and utility functions

Once you have unloaded the tape, you will find the following subdirectories under your benchmark root directory:

o CSCSUITE_2
o PARALLEL
o MISC

These refer to the test sets described in chapter 2. Naturally there are no files for the SPEC-benchmarks. In the following subsections I will use the next two abbreviations quite often:

o <program> - refers to an application name (ARCTWOD, CASTEP, etc.)
o <arch>    - refers to a computer architecture (SGI, CRAY, etc.)

Also, there are two important files that appear from time to time during the use of the command scripts:

o program.list       - contains the list of application codes
o <arch>.make_flags  - contains default modifications to the standard makefile settings

For example, the file 'CRAY.make_flags' may look like this:

ARCTWOD   - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
CASTEP    - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
FREQUENCY - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
GRSOS     - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
INVPOW93  - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
MOPAC     - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
NASKER    - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
NBODYOPT  - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
RIEMANN   - C       # CC=cc
SIMCZO    - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
WHY12M    - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
MD1       - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
PDE1      - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'
QCD1      - Fortran # FC=cf77 ;LD=cf77 ;FFLAGS=-Zv -Wf'-dp -a static'

A brief explanation of this file follows.
Take ARCTWOD as an example:

o ARCTWOD is written in Fortran
o override the default FC setting (FC=f77) in the makefile with cf77
o override the default LD setting (LD=f77) with cf77
o use new flags for the Fortran compilation: -Zv -Wf'-dp -a static'

The format is simply the following (one program per line):

o program name
o dash ('-')
o language (keyword 'Fortran' or 'C')
o hash ('#')
o overrides: NAME=new_value
o the separator between overrides is a semicolon (';')

This file is parsed with 'awk' during the generation of the application specific makefiles.

4.2.1 CSCSUITE_2

CSCSUITE_2 contains several shell scripts and also a subdirectory for each application, used to run the tests described in section 2.2. In addition, a subdirectory called "lib" is found; it is a depository for the timer-function routines and library. Each application directory contains files and directories organized in the following way:

Directories:
o <program>/src      - fsplit'ted Fortran or C source codes
o <program>/results  - reference results directory (<program>.out)

Files:
o <program>/<program>.exe       - the executable for the application
o <program>/<program>.in        - standard input file (if any)
o <program>/<program>.out       - standard output file (if any)
o <program>/*                   - other files: miscellaneous input
o <program>/src/Makefile.<arch> - architecture dependent makefile

The following shell scripts are found directly under CSCSUITE_2:

o Makefile          - a driver makefile for some scripts
o program.list      - list of application programs; used by the scripts
o <arch>.make_flags - default compiler/linker etc. flags used upon generation of the architecture dependent makefiles
o build_script      - builds the Bourne-shell run script for a single application; see also 4.2.5 for limiting the number of CPUs
o build_all_scripts - builds all scripts in one shot
o build_make        - builds the architecture dependent makefile for a specific application
o build_all_makes   - builds all architecture dependent makefiles
o make_program      - invokes the architecture dependent makefile for a specific application
o make_all_programs - makes all programs, including the libraries
o run_program       - runs a specific program
o run_all_programs  - runs all programs one after another
o get_times         - a utility to collect times from the <program>/<program>.out files
o get_diffs         - a utility to compare results with the reference system

4.2.2 PARALLEL

The PARALLEL directory contains two subdirectories, CSCSUITE_2 and DM. CSCSUITE_2 is essentially a duplicate of the CSCSUITE_2 directory that has been created during the installation of the tape. Its role is to provide an environment similar to CSCSUITE_2, but for running the 14 application codes in parallel, P > 1. I also added a few new scripts to facilitate this:

o run_parallel       - runs an application in parallel with a varying number of processors
o get_parallel_times - produces a tabulated list of parallel times for a single application

DM is the directory for the distributed memory benchmarks COMMS1, MD1, PDE1 and QCD1, as well as a depository for the PARMACS - PVM 2.4.2 interface. The interface calls are documented briefly in appendix E. The test program COMMS1 does not actually belong to the DM-applications but to the miscellaneous tests; as it contains PARMACS macros, it should be run under this directory. It is explained briefly under 4.2.3, MISC.
4.2.3 MISC

MISC contains subdirectories and files for the miscellaneous tests:

o MEMTEST - several memory tests:
  - CACHEMISS to test hardware behaviour during cache conflicts
  - LARGEMEM to test how large a single array can be allocated
  - MEMSCAN to scan arrays with stride one or randomly
o IOTEST - I/O subsystem tests:
  - IOZONE to write/read a file with several block/file sizes
o KERNEL - kernel operations test:
  - BRE to run 15 different kernel tests

Although the MISC directory does not contain the communication test (COMMS1), such a test is included under the PARALLEL/DM directory. The purpose of that test is to measure the communication speed between two nodes when the message size varies from 1 to 40000 bytes.

4.2.4 Timer functions

Throughout the benchmark the subdirectory 'lib/' contains the timer functions. As we are not interested in CPU-times in this benchmark but in wall clock times, a utility function for obtaining them has been coded. The times are obtained using C-routines that are called by Fortran-routines; the Fortran-routines in turn are called by the user program. Some minor (or no) modifications are needed to link the timer-interface properly. The Fortran-routines ('lib/src/timer.f') are:

- SUBROUTINE INITIM(IDUMMY)
  Initializes the timer (once per run). Done implicitly by the library when any of the Fortran-routines found in 'lib/src/timer.f' is called.
- SUBROUTINE SHOTIM(IDUMMY)
  Prints out the current wall clock time since the initialization of the timer.
- SUBROUTINE TIMER(T), DOUBLE PRECISION T
  Stores the wall clock time since initialization into the variable 'T'.
- DOUBLE PRECISION FUNCTION CPUTIM()
  Actually returns the wall clock time since initialization.

Among the C-routines ('lib/src/times.c') that the Fortran-routines above call, the most important is 'waltim'. As this is called from Fortran, some computer systems may need an underscore to be appended to the routine name ('waltim_') or a capitalized form to be used ('WALTIM'). Not all benchmarks use these timer routines. If the execution fails the first time any benchmark program is run, I recommend checking the 'lib/' directory first.

4.2.5 Controlling the number of processors

On some systems, especially shared memory multiprocessors, it is possible to specify explicitly at runtime the number of processors to be used. In order to be sure that the proper number of processors is in use, modify the 'build_script' command procedure before generating the run scripts. For example, Silicon Graphics requires the environment variable 'MP_SET_NUMTHREADS' to be set to one (1) when one-processor results are needed.

4.3 Compiling and linking the applications

4.3.1 CSCSUITE_2

In order to compile and link the applications successfully, change to the CSCSUITE_2 directory and go through the following steps:

(1) Run 'build_all_scripts' to create the Bourne-shell run-scripts. Modify, if necessary, the files 'build_script' or 'build_all_scripts'. NOTE: be sure that your application really uses ONE processor in this context; modify 'build_script' accordingly.
(2) Create the file <arch>.make_flags to contain the default settings for your system for the subsequent makefile-generation step. (Hint: use the file 'SGI.make_flags' as an example.)
(3) Generate the makefiles under the <program>/src directories using the command 'build_all_makes'.
(4) Check the validity of the just created makefiles, <program>/src/Makefile.<arch>.
(5) Make the executables and record the compilation & linking time for each application. Use the command 'make_all_programs'. If make fails, lower the optimization level for the particular source by modifying the corresponding <program>/src/Makefile.<arch>, and re-run make.
Change the source code only in extreme cases, and report the changes. Here is what I did when I built the executables for SGI:

% build_all_scripts
% emacs SGI.make_flags          # See beginning of section 4.2
% build_all_makes
% make_all_programs SGI clean   # Clean all possible junk
% make_program lib SGI          # Be sure 'timer.a' exists. See 4.2.4
% make_all_programs SGI -n      # Trial run; don't make anything yet
% make_all_programs SGI         # Make really now!

After that I had to lower the optimization level in one of the CASTEP routines. So I modified CASTEP's makefile manually and re-ran the make:

% emacs CASTEP/src/Makefile.SGI
% make_program SGI CASTEP clean # Clean first to remove garbage
% make_program SGI CASTEP

To record the compilation & linking time for each application, you can make each program in the following manner:

% time make_program <arch> <program>

and write down the time.

4.3.2 Parallel benchmarks

4.3.2.1 CSCSUITE_2

Follow the same rules as in 4.3.1, but make changes to <arch>.make_flags to activate parallel compilation. Be sure that while running any of the applications the number of processors is set either to 'P/2' or 'P' (or P_opt!). Thus you must check 'build_script' before building the scripts, and refer to the result table in section 2.2.3.

4.3.2.2 DM

Generally, follow the same rules as in 4.3.1, but make changes to <arch>.make_flags. As this benchmark contains 4 tests for distributed memory computing that use the PARMACS macros presented in Appendix E, some site dependent changes are needed. The source files for these tests are written in Fortran (extension .f) or in the m4 macro processor language (.m4 files). Before the actual Fortran compilation, the m4-macro files containing PARMACS-macros must be preprocessed to get the Fortran equivalents. To facilitate this "preprocessing", we provide the vendor with a CSC-developed PARMACS -- PVM 2.4.2 interface. In order to link a DM-application successfully with our PVM-interface, the following steps must be performed:

o Go to the 'pvm2.4.2/' directory, read the PostScript instructions on how to install pvm on your machine (in case you are not familiar with it), and replace the libraries 'libpvm.a', 'libfpvm.a' and the PVM-daemon executable 'pvmd' with your equivalents.
o Go to 'macrolib/' and check the Fortran-interface that is called by the application program once the PARMACS calls have been substituted. The source files under 'macrolib/src' use some C-routines, like 'getarg()', 'iargc()', 'getcwd()' and 'sleep()'. Check that your system accepts these being called from Fortran.

In case you wish to use vendor specific PARMACS-routines, some other path must be followed. Once these, rather difficult, steps have been completed, you may have to modify the following files:

o build_make        - to re-organize the libraries
o <arch>.make_flags - to provide additions/substitutes to the compiler flags

Once everything seems to work, proceed as in 4.3.1. Note that each application will have two executables: 'host' and 'node'. These are the fixed-size master and slave process executables.

4.3.3 Miscellaneous benchmarks

4.3.3.1 Memory tests

This part contains three separate tests: CACHEMISS, LARGEMEM and MEMSCAN. They are located in the directories MISC/MEMTEST/CACHEMISS, MISC/MEMTEST/LARGEMEM and MISC/MEMTEST/MEMSCAN, respectively. Please refer to the 'Readme' files there for detailed information.

CACHEMISS is a C-program that performs a full matrix multiply with increasing stride. We plot a curve where the X-axis is the problem size (matrix dimension or effective stride) and the Y-axis is the nominal Mflop/s rate calculated by the program.
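For illustration only - this is not the CACHEMISS source - the measured quantity could be sketched in Fortran roughly as below, assuming a plain triple-loop matrix multiply whose leading dimension (the effective stride between columns) is varied; the routine name MMRATE, its argument list and the use of the TIMER routine from section 4.2.4 are my own assumptions:

      SUBROUTINE MMRATE(N, LDA, A, B, C, RATE)
C     Multiply two N by N matrices stored with leading dimension LDA
C     (LDA >= N).  Varying LDA changes the spacing of the columns in
C     memory and therefore the cache behaviour.  RATE returns the
C     nominal Mflop/s rate, counting 2*N**3 floating point operations.
C     TIMER is the wall clock routine described in section 4.2.4.
      INTEGER N, LDA
      DOUBLE PRECISION A(LDA,N), B(LDA,N), C(LDA,N), RATE
      DOUBLE PRECISION T0, T1
      INTEGER I, J, K
      CALL TIMER(T0)
      DO 40 J = 1, N
         DO 10 I = 1, N
            C(I,J) = 0.0D0
   10    CONTINUE
         DO 30 K = 1, N
            DO 20 I = 1, N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
   20       CONTINUE
   30    CONTINUE
   40 CONTINUE
      CALL TIMER(T1)
      RATE = 2.0D0 * DBLE(N)**3 / ((T1 - T0) * 1.0D6)
      RETURN
      END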
The vendor is encouraged to provide an equivalent curve for an optimized version, which will probably be written in Fortran or use library routines to perform the task. To create CACHEMISS, go to its directory, modify 'Makefile' and create the executable.

LARGEMEM is a small program that is used to test how large a single array can be allocated in Fortran, and under what circumstances. It contains four cases; in each case the array to be allocated is DOUBLE PRECISION:

o mem1.m4 - the array is allocated (probably) from the stack
o mem2.m4 - the array is put into a named COMMON-block
o mem3.m4 - the array is put into an unnamed COMMON-block
o mem4.m4 - the array is SAVE'd ("static")

To create LARGEMEM, go to its directory and run 'generate.csh'. This also executes the programs.

MEMSCAN is a "memory scanner" that performs one-dimensional array operations like summing, scaling, saxpy, assignment and so on. Four different versions exist:

o sequential and random scan (gather/scatter -like)
o single and double precision for both of the above

To create MEMSCAN, go to its directory, modify 'Makefile.SGI' and run it.

4.3.3.2 I/O-tests

This consists of a modified IOZONE test. The purpose of this test is to check the file buffer caching effect. The program writes data sequentially in N-byte blocks into a file of size X megabytes. Then the file is closed and opened again, but this time for reading. A version that explicitly does 'fsync()' before 'close()' is run against a version that does not 'fsync()'. The purpose of 'fsync()' is to force the write to disk before returning to the application. I hope all vendors can provide similar functionality if 'fsync()' itself does not exist. To create the IOZONE test programs, go to the IOTEST/IOZONE directory, modify 'Makefile.SGI' and run it.

4.3.3.3 Computational kernel test

This test contains 15 BLAS-1 or -2 operations that are supposed to test single-processor performance in kernel operations. The test is located in the KERNELS/BRE directory. To create 'BRE.exe', modify the makefile and run it.

4.4 Running the benchmarks

In order to run any of the benchmarks, be sure that there is enough disk space available in the planned run directory. This directory can be located on a different disk from the benchmark directory. Also, the root directory for the runs must exist. The run scripts do not refer to any particular run directory other than the one specified when the program was invoked. In the following sections we refer to the run directory as <rundir>.

4.4.1 The CSC Benchmark Suite

To run the CSC Benchmark Suite, you can use the following scripts:

o run_program      - runs a single program
o run_all_programs - runs all programs one after another

Consider 'run_program'. It runs a single program in the following manner (assuming the working directory is CSCSUITE_2):

o Checks that <rundir> exists. If not, tries to create it.
o Creates the application run directory <rundir>/<program>. If this already exists (from a previous trial run, for example), renames it to <rundir>/old.<program>; and if <rundir>/old.<program> in turn already exists, deletes all files under it.
o Copies all regular files (not directories or files under directories) found under <program> to <rundir>/<program>.
o Changes directory ('cd') to <rundir>/<program> and starts running the application.
o Upon completion, the file <rundir>/<program>/<program>.out is copied back to <program>/<program>.out.

Use 'get_times' and 'get_diffs' to collect the timing information and the differences compared to the reference results. 'get_times' applies directly to the <program>/<program>.out files, creating a tabulated timing summary for each application. 'get_diffs' similarly compares the output files <program>/<program>.out against the reference result files <program>/results/<program>.out.
A sample execution:

% run_program RIEMANN /tmp/bench  # Runs RIEMANN under /tmp/bench
% run_all_programs /tmp/bench     # Runs all programs in sequence
% get_times > .summary            # A tabulated list of timings
% get_diffs > .diff               # A list of differences

4.4.2 Parallel benchmarks

4.4.2.1 CSCSUITE_2

Generally, follow the rules of the sequential CSCSUITE_2. Also check that the number of processors is correct. The following scripts are new or modified with respect to the sequential CSCSUITE_2:

o build_script
o get_parallel_times
o get_parallel_diffs
o run_parallel

A sample execution:

% run_parallel ARCTWOD /tmp/paral 2 4  # Run ARCTWOD using 2 and 4 procs under /tmp/paral
% get_parallel_times > .summary        # A tabulated list of timings
% get_parallel_diffs > .diff           # A list of differences

4.4.2.2 DM

All DM-benchmarks consist of several cases to be run per application. The input files are found in the <program>/<program>.case* files. Each case label can be recognized in the input file name. For example, the case '11_8x1x1' refers to the file 'MD1/MD1.case11_8x1x1'. The case label means: run MD1 with 4 times 11 cubed atoms, using a processor topology of 8 by 1 by 1 processors (8 processors in a ring). It is up to the vendor to choose the topology; only the number of processors matters. For instance, the case above can be replaced with '11_2x2x2' if the vendor thinks this provides a faster turn-around time. In that case an input file 'MD1/MD1.case11_2x2x2' with the corresponding changes must also be supplied.

If the PVM-interface is acceptable, the following scripts may be found helpful:

o run_dm      - runs one DM-application with a variable number of "cases".
o run_all_dms - runs all required DM-applications in sequence.
o run_pvm     - invoked by 'run_dm'; checks whether the PVM-daemon is already running and prevents accidental starting of additional PVM-daemon(s).
o kill_pvm    - kills the currently active PVM-daemon, which in turn kills all processes that communicate with this daemon.

A sample execution:

% run_dm MD1 /tmp/dm 11_2x1x1 11_8x1x1  # Runs two cases of MD1 under /tmp/dm
% kill_pvm                              # Be sure that the PVM-daemon is down
% run_all_dms /tmp/bench                # Runs all DM-programs in sequence
% kill_pvm                              # Be sure that the PVM-daemon is down
% get_dm_times > .summary               # A tabulated list of timings

4.4.3 Miscellaneous benchmarks

All the following run scripts are found under the corresponding program directory.

Run CACHEMISS by typing 'cachemiss.csh'. Apply 'matlabgen.csh' to get a Matlab-suitable data file for curve plotting.
Run LARGEMEM by typing 'generate.csh'.
Run MEMSCAN by typing 'run_memscan'.
Run IOZONE by typing 'run_iozone'. Then apply 'get_times' to the logfile.
Run the BRE kernel test by typing 'run_bre'. Modify the BRE.dat file before that.
Run the communication test 'COMMS1' as part of the DM-applications.

5. Contact information

Any questions about the benchmark should be directed to me or Klaus Lindberg. I will be on holiday between April 13 and 25. Here is more information:

Sami Saarinen (or Klaus Lindberg)
Center for Scientific Computing (CSC)
Tietotie 6
P.O.Box 405
FIN-02101 Espoo
Finland

Tel: Int + 358 - 0 - 457 2713 (Sami, direct)
     Int + 358 - 0 - 457 4050 (Klaus, direct)
     Int + 358 - 0 - 457 1    (switchboard)
Fax: Int + 358 - 0 - 457 2302

And as stated in section 4.1, the benchmark file is obtainable via anonymous ftp in case of tape reading problems. The results should be sent to us in printed form and, if possible, as a tar-file on a streamer tape, reel or similar. We can also provide you with ftp access to a specific place at our center for sending us the result data.
Appendix A Configuration of the CSC reference computer system: (1) Hardware Configuration: Manufacturer: Silicon Graphics Inc. Model number: INDIGO R4000 CPU: MIPS R4000 Processor Chip Revision: 2.2 FPU: MIPS R4010 Floating Point Chip Revision: 0.0 Speed: 50 MHZ IP20 Processor Peak performance: 50 Mflop/s (64-bit) Number of CPUs: 1 Data cache size: 8 Kbytes Instruction cache size: 8 Kbytes Secondary unified instruction/data cache size: 1 Mbyte Main memory size: 96 Mbytes Disk subsystem: 1.6GB + 1.2GB + 0.4GB SCSI Other Hardware: None Network Interface: Integral Ethernet: ec0, version 1 (2) Software Configuration: O/S & Version: IRIX 4.0.5F Compilers & Version: SGI Fortran 77, 3.4.1 SGI Ansi C, 1.1 Compiler flags: -O2 -static -sopt,-so=3,-r=3,-ur=8 -jmpopt -lfastm (except in programs CASTEP, GRSOS, MOPAC, NBODYOPT, RIEMANN, SIMCZO, WHY12M and MD1 where "-O2 -static" was used) Other Software: Fortran 77 Fopt Scalar Optimizer (KAP) File system type: SGI efs (3) System Environment: System state: Multi-user Tuning Parameters: None Background load: None Appendix B CFP92, current release: Rel. 1.1: This suite contains 14 benchmarks performing floating-point computations. 12 of them are written in Fortran, 2 in C. The individual programs are: 013.spice2g6 Simulates analog circuits (double precision). 015.doduc Performs Monte-Carlo simulation of the time evolution of a thermo-hydraulic model for a nuclear reactor's component (double precision). 034.mdljdp2 Solves motion equations for a model of 500 atoms interacting through the idealized Lennard-Jones potential (double precision). 039.wave5 Solves particle and Maxwell's equations on a Cartesian mesh (single precision). 047.tomcatv Generates two-dimensional, boundary-fitted coordinate systems around general geometric domains (vectorizable, double precision). 048 ora Traces rays through an optical surface containing spherical and planar surfaces (double precision). 052.alvinn Trains a neural network using back propagation (single precision). 056.ear Simulates the human ear by converting a sound file to a cochleogram using Fast Fourier Transforms and other math library functions (single precision). 077.mdljsp2 Similar to 034.mdljdp2, solves motion equations for a model of 500 atoms (single precision). 078.swm256 Solves the system of shallow water equations using finite difference approximations (single precision). 089.su2cor Calculates masses of elementary particles in the framework of the Quark Gluon theory (vectorizable, double precision). 090.hydro2d Uses hydrodynamical Navier Stokes equations to calculate galactical jets (vectorizable, double precision). 093.nasa7 Executes seven program kernels of operations used frequently in NASA applications, such as Fourier transforms and matrix manipulations (double precision). 094.fpppp Calculates multi-electron integral derivatives (double precision). CINT92, current release: Rel. 1.1: This suite contains 6 benchmarks performing integer computations, all of them are written in C. The individual programs are: 008.espresso Generates and optimizes Programmable Logic Arrays. 022.li Uses a LISP interpreter to solve the nine queens problem, using a recursive backtracking algorithm. 023.eqntott Translates a logical representation of a Boolean equation to a truth table. 026.compress Reduces the size of input files by using Lempel-Ziv coding. 072.sc Calculates budgets, SPEC metrics and amortization schedules in a spreadsheet based on the UNIX cursor- controlled package "curses". 
085.gcc Translates preprocessed C source files into optimized Sun-3 assembly language output. Appendix C Use following guidelines to report system configuration under test: (1) Hardware Configuration: A description of the system (cpu/clock, fpu/clock), number of processors, relevant peripherals, etc. are included in this space. The amount of information supplied should be sufficient to allow duplication of results by another party. The following checklist is provided to show examples of hardware features which may affect the performance of benchmarks. The checklist is not intended to be all-inclusive, nor is each feature in the list required to be described. The rule of thumb is: "if it affects performance or the feature is required to duplicate the results, describe it": o Manufacturer o Model number o CPU component ID and clock speed o Floating Point Unit (FPU) ID and clock speed o Theoretical peak performance in Mflop/s (32- and 64-bit floating point operations) per processor o Number of CPUs, FPUs and vector units o Cache Size (per CPU), description and organization o Memory (Amount and description) o Disk subsystem configuration - Number of active connected disks - Disk controllers: ID, number and type - Disk: ID, number and type o Other Hardware o Network Interface (2) Software Configuration: The description of the software configuration used should be detailed enough so that another party could duplicate the results reported. The following is a list of items that should be documented: o Operating System Type and Release level (with revision/update level if appropriate) o Compiler release levels used. If multiple compilers are available, specify which ones were used and when o Compiler flags used per test program/source file o Other Software required to reproduce results o File system type used o Firmware level (3) System Environment: This section should document the overall systems environment used to run the benchmark. The following is a list of possible items that might be documented: o Single or multi-user state o System tuning parameters o Process tuning parameters (e.g. stack size, time slice) o Background load, if any Appendix D CSC Benchmark Suite: Version 2 ______________________________ ARCTWOD Program solves the Euler equations in generalized curvelinear coordinates using implicit finite-difference method with approximated factorizations and a diagonalization of the implicit operators. The test case involves with 0.8-Mach flow over upper of a biconvex airfoil in a stretched 165x48 rectangular mesh. The original code (ARC2D) was developed by Thomas Pulliam at the NASA Ames Research Center. CASTEP Program stands for CAmbridge Serial Total Energy Package and is an Ab Initio Molecular Dynamics program with Conjugate Gradient minimization for the electronic energy. The Al-Si-Al trimer, arranged originally in linear geometry, is calculated using Car-Parrinello-type first principles density functional method. The wave functions are expanded using plane waves, and the Fourier coefficients are solved using Conjugate Gradient technique. The atomic positions are solved by minimizing the total energy with respect to the atomic coordinates. The example program does 25 iterations. CASTEP is maintained and partially developed by Victor Milman at the Cambridge University, UK. The input file for the test runs was prepared by Ari Seitsonen at the Helsinki University of Technology. 
FREQUENCY Program creates frequency dependent complex transfer function matrix and calculates magnitude and phase angle for a given excitation in a given degree of freedom. To accomplish the task, the program uses 50 unevenly distributed excitation frequencies in a range of (0,24] Hz and ten lowest eigenvectors, that represent dynamic behaviour of a given structure in that range. The original code was written as a part of Sami Saarinen's Master Thesis on "A Numerical Study of the Harmonical Vibrations on the Off-Machine Coater". GRSOS Program is a restricted version of the solid-on-solid (SOS) growth model. It is basically an algorithm that models the deposition of particles on a substrate. The restriction is that the deposition occurs only of the difference in height between the site and its nearest neighbors is less than one. The SOS model itself does not allow overhangs or bubbles. The model can be used to study the temporal development of growth processes. The problem size is set to 1000 and program makes 5 iterations through the Monte-Carlo loop. GRSOS was developed by Tuomo Ala-Nissila at the Helsinki University of Technology while visiting the Brown University. INVPOW93 Program extracts 4 lowest eigenvalues and eigenvectors of a given symmetric 605x605 matrix. It uses Inverse Power method with Deflation after each converged (tolerance is 1.0E-6) eigenvalue. The Deflation process is carried out by using temporary disk storage and contains relatively large amount of sequential binary I/O. INVPOW93 was developed by Sami Saarinen at the CSC. MOPAC Program is a general purpose semi-empirical molecular orbital package for the study of chemical structures and reactions. In the test case the thermodynamic quantities of anisole (internal energy, heat capacity, partition function, and entropy) are calculated by using the AM1-method for translation rotation and vibrational degrees of freedom for the default temperature range 200-400 K. Version of the program is 5.0. MOPAC was developed by James J. P. Stewart at the US Air Force Academy. The input file was prepared by Raimo Uusvuori at the CSC. NASKER Program executes 7 program kernels used frequently in NASA and other scientific computing applications. These kernel operations are: MXM to perform matrix product on two input matrices, CFFT2D to perform radix 2 FFT on a two dimensional input array, CHOLSKY to perform a Cholesky decomposition in parallel on a set of input matrices, BTRIX to perform a block tridiagonal matrix solution, GMTRY to set up arrays for a vortex method solution and to perform Gaussian elimination on the resulting array, EMIT to create new vortices according to certain boundary conditions, and VPENTA to simultaneously invert three matrix pentadiagonals in a highly parallel fashion. NASKER was collected from other NASA programs by David Bailey and John Barton at the NASA Ames Research Center. NBODYOPT This program computes the relative locations of ten planets by integrating equations of motions with the 7th order Adams-Stoermer -difference method. Program contains a lot of trigonometric and other intrinsic function calls. The original code was written in Algol and converted to Fortran-77 by Hannu Karttunen at the CSC. The current version contains some obvious code optimization constructs that may remove unnecessary cache memory conflicts. RIEMANN Program calculates period matrices of real algebraic curves. 
NBODYOPT

This program computes the relative locations of ten planets by
integrating the equations of motion with the 7th-order Adams-Stoermer
difference method. The program contains a large number of trigonometric
and other intrinsic function calls. The original code was written in
Algol and converted to Fortran-77 by Hannu Karttunen at the CSC. The
current version contains some obvious code optimization constructs that
may remove unnecessary cache memory conflicts.

RIEMANN

Program calculates period matrices of real algebraic curves. Given a
set of generators and the fixed points of the generators, the algorithm
calculates the period matrices using a partial sum approximation (using
elements of the group up to a given word length). The algorithm is
recursive and uses complex-number arithmetic programmed in C. Numerical
underflow and overflow situations are not checked for, so one should
use normalized matrices as input and limit the word length to a small
enough number. RIEMANN was developed and the input file prepared by
Juha Haataja at the CSC.

SIMCZO

Program is a development version of a numerical simulation of
Czochralski crystal growth. Based on the finite element method, the
program simulates the crystal growth in the axisymmetric case,
including the free boundaries between solid and liquid and between
liquid and gas. The fluid flow in the liquid is governed by the coupled
Navier-Stokes and diffusion-convection equations. The temperature
distribution in the crystal is governed by the diffusion-convection
equation. SIMCZO was developed and the input file prepared by Jari
Jarvinen at the CSC.

WHY12M

Program solves a system of linear equations by Gaussian elimination
using certain sparse matrix techniques. The program is designed to
solve efficiently problems which contain only one system with a single
right-hand side. The order of the matrix is set to 4500 and the
population of non-zero elements is less than 1%. The program writes a
checkpoint file to make efficient restarts possible. The original code
(Y12M) was developed at the Regional Computing Center at the University
of Copenhagen. The input file was created by Sami Saarinen at the CSC.

MD1

This molecular dynamics program essentially involves the solution of
the equations of motion for a system of a large number of interacting
particles. The problem is solved numerically by calculating approximate
solutions of the second-order differential equations at a large number
of time steps of size 'dt' in a given interval. The particles are
considered to interact through an effective pair potential which is
adjusted to reproduce experimental results and is used to model the
exact many-body potential. For monatomic neutral atoms the form of pair
potential most often employed is the Lennard-Jones potential. A typical
MD calculation for N particles involves, for each time step, the
calculation of the forces on the N particles and their new positions,
and the calculation of energies and radial distribution functions. In
this test the number of atoms is 4 times NC cubed, where NC is set to
11 in the non-distributed memory version, so the simulation is carried
out with 5324 atoms. The number of time steps is set to 600. MD1 was
developed by Mark Pinches at the University of Southampton, UK.
Sequential and distributed memory versions are part of the European
parallel benchmarking effort, Genesis.
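To make the structure of one such force calculation concrete, a minimal
Lennard-Jones pair loop is sketched below. It is not the MD1 source:
the particle count, the initial simple-cubic positions and the reduced
units (epsilon = sigma = 1) are assumptions, and the cutoff radius,
periodic boundaries and the time integration itself are omitted.

c     A minimal Lennard-Jones force and energy pair loop in reduced
c     units (epsilon = sigma = 1).  Illustrative only; not MD1.
      program ljpair
      integer n
      parameter (n = 32)
      double precision x(n), y(n), z(n), fx(n), fy(n), fz(n)
      double precision dx, dy, dz, r2, r2i, r6i, ff, epot
      integer i, j
c     place the particles on a simple cubic lattice for the example
      do 10 i = 1, n
         x(i) = dble(mod(i-1,4))*1.2d0
         y(i) = dble(mod((i-1)/4,4))*1.2d0
         z(i) = dble((i-1)/16)*1.2d0
         fx(i) = 0.0d0
         fy(i) = 0.0d0
         fz(i) = 0.0d0
 10   continue
      epot = 0.0d0
c     forces and potential energy from u(r) = 4*(r**(-12) - r**(-6))
      do 30 i = 1, n - 1
         do 20 j = i + 1, n
            dx = x(i) - x(j)
            dy = y(i) - y(j)
            dz = z(i) - z(j)
            r2 = dx*dx + dy*dy + dz*dz
            r2i = 1.0d0/r2
            r6i = r2i*r2i*r2i
            epot = epot + 4.0d0*r6i*(r6i - 1.0d0)
            ff = 24.0d0*r2i*r6i*(2.0d0*r6i - 1.0d0)
            fx(i) = fx(i) + ff*dx
            fy(i) = fy(i) + ff*dy
            fz(i) = fz(i) + ff*dz
            fx(j) = fx(j) - ff*dx
            fy(j) = fy(j) - ff*dy
            fz(j) = fz(j) - ff*dz
 20      continue
 30   continue
      write (*,*) 'potential energy:', epot
      end

In a full MD step the forces computed this way are fed into the time
integrator to advance the particle positions and velocities over 'dt'.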
PDE1

Program solves the Poisson equation on a 3-dimensional grid by parallel
red-black relaxation. It is an extreme example of the class of PDE
solvers: due to the simplicity of the discretization of Poisson's
equation, the number of floating point operations per grid point is
quite small relative to more complex PDEs, so the ratio of computation
to communication is rather low. The parallelization is performed by
grid splitting: a part of the computational grid is assigned to each
processor, and after each computational step the values at the
boundaries of the subgrids are exchanged with the nearest neighbours.
The sequential version automatically produces results for a range of
problem sizes. The problem size is determined by the grid size, which
is related to the parameter N; the number of grid points in each
direction is 2**N + 1. For the non-distributed memory version N is set
to 7 and the standard implementation of the red-black relaxation is
used. PDE1 was developed by J. Klose and M. Lemke at PALLAS GmbH.
Sequential and distributed memory versions are part of the European
parallel benchmarking effort, Genesis.
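A minimal single-processor sketch of the red-black relaxation itself is
shown below. It is not the PDE1 source: the grid parameter (N = 5
instead of the benchmark's 7), the constant right-hand side, the fixed
number of sweeps and the absence of grid splitting and boundary
exchange are simplifications made for the sketch.

c     Red-black relaxation for the 3-d Poisson equation on the
c     unit cube: solves  laplacian(u) = f  with u = 0 on the
c     boundary (seven-point stencil).  Illustrative only; not
c     the PDE1 source.
      program redblk
      integer np, n
c     the benchmark itself uses np = 7
      parameter (np = 5, n = 2**np + 1)
      double precision u(n,n,n), f(n,n,n), h
      integer i, j, k, ic, isw
      h = 1.0d0/dble(n - 1)
c     zero initial guess and boundaries, constant right-hand side
      do 12 k = 1, n
         do 11 j = 1, n
            do 10 i = 1, n
               u(i,j,k) = 0.0d0
               f(i,j,k) = 1.0d0
 10         continue
 11      continue
 12   continue
c     one sweep updates first the red points (i+j+k even), then
c     the black points (i+j+k odd)
      do 40 isw = 1, 10
         do 35 ic = 0, 1
            do 30 k = 2, n - 1
               do 25 j = 2, n - 1
                  do 20 i = 2, n - 1
                     if (mod(i+j+k,2) .eq. ic) then
                        u(i,j,k) = (u(i-1,j,k) + u(i+1,j,k)
     &                           +  u(i,j-1,k) + u(i,j+1,k)
     &                           +  u(i,j,k-1) + u(i,j,k+1)
     &                           -  h*h*f(i,j,k))/6.0d0
                     end if
 20               continue
 25            continue
 30         continue
 35      continue
 40   continue
      write (*,*) 'solution at centre:', u((n+1)/2,(n+1)/2,(n+1)/2)
      end

In the distributed memory version each processor performs these sweeps
on its own part of the grid and, as described above, exchanges the
subgrid boundary values with its nearest neighbours after each
computational step.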
QCD1

Program is based on a 'pure gluon' SU(3) lattice gauge theory
simulation, using the Monte-Carlo heatbath technique. It uses the
'quenched' approximation, which neglects dynamical fermions. The
simulation is defined on a four-dimensional lattice which is a discrete
approximation to continuum space-time. The basic variables are 3 by 3
complex matrices; four such matrices are associated with every lattice
site. The lattice update is performed using a multi-hit Metropolis
algorithm. In the parallel version of the program, the lattice can be
distributed in any one or more of the four lattice directions. The
lattice size is based on a 4-dimensional space-time lattice of size
N = NS**3 * NT, where NT and NS are even integers. For the
non-distributed memory version NT and NS are set to 8 and the start
configuration is disordered. QCD1 was developed by Eckardt Kehl at
PALLAS GmbH. Sequential and distributed memory versions are part of the
European parallel benchmarking effort, Genesis.


Appendix E

PARMACS macro calls used in this benchmark:
[Ref: Parallel Computing 15 (1990) 119-132, North-Holland]
A short usage skeleton is given after the macro list.

ENVHOST, ENVNODE
    Declaration macros for the host and node programs. Put before the
    first data or executable statement in the Fortran-77 source.

INITHOST, INITNODE
    The first executable statement in the host and node main programs,
    respectively.

TORUS(nx,ny,nz,slave,process_file)
    Sets up an (nx,ny,nz) grid of (logical) processors and defines the
    executable to be named 'slave'. Information to be used by
    REMOTE_CREATE is passed to the temporary file 'process_file' (not
    used by the PVM-interface).

REMOTE_CREATE(process_file,process_id_vector)
    A macro for process creation. Reads the information from
    'process_file' that was written by the TORUS macro. On return
    'process_id_vector' contains the instance numbers (thread/process
    ids) of the just created 'slave's.

MYPROC
    A variable identifying the process's own id.

HOSTID
    A variable identifying the host process id.

SEND(target_id,buffer,buffer_len,msgtype)
    Asynchronously sends 'buffer_len' bytes pointed to by 'buffer' to
    the process id 'target_id'. The message identification type is
    'msgtype'. (Note: my PVM-interface internally encodes the
    'msgtype' during the send as 'msgtype * MAXNODES + target_id'.
    This made the PARMACS RECV easier to implement.)

SENDR(target_id,buffer,buffer_len,msgtype)
    Same as SEND(...), but synchronous. (PVM-interface: SEND.)

RECV(buffer,buffer_len,actual_len,sender,msgtype,condition)
    Receives a message into 'buffer'; its length cannot exceed
    'buffer_len' bytes. The actual length is stored in 'actual_len',
    the sending instance in 'sender' and the actual message type in
    'msgtype'. The receive is made once 'condition' is fulfilled:

    o MATCH_ID(select_sender) : receives the next message that arrives
      from the selected sender.
    o MATCH_TYPE(select_type) : receives any message that has the
      selected type.
    o MATCH_ID_AND_TYPE(select_sender,select_type) : both of the
      previous conditions must be true.
    o MATCH_ANY : receives any message that arrives.

    (PVM-interface: RECV polls the receive buffer until a UDP message
    fulfilling the 'condition' has arrived.)

RECVR(buffer,buffer_len,actual_len,sender,msgtype,condition)
    Same as RECV(...), but synchronous. (PVM-interface: RECV.)

CLEAN_UP
    This macro MUST BE the last statement executed before leaving the
    host or node program. (PVM-interface: the nodes stay in 'barrier()'
    until all nodes have reached the barrier. When the host exits, it
    tries to 'terminate()' any hanging nodes.)
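To show how these macros fit together, a heavily abbreviated host/node
skeleton follows. Only the macro names come from the list above; the
1 x 1 x 4 torus, the file name, the buffer sizes and the message-type
constants are invented for the illustration, and the text must still be
run through the PARMACS macro expander before it is legal Fortran-77,
so it does not compile as such.

c     ---- host program (sketch only) ----
      program host
      ENVHOST
      integer pids(64), nproc, i, ilen, isender, itype
      double precision work(100), result(100)
      INITHOST
c     one 'slave' executable per node on a 1 x 1 x nproc torus
      nproc = 4
      TORUS(1, 1, nproc, 'slave', 'torus.tmp')
      REMOTE_CREATE('torus.tmp', pids)
c     send one block of work (800 bytes) of type 1 to every node
      do 10 i = 1, nproc
         SEND(pids(i), work, 800, 1)
 10   continue
c     collect one reply of type 2 from each node
      do 20 i = 1, nproc
         RECV(result, 800, ilen, isender, itype, MATCH_TYPE(2))
 20   continue
      CLEAN_UP
      end

c     ---- node program (normally a separate source file) ----
      program slave
      ENVNODE
      integer ilen, isender, itype
      double precision work(100)
      INITNODE
      write (*,*) 'node', MYPROC, 'started'
c     receive the work of type 1 sent by the host
      RECV(work, 800, ilen, isender, itype,
     &     MATCH_ID_AND_TYPE(HOSTID, 1))
c     ... compute on the local part of the problem ...
c     return the result to the host as message type 2
      SEND(HOSTID, work, 800, 2)
      CLEAN_UP
      end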