Extrae
User guide manual
for version 3.1.1rc

tools@bsc.es




1. Quick start guide

1.1 The instrumentation package

1.1.1 Uncompressing the package

Extrae is a dynamic instrumentation package to trace programs compiled and run with the shared memory model (like OpenMP and pthreads), the message passing (MPI) programming model, or both programming models (different MPI processes using OpenMP or pthreads within each MPI process). Extrae generates trace files that can be later visualized with Paraver.

The package is distributed in compressed tar format (e.g., extrae.tar.gz). To unpack it, execute the following command line from the desired target directory:

               gunzip -c extrae.tar.gz | tar -xvf -

The unpacking process will create several directories in the current directory (see table 1.1).


Table 1.1: Package contents description

Directory      Contents
bin            Contains the binary files of the Extrae tool.
etc            Contains some scripts to set up environment variables and the Extrae internal files.
lib            Contains the Extrae tool libraries.
share/man      Contains the Extrae manual entries.
share/doc      Contains the Extrae manuals (pdf, ps and html versions).
share/example  Contains examples to illustrate the Extrae instrumentation.


1.1.2 Post-configuring the package

There are some files within Extrae that contain references to libraries given at configure time. Because of this, you need to adapt the installation to your system. Extrae provides an automatic mechanism that post-configures the package for this purpose. Once you have installed Extrae, just set the EXTRAE_HOME environment variable to the directory where you have untarred it and execute ${EXTRAE_HOME}/bin/extrae-post-installation-upgrade.sh. This script will guide you through some questions about the location of several libraries needed by Extrae. The script shows the current value for the library directories and gives the user the chance to change them. In case a library was unused at configure time, the current value will be an empty string.
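
For example, assuming the package was untarred into /apps/extrae (a placeholder path), the post-configuration could be run as follows:

               export EXTRAE_HOME=/apps/extrae
               ${EXTRAE_HOME}/bin/extrae-post-installation-upgrade.sh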

Figure 1.1: An example of the extrae-post-installation-upgrade.sh script execution

1.2 Quick running


There are several examples included in the installation package. These examples are installed in ${EXTRAE_HOME}/share/example and cover different application types (including serial, MPI, OpenMP, CUDA, etc.). We suggest that users look at them to get an idea of how to instrument their applications.

Once the package has been unpacked, set the EXTRAE_HOME environment variable to the directory where the package was installed, using the export or setenv command depending on the shell you use. If you use an sh-based shell (like sh, bash, ksh, zsh, ...), issue this command

               export EXTRAE_HOME=dir
whereas if you use a csh-based shell (like csh, tcsh), execute the following command
               setenv EXTRAE_HOME dir
where dir refers to the directory where Extrae was installed. Henceforth, all references to environment variables will follow the sh format unless specified otherwise.

Extrae is offered in two different flavors: as a DynInst-based application, or as a stand-alone application. DynInst is a dynamic instrumentation library that allows the injection of code into a running application without the need to recompile the target application. If the DynInst instrumentation library is not installed, Extrae also offers other mechanisms to trace applications.


1.2.1 Quick running Extrae - based on DynInst

Extrae needs some environment variables to be set up on each session. Issuing the command

               source ${EXTRAE_HOME}/etc/extrae.sh

on a sh-based shell, or

               source ${EXTRAE_HOME}/etc/extrae.csh

on a csh-based shell will do the work. Then copy the default XML configuration file into the working directory by executing

               cp ${EXTRAE_HOME}/share/example/MPI/extrae.xml .

If needed, set the application environment variables as usual (like OMP_NUM_THREADS, for example), and finally launch the application using the ${EXTRAE_HOME}/bin/extrae instrumenter like:

               ${EXTRAE_HOME}/bin/extrae -config extrae.xml <program>

where <program> is the application binary.
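
Putting these steps together, a complete session may resemble the following sketch (my_app is a placeholder for the application binary):

               source ${EXTRAE_HOME}/etc/extrae.sh
               cp ${EXTRAE_HOME}/share/example/MPI/extrae.xml .
               ${EXTRAE_HOME}/bin/extrae -config extrae.xml ./my_app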


1.2.2 Quick running Extrae - NOT based on DynInst

Extrae needs some environment variables to be set up on each session. Issuing the command

               source ${EXTRAE_HOME}/etc/extrae.sh

on a sh-based shell, or

               source ${EXTRAE_HOME}/etc/extrae.csh

on a csh-based shell will do the work. Then copy the default XML configuration file into the working directory by executing

               cp ${EXTRAE_HOME}/share/example/MPI/extrae.xml .

and export the EXTRAE_CONFIG_FILE as

               export EXTRAE_CONFIG_FILE=extrae.xml

If needed, set the application environment variables as usual (like OMP_NUM_THREADS, for example). Just before executing the target application, issue the following command:

               export LD_PRELOAD=${EXTRAE_HOME}/lib/<lib>

where <lib> is one of those listed in Table 1.2. A complete usage sketch follows the table.


Table 1.2: Available libraries in Extrae. Their availability depends upon the configure process.

Library           Application type
libseqtrace       Serial
libmpitrace       MPI
libomptrace       OpenMP
libpttrace        pthread
libsmpsstrace     SMPss
libnanostrace     nanos/OMPss
libcudatrace      CUDA
libocltrace       OpenCL
javaseqtrace.jar  Java
libompitrace      MPI + OpenMP
libptmpitrace     MPI + pthread
libsmpssmpitrace  MPI + SMPss
libnanosmpitrace  MPI + nanos/OMPss
libcudampitrace   MPI + CUDA
libcudaompitrace  MPI + OpenMP + CUDA
liboclmpitrace    MPI + OpenCL
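
As a complete sketch of this mechanism for an MPI application (the mpirun launcher, the number of tasks and the my_mpi_app binary are placeholders for your environment; note that the preloaded file is the shared object of the chosen library):

               source ${EXTRAE_HOME}/etc/extrae.sh
               export EXTRAE_CONFIG_FILE=extrae.xml
               export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
               mpirun -np 4 ./my_mpi_app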


1.3 Quick merging the intermediate traces

Once the intermediate trace files (*.mpit files) have been created, they have to be merged (using the mpi2prv command) in order to generate the final Paraver trace file. Execute the following command to proceed with the merge:

               ${EXTRAE_HOME}/bin/mpi2prv -f TRACE.mpits -o output.prv

The result of the merge process is a Paraver tracefile called output.prv. If the -o option is not given, the resulting tracefile is called EXTRAE_Paraver_Trace.prv.


2. Introduction

Extrae is a dynamic instrumentation package to trace programs compiled and run with the shared memory model (like OpenMP and pthreads), the message passing (MPI) programming model, or both programming models (different MPI processes using OpenMP or pthreads within each MPI process). Extrae generates trace files that can be visualized with Paraver.

Extrae is currently available on different platforms and operating systems: IBM PowerPC running Linux or AIX, and x86 and x86-64 running Linux. It has also been ported to OpenSolaris and FreeBSD.

The combined use of Extrae and Paraver offers an enormous analysis potential, both qualitative and quantitative. With these tools the actual performance bottlenecks of parallel applications can be identified. The microscopic view of the program behavior that the tools provide is very useful to optimize the parallel program performance.

This document provides the basic knowledge needed to use the Extrae tool. Chapter 3 explains how the package can be configured and installed. Chapter 8 explains how to monitor an application to obtain its trace file. At the end of this document there are appendices that include a Frequently Asked Questions appendix and a list of the routines instrumented by the package.

What is the Paraver tool?

Paraver is a flexible parallel program visualization and analysis tool based on an easy-to-use Motif GUI. Paraver was developed responding to the need of having a qualitative global perception of the application behavior by visual inspection, and then being able to focus on the detailed quantitative analysis of the problems. Paraver provides a large amount of information useful to decide the points on which to invest the programming effort to optimize an application.

Expressive power, flexibility and the capability of efficiently handling large traces are key features addressed in the design of Paraver. The clear and modular structure of Paraver plays a significant role towards achieving these targets.


One of the main features of Paraver is the flexibility to represent traces coming from different environments. Traces are composed of state records, events and communications with an associated timestamp. These three elements can be used to build traces that capture the behavior over time of very different kinds of systems. The Paraver distribution includes, either in its own distribution or as additional packages, the following instrumentation modules:

  1. Sequential application tracing: it is included in the Paraver distribution. It can be used to trace the value of certain variables, procedure invocations, etc. in a sequential program.
  2. Parallel application tracing: a set of modules are optionally available to capture the activity of parallel applications using the shared-memory or message-passing paradigms, or a combination of them.
  3. System activity tracing in a multiprogrammed environment: an application to trace processor allocations and process migrations is optionally available in the Paraver distribution.
  4. Hardware counters tracing: an application to trace the hardware counter values is optionally available in the Paraver distribution.

Where can the Paraver tool be found?

The Paraver distribution can be found at URL:

http://www.bsc.es/paraver

Paraver binaries are available for Linux/x86, Linux/x86-64, Linux/ia64 and Windows.

In the documentation section of the aforementioned URL you can find the Paraver Reference Manual and the Paraver Tutorial, in addition to the documentation for other instrumentation packages.

Extrae and Paraver tools e-mail support is tools@bsc.es.


3. Configuration, build and installation

3.1 Configuration of the instrumentation package

There are many options to be applied at configuration time for the instrumentation package. We point out here some of the relevant options, sorted alphabetically. To get the whole list run ./configure --help. Options can be enabled or disabled. To enable them use --enable-X or --with-X= (depending on which option is available); to disable them use --disable-X or --without-X.

3.2 Build

To build the instrumentation package, just issue make after the configuration.

3.3 Installation

To install the instrumentation package in the directory chosen at the configure step (through the --prefix option), issue make install.

3.4 Check

The Extrae package contains some consistency checks. The aim of such checks is to determine whether a functionality is operative in the target (installation) environment and/or to check whether the development of Extrae has introduced any misbehavior. To run the checks, just issue make check after the installation. Please notice that the checks are meant to be run on the machine where the configure script was run, thus the results of the checks on machines whose back-end nodes differ from the front-end nodes (like BG/* systems) are not representative at all.

3.5 Examples of configuration on different machines

All commands given here are examples to configure and install the package; you may need to tune them properly (i.e., choose the appropriate directories for packages and so on). These examples assume that you are using an sh/bash shell; you must adapt them if you use other shells (like csh/tcsh).

3.5.1 Bluegene (L and P variants)

Configuration command:


./configure --prefix=/homec/jzam11/jzam1128/aplic/extrae/2.2.0 \
  --with-papi=/homec/jzam11/jzam1128/aplic/papi/4.1.2.1 \
  --with-bfd=/bgsys/local/gcc/gnu-linux_4.3.2/powerpc-linux-gnu/powerpc-bgp-linux \
  --with-liberty=/bgsys/local/gcc/gnu-linux_4.3.2/powerpc-bgp-linux \
  --with-mpi=/bgsys/drivers/ppcfloor/comm --without-unwind --without-dyninst

Build and installation commands:


make
make install

3.5.2 BlueGene/Q

To enable parsing the XML configuration file, libxml2 must be installed. As of the time of writing this user guide, we have only been able to install the static version of the library on a BG/Q machine, so take this into consideration if you install libxml2 in the system. Similarly, the binutils package (responsible for translating application addresses into source code locations) that is available in the system may not be properly installed, and we suggest installing binutils from the source code using the BG/Q cross-compiler. Regarding the cross-compilers, we have found that using the IBM XL compilers may require using the XL libraries when generating the final application binary with Extrae, so we suggest using the GNU cross-compilers (/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-*).

If you want to add libxml2 and binutils support into Extrae, your configuration command may resemble:


./configure --prefix=/homec/jzam11/jzam1128/aplic/juqueen/extrae/2.2.1 \
  --with-mpi=/bgsys/drivers/ppcfloor/comm/gcc --without-unwind --without-dyninst \
  --disable-openmp --disable-pthread --with-libz=/bgsys/local/zlib/v1.2.5 \
  --with-papi=/usr/local/UNITE/packages/papi/5.0.1 \
  --with-xml-prefix=/homec/jzam11/jzam1128/aplic/juqueen/libxml2-gcc \
  --with-binutils=/homec/jzam11/jzam1128/aplic/juqueen/binutils-gcc \
  --enable-merge-in-trace

Otherwise, if you do not want to add support for the libxml2 library, your configuration may look like this:


./configure --prefix=/homec/jzam11/jzam1128/aplic/juqueen/extrae/2.2.1 \
  --with-mpi=/bgsys/drivers/ppcfloor/comm/gcc --without-unwind --without-dyninst \
  --disable-openmp --disable-pthread --with-libz=/bgsys/local/zlib/v1.2.5 \
  --with-papi=/usr/local/UNITE/packages/papi/5.0.1 --disable-xml

In any situation, the build and installation commands are:


make
make install

3.5.3 AIX

Some extensions of Extrae (nanos, SMPss and OpenMP) do not work properly on AIX. In addition, if using IBM MPI (aka POE), make will complain when generating the parallel merge if the main compiler is not xlc/xlC. So, you can either change the compiler or disable the parallel merge at the compile step. Also, the ar command can complain if 64-bit binaries are generated; it is a good idea to run make with OBJECT_MODE=64 set to avoid this.

3.5.3.1 Compiling the 32bit package using the IBM compilers

Configuration command:


CC=xlc CXX=xlC ./configure --prefix=PREFIX --disable-nanos --disable-smpss \
  --disable-openmp --with-binary-type=32 --without-unwind --enable-pmapi \
  --without-dyninst --with-mpi=/usr/lpp/ppe.poe

Build and installation commands:


make
make install

3.5.3.2 Compiling the 64bit package without the parallel merge

Configuration command:


./configure --prefix=PREFIX --disable-nanos --disable-smpss --disable-openmp \
  --disable-parallel-merge --with-binary-type=64 --without-unwind --enable-pmapi \
  --without-dyninst --with-mpi=/usr/lpp/ppe.poe

Build and installation commands:


OBJECT_MODE=64 make
make install

3.5.4 Linux

3.5.4.1 Compiling using default binary type using MPICH, OpenMP and PAPI

Configuration command:


./configure --prefix=PREFIX --with-mpi=/home/harald/aplic/mpich/1.2.7 \
  --with-papi=/usr/local/papi --enable-openmp --without-dyninst --without-unwind

Build and installation commands:


make
make install

3.5.4.2 Compiling 32bit package in a 32/64bit mixed environment

Configuration command:


./configure --prefix=PREFIX --with-mpi=/opt/osshpc/mpich-mx \
  --with-papi=/gpfs/apps/PAPI/3.6.2-970mp --with-binary-type=32 \
  --with-unwind=$HOME/aplic/unwind/1.0.1/32 --with-elf=/usr --with-dwarf=/usr \
  --with-dyninst=$HOME/aplic/dyninst/7.0.1/32

Build and installation commands:


make
make install

3.5.4.3 Compiling 64bit package in a 32/64bit mixed environment

Configuration command:


./configure --prefix=PREFIX --with-mpi=/opt/osshpc/mpich-mx \
  --with-papi=/gpfs/apps/PAPI/3.6.2-970mp --with-binary-type=64 \
  --with-unwind=$HOME/aplic/unwind/1.0.1/64 --with-elf=/usr --with-dwarf=/usr \
  --with-dyninst=$HOME/aplic/dyninst/7.0.1/64

Build and installation commands:


make
make install

3.5.4.4 Compiling using default binary type using OpenMPI and PACX

Configuration command:


./configure --prefix=PREFIX --with-mpi=/home/harald/aplic/openmpi/1.3.1 \
  --with-pacx=/home/harald/aplic/pacx/07.2009-openmpi --without-papi \
  --without-unwind --without-dyninst

Build and installation commands:


make
make install

3.5.4.5 Compiling using default binary type, using OpenMPI, DynInst and libunwind

Configuration command:


./configure --prefix=PREFIX --with-mpi=/home/harald/aplic/openmpi/1.3.1 \
  --with-dyninst=/home/harald/dyninst/7.0.1 --with-dwarf=/usr --with-elf=/usr \
  --with-unwind=/home/harald/aplic/unwind/1.0.1 --without-papi

Build and installation commands:


make
make install

3.5.4.6 Compiling on CRAY XT5 for 64bit package and adding sampling

Notice the --disable-xmltest flag: as back-end programs cannot be run on the front-end, we skip running the XML test. This example also uses a local installation of libunwind.

Configuration command:


CC=cc CFLAGS='-O3 -g' LDFLAGS='-O3 -g' CXX=CC CXXFLAGS='-O3 -g' ./configure \
  --with-mpi=/opt/cray/mpt/4.0.0/xt/seastar/mpich2-gnu --with-binary-type=64 \
  --with-xml-prefix=/sw/xt5/libxml2/2.7.6/sles10.1_gnu4.1.2 --disable-xmltest \
  --with-bfd=/opt/cray/cce/7.1.5/cray-binutils \
  --with-liberty=/opt/cray/cce/7.1.5/cray-binutils \
  --enable-sampling --enable-shared=no --prefix=PREFIX \
  --with-papi=/opt/xt-tools/papi/3.7.2/v23 \
  --with-unwind=/ccs/home/user/lib --without-dyninst

Build and installation commands:


make
make install

3.5.4.7 Compiling for the Intel MIC accelerator / Xeon Phi

The Intel MIC accelerators (also codenamed KnightsFerry - KNF and KnightsCorner - KNC), or Xeon Phi processors, are not binary compatible with the host (even if it is an Intel x86 or x86/64 chip), thus the Extrae package must be compiled specifically for the accelerator (twice if you want Extrae for the host as well). While the host configuration and installation has been shown before, in order to compile Extrae for the accelerator you must configure Extrae like:


./configure --with-mpi=/opt/intel/impi/4.1.0.024/mic --without-dyninst \
  --without-papi --without-unwind --disable-xml --disable-posix-clock \
  --with-libz=/opt/extrae/zlib-mic --host=x86_64-suse-linux-gnu \
  --prefix=/home/Computational/harald/extrae-mic --enable-mic \
  CFLAGS="-O -mmic -I/usr/include" CC=icc CXX=icpc \
  MPICC=/opt/intel/impi/4.1.0.024/mic/bin/mpiicc

To compile it, just issue:


make
make install

3.5.4.8 Compiling on a Power CELL processor using Linux

Configuration command:


./configure --with-mpi=/opt/openmpi/ppc32 --without-unwind --without-dyninst \
  --without-papi --prefix=/gpfs/data/apps/CEPBATOOLS/extrae/2.2.0/openmpi/32 \
  --with-binary-type=32

Build and installation commands:


make
make install

3.5.4.9 Compiling on an ARM-based processor machine using Linux

If using the GNU toolchain to compile the library, we suggest using at least version 4.6.2 because of its enhanced support for this architecture.

Configuration command:


CC=/gpfs/APPS/BIN/GCC-4.6.2/bin/gcc-4.6.2 ./configure \
  --prefix=/gpfs/CEPBATOOLS/extrae/2.2.0 \
  --with-unwind=/gpfs/CEPBATOOLS/libunwind/1.0.1-git \
  --with-papi=/gpfs/CEPBATOOLS/papi/4.2.0 --with-mpi=/usr \
  --enable-posix-clock --without-dyninst

Build and installation commands:


make
make install

3.5.4.10 Compiling in a Slurm/MOAB environment with support for MPICH2

Configuration command:


export MP_IMPL=anl2
./configure --prefix=PREFIX \
  --with-mpi=/gpfs/apps/MPICH2/mx/1.0.8p1..3/32 \
  --with-papi=/gpfs/apps/PAPI/3.6.2-970mp --with-binary-type=64 \
  --without-dyninst --without-unwind

Build and installation commands:


make
make install

3.5.4.11 Compiling in an environment with IBM compilers and POE

Configuration command:


CC=xlc CXX=xlC ./configure --prefix=PREFIX --with-mpi=/opt/ibmhpc/ppe.poe \
  --without-dyninst --without-unwind --without-papi

Build and installation commands:


make
make install

3.5.4.12 Compiling in an environment with GNU compilers and POE

Configuration command:


./configure --prefix=PREFIX --with-mpi=/opt/ibmhpc/ppe.poe --without-dyninst \
  --without-unwind --without-papi

Build and installation commands:


MP_COMPILER=gcc make
make install

3.5.4.13 Compiling Extrae 3.0 on the Hornet / Cray XC40 system

Configuration command, enabling MPI, PAPI and online analysis over MRNet:


./configure --prefix=/zhome/academic/HLRS/xhp/xhpgl/tools/extrae/intel \
  --with-mpi=/opt/cray/mpt/7.1.2/gni/mpich2-intel/140 \
  --with-unwind=/zhome/academic/HLRS/xhp/xhpgl/tools/libunwind --without-dyninst \
  --with-papi=/opt/cray/papi/5.3.2.1 --enable-online \
  --with-mrnet=/zhome/academic/HLRS/xhp/xhpgl/tools/mrnet/4.1.0 \
  --with-spectral=/zhome/academic/HLRS/xhp/xhpgl/tools/spectral/3.1 \
  --with-synapse=/zhome/academic/HLRS/xhp/xhpgl/tools/synapse/2.0

Build and installation commands:


make
make install

3.5.4.14 Compiling Extrae 3.0 on the Shaheen II / Cray XC40 system

With the following modules loaded


module swap PrgEnv-XXX/YYY PrgEnv-cray/5.2.40
module load cray-mpich

Configuration command, enabling MPI and PAPI:


./configure --prefix=${PREFIX} \
  --with-mpi=/opt/cray/mpt/7.1.1/gni/mpich2-cray/83 --with-binary-type=64 \
  --with-unwind=/home/markomg/lib --without-dyninst --disable-xmltest \
  --with-bfd=/opt/cray/cce/default/cray-binutils \
  --with-liberty=/opt/cray/cce/default/cray-binutils \
  --enable-sampling --enable-shared=no --with-papi=/opt/cray/papi/5.3.2.1

Build and installation commands:


make
make install

3.6 Knowing how a package was configured

If you are interested in knowing how an Extrae package was configured, execute the following command after setting EXTRAE_HOME to the base location of an installation:


${EXTRAE_HOME}/etc/configured.sh

This command will show the configure command itself and the location of some dependencies of the instrumentation package.


4. Extrae XML configuration file

Extrae is configured through an XML file that is set through the EXTRAE_CONFIG_FILE environment variable. The included examples provide four XML files to serve as a basis for the end user.

Please note that most of the nodes present in the XML file have an enabled attribute that allows turning parts of the instrumentation mechanism on and off. For example, <mpi enabled="yes"> means that MPI instrumentation is enabled and all the contained XML subnodes, if any, will be processed; whereas <mpi enabled="no"> means skipping the gathering of MPI information and not processing any XML subnodes.

Each section points out which environment variables can be used if the tracing package lacks XML support. See appendix B for the entire list.

Sometimes the XML tags are used for time selection (duration, for instance). In such tags, the following suffixes can be used: n or ns for nanoseconds, u or us for microseconds, m or ms for milliseconds, s for seconds, M for minutes, H for hours and D for days. For example, 500u means 500 microseconds.


4.1 XML Section: Trace configuration

The basic trace behavior is determined in the first part of the XML and contains all of the remaining options. It looks like:

<?xml version='1.0'?>

<trace enabled="yes"
 home="@sed_MYPREFIXDIR@"
 initial-mode="detail"
 type="paraver"
 xml-parser-id="@sed_XMLID@"
>

< ... other XML nodes ... >

</trace>

The <?xml version='1.0'?> is mandatory for all XML files. Don't touch this. The available tunable options are under the <trace> node:


See EXTRAE_ON, EXTRAE_HOME, EXTRAE_INITIAL_MODE and EXTRAE_TRACE_TYPE environment variables in appendix B.


4.2 XML Section: MPI

The MPI configuration part is nested in the config file (see section 4.1) and its nodes are the following:

<mpi enabled="yes">
  <counters enabled="yes" />
</mpi>

Performance information can be gathered at the entry and exit of MPI calls. To activate this behavior, just set the attribute of the nested <counters> node to yes.


See EXTRAE_DISABLE_MPI and EXTRAE_MPI_COUNTERS_ON environment variables in appendix B.


4.3 XML Section: PACX

The PACX configuration part is nested in the config file (see section 4.1) and its nodes are the following:

<pacx enabled="yes">
  <counters enabled="yes" />
</pacx>

Performance information can be gathered at the entry and exit of PACX calls. To activate this behavior, just set the attribute of the nested <counters> node to yes.


See EXTRAE_DISABLE_PACX and EXTRAE_PACX_COUNTERS_ON environment variables in appendix B.


4.4 XML Section: pthread

The pthread configuration part is nested in the config file (see section 4.1) and its nodes are the following:

<pthread enabled="yes">
  <locks enabled="no" />
  <counters enabled="yes" />
</pthread>

The tracing package allows gathering information about some pthread routines. In addition, the user can also enable gathering information about locks as well as gathering performance counters in all of these routines. This is achieved by modifying the enabled attribute of the <locks> and <counters> nodes, respectively.


See EXTRAE_DISABLE_PTHREAD, EXTRAE_PTHREAD_LOCKS and EXTRAE_PTHREAD_COUNTERS_ON environment variables in appendix B.


4.5 XML Section: OpenMP

The OpenMP configuration part is nested in the config file (see section 4.1) and its nodes are the following:

<openmp enabled="yes">
  <locks enabled="no" />
  <counters enabled="yes" />
</openmp>

The tracing package allows gathering information about some OpenMP runtimes and outlined routines. In addition, the user can also enable gathering information about locks as well as gathering performance counters in all of these routines. This is achieved by modifying the enabled attribute of the <locks> and <counters> nodes, respectively.


See EXTRAE_DISABLE_OMP, EXTRAE_OMP_LOCKS and EXTRAE_OMP_COUNTERS_ON environment variables in appendix B.


4.6 XML Section: CELL

The Cell configuration part is only parsed for tracing packages suited for the Cell architecture and, like the rest of the sections, it is nested in the config file (see section 4.1). The available nodes only affect the SPE side, and they are:

<cell enabled="no">
  <spu-file-size enabled="yes">5</spu-file-size>
  <spu-buffer-size enabled="yes">64</spu-buffer-size>
  <spu-dma-channel enabled="yes">2</spu-dma-channel>
</cell>


See EXTRAE_SPU_FILE_SIZE, EXTRAE_SPU_BUFFER_SIZE and EXTRAE_SPU_DMA_CHANNEL environment variables in appendix B.


4.7 XML Section: Callers

<callers enabled="yes">
  <mpi enabled="yes">1-3</mpi>
  <pacx enabled="no">1-3</pacx>
  <sampling enabled="no">1-5</sampling>
  <dynamic-memory enabled="no">1-5</dynamic-memory>
</callers>

Callers are the routine addresses present in the process stack at any given moment during the application run. Callers can be used to link the tracefile with the source code of the application.

The instrumentation library can collect a partial view of those addresses during the instrumentation. Such collected addresses are translated by the merging process if the corresponding parameter is given and the application has been compiled and linked with debug information.

There are three points where the instrumentation can gather this information:

The user can choose which addresses to save in the trace (starting from 1, which is the closest point to the MPI call or sampling point), specifying several stack levels separated by commas or using the hyphen symbol to denote ranges (e.g., 1-3,5 collects the three closest levels plus the fifth).


See EXTRAE_MPI_CALLER and EXTRAE_PACX_CALLER environment variables in appendix B.


4.8 XML Section: User functions

<user-functions enabled="no"
  list="/home/bsc41/bsc41273/user-functions.dat"
  exclude-automatic-functions="no">
  <counters enabled="yes" />
</user-functions>

The file contains a list of functions to be instrumented by Extrae. There are different alternatives to instrument application functions, and some alternatives provide additional flexibility; as a result, the format of the list varies depending on the instrumentation mechanism used.

The exclude-automatic-functions attribute is used only by the DynInst instrumenter. By setting this attribute to yes the instrumenter will avoid automatically instrumenting the routines that either call OpenMP outlined routines (i.e. routines with OpenMP pragmas) or call CUDA kernels.

Finally, in order to gather performance counters in these functions, and also in those instrumented using the extrae_user_function API call, the <counters> node has to be enabled.


See EXTRAE_FUNCTIONS environment variable in appendix B.


4.9 XML Section: Performance counters

The instrumentation library can be compiled with support for collecting performance metrics of different components available on the system. These components include:

Here is an example of the counters section in the XML configuration file:

<counters enabled="yes">
  <cpu enabled="yes" starting-set-distribution="1">
    <set enabled="yes" domain="all" changeat-time="5s">
      PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_L1_DCM
      <sampling enabled="yes" period="100000000">PAPI_TOT_CYC</sampling>
    </set>
    <set enabled="yes" domain="user" changeat-globalops="5">
      PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_FP_INS
    </set>
  </cpu>
  <network enabled="yes" />
  <resource-usage enabled="yes" />
</counters>


See EXTRAE_COUNTERS, EXTRAE_NETWORK_COUNTERS and EXTRAE_RUSAGE environment variables in appendix B.


4.9.1 Processor performance counters

Processor performance counters are configured in the <cpu> nodes. The user can configure many sets in the <cpu> node using the <set> node, but just one set will be used at any given time in a specific task. The <cpu> node supports the starting-set-distribution attribute with the following accepted values:

Each set contains a list of performance counters to be gathered at different instrumentation points (see sections 4.2, 4.5 and 4.8). If the tracing library is compiled to support PAPI, performance counters must be given using the canonical name (like PAPI_TOT_CYC and PAPI_L1_DCM) or the PAPI code in hexadecimal format (like 8000003b and 80000000, respectively). If the tracing library is compiled to support PMAPI, only one group identifier can be given per set, and it can be either the group name (like pm_basic and pm_hpmcount1) or the group number (like 6 and 22, respectively).

In the given example (which refers to PAPI support in the tracing library) two sets are defined. The first set will read PAPI_TOT_INS (total instructions), PAPI_TOT_CYC (total cycles) and PAPI_L1_DCM (1st level cache misses). The second set is configured to obtain PAPI_TOT_INS (total instructions), PAPI_TOT_CYC (total cycles) and PAPI_FP_INS (floating point instructions).

Additionally, if the underlying performance library supports sampling mechanisms, each set can be configured to gather information (see section 4.7) each time the specified counter reaches a specific value. The counter that is used for sampling must be present in the set. In the given example, the first set is enabled to gather sampling information every 100M cycles.

Furthermore, performance counters can be configured to report accounting on different basis depending on the domain attribute specified on each set. Available options are

In the given example, the first set is configured to count all the events that occurred, while the second one only counts those events that occurred while the application was running in user-space mode.

Finally, the instrumentation can change the active set in a manual or an automatic fashion. To change the active set manually, see the Extrae_previous_hwc_set and Extrae_next_hwc_set API calls in 5.1. To change the active set automatically, two options are allowed: based on time and based on application code. The former mechanism requires adding the changeat-time attribute, specifying the minimum time to hold the set. The latter requires adding the changeat-globalops attribute with a value; the tracing library will automatically change the active set when the application has executed as many MPI global operations as selected in that attribute. In any case, if either attribute is set to zero, the set will not be changed automatically.


4.9.2 Network performance counters

Network performance counters are only available on systems with Myrinet GM/MX networks, and they are fixed depending on the firmware used. Other systems, like BG/*, may provide some network performance counters, but they are accessed through the PAPI interface (see section 4.9 and the PAPI documentation).

If <network> is enabled the network performance counters appear at the end of the application run, giving a summary for the whole run.


4.9.3 Operating system accounting

Operating system accounting is obtained through the getrusage(2) system call when <resource-usage> is enabled. As with the network performance counters, these appear at the end of the application run, giving a summary for the whole run.


4.10 XML Section: Storage management

The instrumentation package can be instructed on what/where/how to produce the intermediate trace files. These are the available options:

<storage enabled="no">
  <trace-prefix enabled="yes">TRACE</trace-prefix>
  <size enabled="no">5</size>
  <temporal-directory enabled="yes">/scratch</temporal-directory>
  <final-directory enabled="yes">/gpfs/scratch/bsc41/bsc41273</final-directory>
  <gather-mpits enabled="no" />
</storage>

Such options refer to:


See EXTRAE_PROGRAM_NAME, EXTRAE_FILE_SIZE, EXTRAE_DIR, EXTRAE_FINAL_DIR and EXTRAE_GATHER_MPITS environment variables in appendix B.


4.11 XML Section: Buffer management

Modify the buffer management entry to tune the tracing buffer behavior.

<buffer enabled="yes">
  <size enabled="yes">150000</size>
  <circular enabled="no" />
</buffer>

By default (even if the enabled attribute is "no") the tracing buffer is set to 500k events (see section 4.6 for further information on the buffer in the CELL). If <size> is enabled, the tracing buffer will be set to the number of events indicated by this node. If the circular option is enabled, the buffer will be created as a circular buffer and it will be dumped only once with the last events generated by the tracing package.


See EXTRAE_BUFFER_SIZE environment variable in appendix B.


4.12 XML Section: Trace control

<trace-control enabled="yes">
  <file enabled="no" frequency="5M">/gpfs/scratch/bsc41/bsc41273/control</file>
  <global-ops enabled="no">10</global-ops>
  <remote-control enabled="yes">
    <mrnet enabled="yes" target="150" analysis="spectral" start-after="30">
      <clustering max_tasks="26" max_points="8000"/>
      <spectral min_seen="1" max_periods="0" num_iters="3" signals="DurBurst,InMPI"/>
    </mrnet>
    <signal enabled="no" which="USR1"/>
  </remote-control>
</trace-control>

This section groups together a set of options to limit/reduce the final trace size. There are three mechanisms, based on file existence, global operations executed, and external remote control procedures.

Regarding the file, the application starts with the tracing disabled, and it is turned on when a control file is created. Use the frequency property to choose how often this check must be done. If not supplied, it will be checked every 100 global operations on MPI_COMM_WORLD.

If the global-ops tag is enabled, the instrumentation package begins disabled and starts the tracing when the given number of global operations on MPI_COMM_WORLD has been executed.

The remote-control tag section allows configuring some external mechanisms to automatically control the tracing. Currently, there is only one option, which is built on top of MRNet and is based on clustering and spectral analysis to generate a small yet representative trace.

These are the options in the mrnet tag:

The clustering tag configures the clustering analysis parameters:

The spectral tag section configures the spectral analysis parameters:

A signal can be used to terminate the tracing when using the remote control. Available values can only be USR1/USR2. Some MPI implementations handle one of those, so check first which one is available to you. Set in the signal tag the signal code you want to use.


See EXTRAE_CONTROL_FILE, EXTRAE_CONTROL_GLOPS, EXTRAE_CONTROL_TIME environment variables in appendix B.


4.13 XML Section: Bursts

<bursts enabled="no">
  <threshold enabled="yes">500u</threshold>
  <mpi-statistics enabled="yes" />
  <pacx-statistics enabled="yes" />
</bursts>

If the user enables this option, the instrumentation library will only emit information about computation bursts (i.e., it does not trace MPI calls, the OpenMP runtime, and so on) when the current mode (set through initial-mode in 4.1) is bursts. The library will discard all those computation bursts that last less than the selected threshold.

In addition to that, when the tracing library is running in burst mode, it computes some statistics of MPI and PACX activity. Such statistics can be dumped in the tracefile by enabling mpi-statistics and pacx-statistics respectively.


See EXTRAE_INITIAL_MODE, EXTRAE_BURST_THRESHOLD, EXTRAE_MPI_STATISTICS and EXTRAE_PACX_STATISTICS environment variables in appendix B.


4.14 XML Section: Others

<others enabled="yes">
  <minimum-time enabled="no">10m</minimum-time>
</others>

This section contains other configuration details that do not fit in the previous sections. Right now, there is only one option available, which tells the instrumentation package the minimum instrumentation time. To enable it, set enabled to "yes" and set the minimum time within the minimum-time tag.


4.15 XML Section: Sampling

<sampling enabled="no" type="default" period="50m" variability="10m"/>

This section configures the time-based sampling capabilities. Every sample contains processor performance counters (if enabled in section 4.9.1 and either PAPI or PMAPI were enabled at configure time) and callstack information (if enabled in section 4.7 and the proper dependencies were set at configure time).

This section contains two attributes besides enabled. These are


See EXTRAE_SAMPLING_PERIOD, EXTRAE_SAMPLING_VARIABILITY, EXTRAE_SAMPLING_CLOCKTYPE and EXTRAE_SAMPLING_CALLER environment variables in appendix B.


4.16 XML Section: CUDA

<cuda enabled="yes" />

This section indicates whether the CUDA calls should be instrumented or not. If enabled is set to yes, CUDA calls will be instrumented, otherwise they will not be instrumented.


4.17 XML Section: OpenCL

<opencl enabled="yes" />

This section indicates whether the OpenCL calls should be instrumented or not. If enabled is set to yes, OpenCL calls will be instrumented, otherwise they will not be instrumented.


4.18 XML Section: Input/Output

<input-output enabled="no" />

This section indicates whether I/O calls (read and write) are meant to be instrumented. If enabled is set to yes, the aforementioned calls will be instrumented, otherwise they will not be instrumented.

Note: This is an experimental feature, and needs to be enabled at configure time using the --enable-instrument-io option.

Warning: This option seems to interfere with the instrumentation of the GNU and Intel OpenMP runtimes, and the issues have not been solved yet.


4.19 XML Section: Dynamic memory

<dynamic-memory enabled="no">
  <alloc enabled="yes" threshold="32768" />
  <free  enabled="yes" />
</dynamic-memory>

This section indicates whether dynamic memory calls (malloc, free, realloc) are meant to be instrumented. If enabled is set to yes, the aforementioned calls will be instrumented, otherwise they will not be instrumented. This section allows deciding whether allocation and free-related memory calls shall be instrumented. Additionally, the configuration can also indicate whether allocation calls should be instrumented if the requested memory size surpasses a given threshold (32768 bytes, in the example).

Note: This is an experimental feature, and needs to be enabled at configure time using the --enable-instrument-dynamic-memory option.

Warning: This option seems to interfere with the instrumentation of the Intel OpenMP runtime, and the issues have not been solved yet.


4.20 XML Section: Memory references through Intel PEBS sampling

<pebs-sampling enabled="yes">
  <loads  enabled="yes" period="1000000" minimum-latency="10" />
  <stores enabled="no"  period="1000000" />
</pebs-sampling>

This section tells Extrae to use the PEBS feature of recent Intel processors to sample memory references. These memory references capture the linear address referenced, the component of the memory hierarchy that solved the reference and the number of cycles needed to solve the reference. In the example above, PEBS monitors one out of every million load instructions and only grabs those that require at least 10 cycles to be solved.

Note: This is an experimental feature, and needs to be enabled at configure time using the --enable-pebs-sampling option.


4.21 XML Section: Merge

<merge enabled="yes" 
  synchronization="default"
  binary="mpi_ping"
  tree-fan-out="16"
  max-memory="512"
  joint-states="yes"
  keep-mpits="yes"
  sort-addresses="yes"
  overwrite="yes"
>
  mpi_ping.prv 
</merge>

If this section is enabled and the instrumentation package is configured to support it, the merge process will be automatically invoked after the application run. The merge process will use all the resources devoted to running the application.

In the example given, the leaf of this node will be used as the tracefile name (mpi_ping.prv in this example). The currently available options for the merge process are given as attributes of the <merge> node.

In Linux systems, the tracing package can take advantage of certain functionalities of the system to guess the binary name, and from it the tracefile name. In such systems, you can use the following reduced XML section instead of the earlier one.

<merge enabled="yes" 
  synchronization="default"
  tree-fan-out="16"
  max-memory="512"
  joint-states="yes"
  keep-mpits="yes"
  sort-addresses="yes"
  overwrite="yes"
/>


For further references, see chapter 6.


4.22 Using environment variables within the XML file

XML tags and attributes can refer to environment variables that are defined in the environment during the application run. If you want to refer to an environment variable within the XML file, just enclose the name of the variable using the dollar symbol ($), for example: $FOO$.

Note that the user has to put either a specific value or a reference to an environment variable, which means that expanding environment variables within text is not allowed as in a regular shell (i.e., the instrumentation package will not convert the following text: bar$FOO$bar).
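
For instance, a sketch of an XML excerpt using this mechanism (FINAL_DIR is a hypothetical environment variable assumed to be defined at run time) could be:

<final-directory enabled="yes">$FINAL_DIR$</final-directory>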


5. Extrae API

There are two API levels in the Extrae instrumentation package. The basic API refers to the basic functionality provided and includes emitting events, source code tracking, changing the instrumentation mode, and so on. The extended API is an experimental addition that provides several features of the basic API within single, more powerful calls using specific data structures.


5.1 Basic API

The following routines are defined in ${EXTRAE_HOME}/include/extrae_user_events.h. These routines are intended to be called by C/C++ programs. The instrumentation package also provides bindings for Fortran applications. The Fortran API bindings have the same names as the C API but honor the Fortran compiler function name mangling scheme. To use the API in Fortran applications you must use the module provided in ${EXTRAE_HOME}/include/extrae_module.f through the use language clause. This module provides the appropriate function and constant declarations for Extrae.
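
As an illustration, a minimal C sketch emitting a pair of user events around a region of interest could look as follows (the event type 1000 and its values are arbitrary choices for this example):

#include "extrae_user_events.h"

int main (void)
{
  Extrae_init ();           /* initialize the instrumentation library */

  Extrae_event (1000, 1);   /* entering the region of interest */
  /* ... application code to be measured ... */
  Extrae_event (1000, 0);   /* leaving the region of interest */

  Extrae_fini ();           /* finalize the instrumentation and dump the buffers */
  return 0;
}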


5.2 Extended API

NOTE: This API is in experimental stage and it is only available in C. Use it at your own risk!

The extended API makes use of two special structures located in ${PREFIX}/include/extrae_types.h. The structures are extrae_UserCommunication and extrae_CombinedEvents. The former is intended to encode an event that will be converted into a Paraver communication when its partner equivalent event has been found. The latter is used to generate events containing multiple kinds of information at the same time.

struct extrae_UserCommunication
{
  extrae_user_communication_types_t type;
  extrae_comm_tag_t tag;
  unsigned size; /* size_t? */
  extrae_comm_partner_t partner;
  extrae_comm_id_t id;
};

The structure extrae_UserCommunication contains the following fields:

struct extrae_CombinedEvents
{
  /* These are used as boolean values */
  int HardwareCounters;
  int Callers;
  int UserFunction;
  /* These are intended for N events */
  unsigned nEvents;
  extrae_type_t  *Types;
  extrae_value_t *Values;
  /* These are intended for user communication records */
  unsigned nCommunications;
  extrae_user_communication_t *Communications;
};

The structure extrae_CombinedEvents contains the following fields:

The extended API contains the following routines:
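
For illustration purposes only, a sketch of emitting a combined event might look as follows. This assumes the Extrae_init_CombinedEvents and Extrae_emit_CombinedEvents routines of the extended API; check extrae_types.h and your installation's headers, as this experimental API may change between releases:

#include "extrae_user_events.h"
#include "extrae_types.h"

void emit_region_event (void)
{
  struct extrae_CombinedEvents ce;
  extrae_type_t  types[1]  = { 1000 }; /* arbitrary event type for this example */
  extrae_value_t values[1] = { 1 };    /* arbitrary event value */

  Extrae_init_CombinedEvents (&ce);    /* assumed helper that resets all fields */
  ce.HardwareCounters = 1;             /* also read the hardware counters */
  ce.nEvents = 1;
  ce.Types   = types;
  ce.Values  = values;

  Extrae_emit_CombinedEvents (&ce);    /* assumed emitting routine */
}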

5.3 Special considerations for Cell Broadband Engine tracing package

Instead of including ${EXTRAE_HOME}/include/extrae_user_events.h include:


5.3.1 PPE side

The routines shown in section 5.1 are available for the PPE element. In addition, two routines are available to control the creation and finalization of the SPE threads. These routines are:


5.3.2 SPE side

Due to the lack of parallel paradigms and hardware counters inside the SPE element, the SPE tracing library is a subset of the typical tracing library. The following API calls are available for the SPE element:


5.4 Java bindings

If Java is enabled at configure time, a basic instrumentation library for serial applications based on JNI bindings to Extrae will be installed. The current bindings are within the package es.bsc.cepbatools.extrae, and the following bindings are provided:


5.4.1 Advanced Java bindings

Since Extrae does not have features to automatically discover the identifiers of the threads that run within the virtual machine, there are some calls that allow doing this manually. These calls are, however, intended for expert users and should be avoided whenever possible because their behavior may be highly modified, or even removed, in future releases.


6. Merging process

Once the application has finished, and if the automatic merge process is not set up, the merge must be executed manually. Here we detail how to run the merge process manually.

The probes inserted into the instrumented binary are responsible for gathering performance metrics of each task/thread, and for each of them several files are created in the locations given in the XML configuration file (see section 4.10). Such files are:

In order to use Paraver, those intermediate files (i.e., .mpit files) must be merged and translated into the Paraver trace file format. The same applies if the user wants to use the Dimemas simulator. To proceed with any of these translations, all the intermediate trace files must be merged into a single trace file using one of the available mergers in the bin directory (see table 6.1).

The target trace type is defined in the XML configuration file used at the instrumentation step (see section 4.1), and it has to match the merger used (mpi2prv and mpimpi2prv for Paraver, and mpi2dim and mpimpi2dim for Dimemas). However, it is possible to force the format regardless of the selection done in the XML file by using the parameters -paraver or -dimemas.


Table: Description of the available mergers in the Extrae package.

Binary      Description
mpi2prv     Sequential version of the Paraver merger.
mpi2dim     Sequential version of the Dimemas merger.
mpimpi2prv  Parallel version of the Paraver merger.
mpimpi2dim  Parallel version of the Dimemas merger.
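
For instance, a Dimemas trace may be obtained with the sequential Dimemas merger as follows (the file names are those used in section 1.3 and serve as placeholders):

               ${EXTRAE_HOME}/bin/mpi2dim -f TRACE.mpits -o output.dim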


6.1 Paraver merger

As stated before, there are two Paraver mergers: mpi2prv and mpimpi2prv. The former is for use in a single processor mode while the latter is meant to be used with multiple processors using MPI (and cannot be run using one MPI task).
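
As a sketch, launching the parallel merger through an MPI launcher may look as follows (the mpirun launcher and the number of tasks are placeholders for your environment; remember that mpimpi2prv cannot be run with a single MPI task):

               mpirun -np 4 ${EXTRAE_HOME}/bin/mpimpi2prv -f TRACE.mpits -o output.prv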

The Paraver merger receives a set of intermediate trace files and generates three files with the same name (which is set with the -o option) but different extensions: the Paraver trace itself (.prv file), which contains timestamped records that represent the information gathered during the execution of the instrumented application; the Paraver Configuration File (.pcf file), which is responsible for translating the values contained in the Paraver trace into more human-readable values; and a file containing the distribution of the application across the cluster computation resources (.row file).

The following sections describe the available options for the Paraver mergers. Typically, options available for single processor mode are also available in the parallel version, unless specified.

6.1.1 Sequential Paraver merger

These are the available options for the sequential Paraver merger:

6.1.2 Parallel Paraver merger

These options are specific to the parallel version of the Paraver merger:

6.2 Dimemas merger

As stated before, there are two Dimemas mergers: mpi2dim and mpimpi2dim. The former is for use in a single processor mode while the latter is meant to be used with multiple processors using MPI.

In contrast with the Paraver merger, the Dimemas mergers generate a single output file with the .dim extension from the given intermediate trace files, suitable for the Dimemas simulator.

These are the available options for both Dimemas mergers:

6.3 Environment variables

There are some environment variables related to the merge process.

6.3.1 Environment variables suitable to Paraver merger

6.3.1.1 EXTRAE_LABELS

This environment variable lets the user add custom information to the generated Paraver Configuration File (.pcf). Just set this variable to point to a file containing labels for the unknown (user) events.

The format for the file is:

  EVENT_TYPE
  0 [type1] [label1]
  0 [type2] [label2]
  ...
  0 [typeK] [labelK]

Where [typeN] is the event type and [labelN] is the description for the event with type [typeN]. It is also possible to link both type and value of an event:

  EVENT_TYPE
  0 [type] [label]
  VALUES
  [value1] [label1]
  [value2] [label2]
  ...
  [valueN] [labelN]

With this information, Paraver can deal with both type and value when giving textual information to the end user. If Paraver does not find any information for an event/type, it will be shown in numerical form.
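
As an illustration, a labels file for a hypothetical user event of type 1000 with two values could look like:

  EVENT_TYPE
  0 1000 Region of interest
  VALUES
  0 End
  1 Begin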

6.3.1.2 MPI2PRV_TMP_DIR

Points to a directory where all intermediate temporary files will be stored. These files will be removed as soon as the application ends.
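
For example, to place the temporary files on a hypothetical scratch file system:

               export MPI2PRV_TMP_DIR=/scratch/tmp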

6.3.2 Environment variables suitable to Dimemas merger

6.3.2.1 MPI2DIM_TMP_DIR

Points to a directory where all intermediate temporary files will be stored. These files will be removed as soon as the application ends.

7. Extrae On-line User Guide

7.1 Introduction

Extrae On-line is a new module developed for the Extrae tracing toolkit, available from version 3.0, that incorporates intelligent monitoring, analysis and selection of the traced data. This tracing setup is tailored towards long executions that are producing large traces. Applying automatic analysis techniques based on clustering, signal processing and active monitoring, Extrae gains the ability to inspect and filter the data while it is being collected to minimize the amount of data emitted into the trace, while maximizing the amount of relevant information presented.

Extrae On-line has been developed on top of Synapse, a framework that facilitates the deployment of applications that follow the master/slave architecture based on the MRNet software overlay network. Thanks to its modular design, new types of automatic analyses can be added very easily as new plug-ins into the on-line tracing system, just by defining new Synapse protocols.

This document briefly describes the main features of the Extrae On-line module, and shows how it has to be configured and the different options available.

7.2 Automatic analyses

Extrae On-line currently supports three types of automatic analyses: fine-grain structure detection based on clustering techniques, periodicity detection based on signal processing techniques, and multi-experiment analysis based on active monitoring techniques. Extrae On-line has to be configured to apply one of these types of analyses, and then the analysis will be performed periodically as new data is being traced.

7.2.1 Structure detection

This mechanism aims at identifying the fine-grain structure of the computing regions of the program. Applying density-based clustering, this method is able to expose the main performance trends in the computations, and this information is useful to focus the analysis on the zones of real interest. To perform the cluster analysis, Extrae On-line relies on the ClusteringSuite tool.

At each phase of analysis, several outputs are produced:

Subsequent clustering results can be used to study the evolution of the application over time. In order to study how the clusters are evolving, the xtrack tool can be used.

7.2.2 Periodicity detection

This mechanism allows detecting iterative patterns over a wide region of time, and precisely delimiting where the iterations start. Once a period has been found, those iterations presenting fewer perturbations are selected to produce a representative trace, and the rest of the data is basically discarded. The result of applying this mechanism is a compact trace where only the representative iterations are traced in full detail, while for the rest of the execution we can optionally keep summarized information in the form of phase profiles or a ``low resolution'' trace.

Please note that applying this technique to a very short execution, or if no periodicity can be detected in the application, may result in an empty trace depending on the configuration options selected (see Section 7.3).

7.2.3 Multi-experiment analysis

This mechanism employs active measurement techniques in order to simulate different execution scenarios within the same execution. Extrae On-line is able to add controlled interference into the program to simulate different computation loads, network bandwidth and memory congestion, and even to tune some configurations of the parallel runtime (currently the MPI Dynamic Load Balance (DLB) runtime is supported). Then, the application behavior can be studied under different circumstances, and tracking can be used to analyze the impact of these configurations on the program performance. This technique aims at reducing the number of executions necessary to evaluate different parameters and characteristics of your program.


7.3 Configuration

In order to activate the On-line tracing mode, the user has to enable the corresponding configuration section in the Extrae XML configuration file. This section is found under trace-control > remote-control > online. The default configuration is already ready to use:

<online enabled="yes" 
        analysis="clustering" 
        frequency="auto" 
        topology="auto">

The available options for the <online> section are the following:

Depending on the analysis selected, the following specific options become available.

7.3.1 Clustering analysis options

<clustering config="cl.I.IPC.xml"/>

7.3.2 Spectral analysis options

<spectral max_periods="0" num_iters="3" min_seen="0" min_likeness="85">
   <spectral_advanced enabled="no" burst_threshold="80">
      <periodic_zone     detail_level="profile"/>                 
      <non_periodic_zone detail_level="bursts" min_duration="3s"/> 
   </spectral_advanced>
</spectral>

The basic configuration options for the spectral analysis are the following:

Also, some advanced settings are tunable in the <spectral_advanced> section:

7.3.3 Gremlins analysis options

<gremlins start="0" increment="2" roundtrip="no" loop="no"/>


8. Examples

We present here three different examples of generating a Paraver tracefile. The first example requires the package to be compiled with the DynInst libraries. The second example uses the LD_PRELOAD or LDR_PRELOAD[64] mechanism to interpose code in the application; such a mechanism is available in the Linux and FreeBSD operating systems and only works when the application uses dynamic libraries. Finally, there is an example using the static library of the instrumentation package.


8.1 DynInst based examples

DynInst is a third-party instrumentation library developed at UW Madison which can instrument in-memory binaries. It adds the flexibility to instrument the application without modifying the source code. DynInst is ported to different systems (Linux, FreeBSD) and to different architectures (x86, x86/64, PPC32, PPC64), but the functionality is common to all of them.


8.1.1 Generating intermediate files for serial or OpenMP applications

run_dyninst.sh:
#!/bin/sh

export EXTRAE_HOME=WRITE-HERE-THE-PACKAGE-LOCATION
export LD_LIBRARY_PATH=${EXTRAE_HOME}/lib
source ${EXTRAE_HOME}/etc/extrae.sh

## Run the desired program
${EXTRAE_HOME}/bin/extrae -config extrae.xml $*

A similar script can be found in the share/example/SEQ directory of your tracing package. Just tune the EXTRAE_HOME environment variable and make the script executable (using chmod u+x). Alternatively, you can pass the XML configuration file through EXTRAE_CONFIG_FILE if you prefer. Line no. 5 is responsible for loading all the environment variables needed for the DynInst launcher (called extrae) that is invoked in line 8.

In fact, there are two examples provided in share/example/SEQ: one for static (or manual) instrumentation and another for the DynInst-based instrumentation. When using the DynInst instrumentation, the user may add new routines to instrument through the existing function-list file that is already pointed to by the extrae.xml configuration file. To specify the routines to instrument, add one line with the name of each routine to be instrumented, as in the sketch below.
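For illustration, a hypothetical function-list file contains one routine name per line (compute_forces and update_positions are placeholder names; use the routines of your own application):

compute_forces
update_positions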

Running OpenMP applications using DynInst is rather similar to serial codes. Just compile the application with the appropriate OpenMP flags and run it as before, as sketched below. You can find an example in the share/example/OMP directory.
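For instance, with the GNU compiler the whole process could look like the following sketch (my_omp_app.c is a placeholder name for your own source file; run_dyninst.sh is the script shown above):

gcc -fopenmp my_omp_app.c -o my_omp_app
./run_dyninst.sh ./my_omp_app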


8.1.2 Generating intermediate files for MPI applications

MPI applications can also be instrumented using the DynInst instrumenter. The instrumentation is done independently for each spawned MPI process, so in order to execute the DynInst-based instrumentation package on an MPI application, you must be sure that your MPI launcher supports running shell scripts. The following scripts show how to run the DynInst instrumenter from the MOAB/Slurm queue system. The first script just sets the environment for the job, whereas the second is responsible for instrumenting every spawned task.

[slurm_trace.sh]
#!/bin/bash
# @ initialdir = .
# @ output = trace.out
# @ error =  trace.err
# @ total_tasks = 4
# @ cpus_per_task = 1
# @ tasks_per_node = 4
# @ wall_clock_limit = 00:10:00
# @ tracing = 1

srun ./run.sh ./mpi_ping

The most important part of the previous script is line 11, which is responsible for spawning the MPI tasks (using the srun command). The spawn method is told to execute ./run.sh ./mpi_ping, which instruments the mpi_ping binary using the run.sh script. You must adapt this file to your queue system (if any) and to your MPI submission mechanism (i.e., change srun to mpirun, mpiexec, poe, etc.). Note that changing line 11 to read ./run.sh srun ./mpi_ping would result in instrumenting the srun application, not mpi_ping.

[run.sh]
#!/bin/bash

export EXTRAE_HOME=@sub_PREFIXDIR@
source ${EXTRAE_HOME}/etc/extrae.sh

# Only show output for task 0; other tasks send their output to /dev/null
if test "${SLURM_PROCID}" == "0" ; then
  ${EXTRAE_HOME}/bin/extrae -config ../extrae.xml $@ > job.out 2> job.err
else
  ${EXTRAE_HOME}/bin/extrae -config ../extrae.xml $@ > /dev/null 2> /dev/null
fi

This is the script responsible for instrumenting a single MPI task. In line 4 we set up the instrumentation environment by executing the commands from extrae.sh. Then we execute the binary passed to the run.sh script in lines 8 and 10. Both lines execute the same command, except that line 8 sends the output to two files (one for standard output and another for standard error) whereas line 10 sends all the output to /dev/null.

Please note that this script is particularly adapted to the MOAB/Slurm queue systems. You may need to adapt it to other systems by using the appropriate environment variables. In particular, SLURM_PROCID identifies the MPI task id (i.e., the task rank) and may need to be changed to the proper environment variable (PMI_RANK on ParaStation/Torque/MOAB systems or MXMPI_ID on systems with Myrinet MX devices, for example), as sketched below.
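For instance, on a ParaStation/Torque/MOAB system the rank test in run.sh might read as follows (a sketch that assumes PMI_RANK holds the task rank on your installation):

if test "${PMI_RANK}" == "0" ; then
  ${EXTRAE_HOME}/bin/extrae -config ../extrae.xml $@ > job.out 2> job.err
else
  ${EXTRAE_HOME}/bin/extrae -config ../extrae.xml $@ > /dev/null 2> /dev/null
fi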


8.2 LD_PRELOAD based examples

The LD_PRELOAD (or LDR_PRELOAD[64] on AIX) interposition mechanism only works for binaries that are linked against shared libraries. The interposition is done by the runtime loader, which substitutes the original symbols with those provided by the instrumentation package. This mechanism is known to work on the Linux, FreeBSD and AIX operating systems; although it may be available on other operating systems (possibly under different names8.2), those have not been tested.

We show how this mechanism works on Linux (or similar environments) in subsection 8.2.1 and on AIX in subsection 8.2.3.


8.2.1 Linux

The following script preloads the libmpitrace library to instrument MPI calls of the application passed as an argument (tune EXTRAE_HOME according to your installation).

[trace.sh]
#!/bin/sh

export EXTRAE_HOME=WRITE-HERE-THE-PACKAGE-LOCATION
export EXTRAE_CONFIG_FILE=extrae.xml
export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so

## Run the desired program
$*

The previous script can be found in the share/example/MPI/ld-preload directory of your tracing package. Copy the script to one of your directories, tune the EXTRAE_HOME environment variable and make the script executable (using chmod u+x). Also copy the XML configuration file extrae.xml from the share/example/MPI directory of the instrumentation package to the current directory. This file configures the whole behavior of the instrumentation package (there is more information about the XML file in chapter 4). The last line in the script, $*, executes the arguments given to the script, so you can enable the instrumentation by simply prepending the script to your execution command.

Regarding the execution, if you run MPI applications from the command-line, you can issue the typical mpirun command as:


${MPI_HOME}/bin/mpirun -np N ./trace.sh mpi-app

where ${MPI_HOME} is the directory of your MPI installation, N is the number of MPI tasks you want to run, and mpi-app is the binary of the MPI application you want to run.
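For instance, to run the mpi_ping example on 4 tasks the invocation could look as follows (a sketch; adapt the paths to your installation):

${MPI_HOME}/bin/mpirun -np 4 ./trace.sh ./mpi_ping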

However, if you execute your MPI applications through a queue system you may need to write a submission script. The following script is an example of a submission script for MOAB/Slurm queuing system using the aforementioned trace.sh script for an execution of the mpi-app on two processors.

[slurm-trace.sh]
#! /bin/bash
#@ job_name         = trace_run
#@ output           = trace_run%j.out
#@ error            = trace_run%j.out
#@ initialdir       = .
#@ class            = bsc_cs
#@ total_tasks      = 2
#@ wall_clock_limit = 00:30:00

srun ./trace.sh mpi_app

If your system uses LoadLeveler your job script may look like:

[ll.sh]
#! /bin/bash
#@ job_type = parallel
#@ output = trace_run.ouput
#@ error = trace_run.error
#@ blocking = unlimited
#@ total_tasks = 2
#@ class = debug
#@ wall_clock_limit = 00:10:00
#@ restart = no
#@ group = bsc41 
#@ queue

export MLIST=/tmp/machine_list_$$
/opt/ibmll/LoadL/full/bin/ll_get_machine_list > ${MLIST}
NP=$(cat ${MLIST} | wc -l)

${MPI_HOME}/mpirun -np ${NP} -machinefile ${MLIST} ./trace.sh ./mpi-app

rm ${MLIST}

Besides the job specification given in lines 1-11, there are commands of particular interest. Lines 13-15 are used to determine which and how many nodes are involved in the computation; this information is given to the mpirun command to proceed with the execution. Once the execution has finished, the temporary file created on line 14 is removed on line 19.


8.2.2 CUDA

There are two ways to instrument CUDA applications, depending on how the package was configured. If the package was configured with --enable-cuda, only interposition on binaries using shared libraries is available. If the package was configured with --with-cupti, any kind of binary can be instrumented because the instrumentation relies on the CUPTI library to intercept CUDA calls. The example shown below is intended for the former case.

[run.sh]
#!/bin/bash

export EXTRAE_HOME=/home/harald/extrae
export PAPI_HOME=/home/harald/aplic/papi/4.1.4

EXTRAE_CONFIG_FILE=extrae.xml LD_LIBRARY_PATH=${EXTRAE_HOME}/lib:${PAPI_HOME}/lib:${LD_LIBRARY_PATH} ./hello
${EXTRAE_HOME}/bin/mpi2prv -f TRACE.mpits -e ./hello

In this example, the hello application is compiled using the nvcc compiler and linked against the -lcudatrace library (a compilation sketch follows below). The binary contains calls to Extrae_init and Extrae_fini and then executes a CUDA kernel. Line 6 refers to the execution of the application itself; the Extrae configuration file and the location of the shared libraries are set on this line. Line 7 invokes the merge process to generate the final tracefile.
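For reference, the compilation of such a binary could look roughly like the following sketch (hello.cu is a placeholder name for the source file; adapt the paths to your installation):

nvcc hello.cu -o hello -L${EXTRAE_HOME}/lib -lcudatrace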


8.2.3 AIX

AIX typically ships with POE and LoadLeveler as the MPI implementation and queue system, respectively. An example for a system with these software packages is given below. Please note that the example is intended for 64-bit applications; if using 32-bit applications, change LDR_PRELOAD64 to LDR_PRELOAD.

[ll-aix64.sh]
#@ job_name = basic_test
#@ output = basic_stdout
#@ error = basic_stderr
#@ shell = /bin/bash
#@ job_type = parallel
#@ total_tasks = 8
#@ wall_clock_limit = 00:15:00
#@ queue

export EXTRAE_HOME=WRITE-HERE-THE-PACKAGE-LOCATION
export EXTRAE_CONFIG_FILE=extrae.xml
export LDR_PRELOAD64=${EXTRAE_HOME}/lib/libmpitrace.so

./mpi-app

Lines 1-8 contain a basic LoadLeveler job definition. Line 10 sets the Extrae package directory in the EXTRAE_HOME environment variable. Line 11 sets the XML configuration file that will be used to set up the tracing, and line 12 sets LDR_PRELOAD64, which enables the instrumentation through the shared library libmpitrace.so. Finally, line 14 executes the application binary.


8.3 Statically linked based examples

This is the basic instrumentation method, suited for those installations that support neither DynInst nor LD_PRELOAD, or for applications that require adding manual calls to the Extrae API.


8.3.1 Linking the application

To get the instrumentation working in your code, you first have to link your application with the Extrae libraries. There are examples installed in your package distribution under the share/example directory; there you can find MPI, OpenMP, pthread and sequential examples, depending on the support selected at configure time.

Consider the example Makefile found in share/example/MPI/static:

[Makefile]
MPI_HOME = /gpfs/apps/MPICH2/mx/1.0.7..2/64
EXTRAE_HOME = /home/bsc41/bsc41273/foreign-pkgs/extrae-11oct-mpich2/64
PAPI_HOME = /gpfs/apps/PAPI/3.6.2-970mp-patched/64
XML2_LDFLAGS = -L/usr/lib64
XML2_LIBS = -lxml2

F77 = $(MPI_HOME)/bin/mpif77 
FFLAGS = -O2
FLIBS = $(EXTRAE_HOME)/lib/libmpitracef.a \
        -L$(PAPI_HOME)/lib -lpapi -lperfctr \
        $(XML2_LDFLAGS) $(XML2_LIBS)

all: mpi_ping

mpi_ping: mpi_ping.f
	$(F77) $(FFLAGS) mpi_ping.f $(FLIBS) -o mpi_ping

clean:
	rm -f mpi_ping *.o pingtmp? TRACE.*

Lines 2-5 define some Makefile variables that set up the location of the different packages needed by the instrumentation. In particular, EXTRAE_HOME sets where the Extrae package is located. In order to link your application with Extrae you have to add its libraries at the link stage (see lines 9-11 and 16). Besides libmpitracef.a we also add the PAPI library (-lpapi) and its dependency (-lperfctr, which you may or may not need), the libxml2 parsing library (-lxml2) and, finally, the bfd and liberty libraries (-lbfd and -liberty) if the instrumentation package was compiled to support merging after tracing (see chapter 3 for further information).
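For reference, the expanded link command issued by the Makefile above would look roughly like the following sketch (add -lbfd and -liberty only if your installation was configured to support merging after tracing):

${MPI_HOME}/bin/mpif77 -O2 mpi_ping.f ${EXTRAE_HOME}/lib/libmpitracef.a \
    -L${PAPI_HOME}/lib -lpapi -lperfctr -L/usr/lib64 -lxml2 \
    -lbfd -liberty -o mpi_ping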


8.3.2 Generating the intermediate files

Executing an application with the statically linked version of the instrumentation package is very similar to the method shown in section 8.2. There is, however, a difference: do not set LD_PRELOAD in trace.sh.

[trace.sh]
#!/bin/sh

export EXTRAE_HOME=WRITE-HERE-THE-PACKAGE-LOCATION
export EXTRAE_CONFIG_FILE=extrae.xml
export LD_LIBRARY_PATH=${EXTRAE_HOME}/lib:\
                       /gpfs/apps/MPICH2/mx/1.0.7..2/64/lib:\
                       /gpfs/apps/PAPI/3.6.2-970mp-patched/64/lib

## Run the desired program
$*

See section 8.2 for how to run this script, either from the command line or through queue systems.


8.4 Generating the final tracefile

Independently of the tracing method chosen, it is necessary to translate the intermediate tracefiles into a Paraver tracefile. The Paraver tracefile can be generated automatically (if the tracing package and the XML configuration file were set up accordingly, see chapters 3 and 4) or manually. When using the automatic merging process, all the resources allocated for the application will be used to perform the merge once the application ends.

To manually generate the final Paraver tracefile issue the following command:


${EXTRAE_HOME}/bin/mpi2prv -f TRACE.mpits -e mpi-app -o trace.prv

This command converts the intermediate files generated in the previous step into a single Paraver tracefile. TRACE.mpits is a file generated automatically by the instrumentation; it contains references to all the intermediate files generated during the execution. The -e parameter receives the application binary mpi-app in order to perform translations from addresses to source code; to use this feature, the binary must have been compiled with debugging information. Finally, the -o flag tells the merger how the resulting Paraver tracefile will be named (trace.prv in this case).
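Note that, along with trace.prv, the merge process also emits the companion trace.pcf and trace.row files, which Paraver needs in order to interpret the tracefile.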


A. An example of Extrae XML configuration file

<?xml version='1.0'?>

<trace enabled="yes"
 home="@sed_MYPREFIXDIR@"
 initial-mode="detail"
 type="paraver"
 xml-parser-id="@sed_XMLID@"
>
  <mpi enabled="yes">
    <counters enabled="yes" />
  </mpi>

  <pacx enabled="no">
    <counters enabled="yes" />
  </pacx>

  <pthread enabled="yes">
    <locks enabled="no" />
    <counters enabled="yes" />
  </pthread>

  <openmp enabled="yes">
    <locks enabled="no" />
    <counters enabled="yes" />
  </openmp>

  <callers enabled="yes">
    <mpi enabled="yes">1-3</mpi>
    <pacx enabled="no">1-3</pacx>
    <sampling enabled="no">1-5</sampling>
  </callers>

  <user-functions enabled="no"
    list="/home/bsc41/bsc41273/user-functions.dat"
    exclude-automatic-functions="no">
    <counters enabled="yes" />
  </user-functions>

  <counters enabled="yes">
    <cpu enabled="yes" starting-set-distribution="1">
      <set enabled="yes" domain="all" changeat-globalops="5">
        PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_L1_DCM
        <sampling enabled="no" period="100000000">PAPI_TOT_CYC</sampling>
      </set>
      <set enabled="yes" domain="user" changeat-globalops="5">
        PAPI_TOT_INS,PAPI_FP_INS,PAPI_TOT_CYC
      </set>
    </cpu>
    <network enabled="yes" />
    <resource-usage enabled="yes" />
  </counters>

  <storage enabled="no">
    <trace-prefix enabled="yes">TRACE</trace-prefix>
    <size enabled="no">5</size>
    <temporal-directory enabled="yes">/scratch</temporal-directory>
    <final-directory enabled="yes">/gpfs/scratch/bsc41/bsc41273</final-directory>
    <gather-mpits enabled="no" />
  </storage>

  <buffer enabled="yes">
    <size enabled="yes">150000</size>
    <circular enabled="no" />
  </buffer>

  <trace-control enabled="yes">
    <file enabled="no" frequency="5M">/gpfs/scratch/bsc41/bsc41273/control</file>
    <global-ops enabled="no">10</global-ops>
    <remote-control enabled="yes">
      <mrnet enabled="yes" target="150" analysis="spectral" start-after="30">
        <clustering max_tasks="26" max_points="8000"/>
        <spectral min_seen="1" max_periods="0" num_iters="3" signals="DurBurst,InMPI"/>
      </mrnet>
      <signal enabled="no" which="USR1"/>
    </remote-control>
  </trace-control> 

  <others enabled="yes">
    <minimum-time enabled="no">10m</minimum-time>
  </others>

  <bursts enabled="no">
    <threshold enabled="yes">500u</threshold>
    <mpi-statistics enabled="yes" />
    <pacx-statistics enabled="no" />
  </bursts>

  <cell enabled="no">
    <spu-file-size enabled="yes">5</spu-file-size>
    <spu-buffer-size enabled="yes">64</spu-buffer-size>
    <spu-dma-channel enabled="yes">2</spu-dma-channel>
  </cell>

  <sampling enabled="no" type="default" period="50m" variability="10m"/>

  <opencl enabled="no" />

  <cuda enabled="no" />

  <merge enabled="yes" 
    synchronization="default"
    binary="mpi_ping"
    tree-fan-out="16"
    max-memory="512"
    joint-states="yes"
    keep-mpits="yes"
    sort-addresses="yes"
    overwrite="yes"
  >
    mpi_ping.prv 
  </merge>

</trace>


B. Environment variables

Although Extrae is configured through an XML file (pointed to by the EXTRAE_CONFIG_FILE environment variable), it also supports minimal configuration via environment variables for those systems that lack the library responsible for parsing the XML files (i.e., libxml2).

This appendix presents and describes the environment variables the Extrae package uses if EXTRAE_CONFIG_FILE is not set. Environment variables that refer to XML 'enabled' attributes (i.e., those that can be set to "yes" or "no") are considered to be enabled if their value is defined to 1.
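As an illustration, a minimal XML-less setup could look like the following sketch (it assumes EXTRAE_ON is the variable that toggles the instrumentation in this version; verify the exact variable names against the table below):

export EXTRAE_HOME=WRITE-HERE-THE-PACKAGE-LOCATION
export EXTRAE_ON=1          # enable the instrumentation (assumed variable name)
export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
./mpi-app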


Table: Environment variables available to configure Extrae
Table: Environment variables available to configure Extrae (continued)


C. Frequently Asked Questions

C.1 Configure, compile and link FAQ

C.2 Execution FAQ

C.3 Performance monitoring counters FAQ

C.4 Merging traces FAQ


D. Instrumented run-times


D.1 MPI

These are the instrumented MPI routines in the Extrae package:


D.2 OpenMP

D.2.1 Intel compilers - icc, iCC, ifort

The instrumentation of the Intel OpenMP runtime for versions 8.1 to 10.1 is only available using the Extrae package based on DynInst library.

These are the instrumented routines of the Intel OpenMP runtime when using DynInst:

The instrumentation of the Intel OpenMP runtime for versions 11.0 to 12.0 is available using the Extrae package based on the LD_PRELOAD mechanism and also on DynInst. The instrumented routines include:

D.2.2 IBM compilers - xlc, xlC, xlf

Extrae supports IBM OpenMP runtime 1.6.

These are the instrumented routines of the IBM OpenMP runtime:

D.2.3 GNU compilers - gcc, g++, gfortran

Extrae supports GNU OpenMP runtime 4.2.

These are the instrumented routines of the GNU OpenMP runtime:


D.3 pthread

These are the instrumented routines of the pthread runtime:


D.4 CUDA

These are the instrumented CUDA routines in the Extrae package:

The CUDA accelerators do not have memory for the tracing buffers, so the tracing buffer resides on the host side. Typically, the CUDA tracing buffer is flushed at cudaThreadSynchronize, cudaStreamSynchronize and cudaMemcpy calls, so it is possible that the tracing buffer for the device gets filled if no calls to these routines are executed.


D.5 OpenCL

These are the instrumented OpenCL routines in the Extrae package:

The OpenCL accelerators have small amounts of memory, so the tracing buffer resides on the host side. Typically, the accelerator tracing buffer is flushed at each clFinish call, so it is possible that the tracing buffer for the accelerator gets filled if no calls to this routine are executed. However, if the OpenCL command queue being operated is not tagged as Out-of-Order, flushes will also happen at clEnqueueReadBuffer, clEnqueueReadBufferRect and clEnqueueMapBuffer calls if their corresponding blocking parameter is set to true.



Footnotes

1.1 See section 4 for further details regarding this file.
1.2 If the application is written in Fortran, append an f to the library name. For example, to instrument a Fortran application that uses MPI, use libmpitracef instead of libmpitrace.
4.1 More information is available on the PAPI website, http://icl.cs.utk.edu/papi. Extrae requires at least PAPI 3.x.
4.2 PMAPI is only available for the AIX operating system, and it has been part of the base operating system since AIX 5.3. Extrae requires at least AIX 5.3.
4.3 Some architectures do not allow grouping certain performance counters in the same set.
4.4 Each group contains several performance counters.
4.5 This check is done each time the buffer is flushed, so the resulting size of the intermediate trace file also depends on the number of elements contained in the tracing buffer (see section 4.11).
4.6 Check for availability on your system by looking for pebs in /proc/cpuinfo.
6.1 The timing mechanisms differ between Paraver and Dimemas at the instrumentation level. If the output trace format does not correspond with that selected in the XML, some timing inaccuracies may be present in the final tracefile. Such inaccuracies are known to be higher, due to clock granularity, if the XML is set to obtain Dimemas tracefiles but the resulting tracefile is forced to be in Paraver format.
7.1 You can download it from http://www.bsc.es/computer-sciences/performance-tools/downloads.
8.1 IA-64 architecture support was dropped in DynInst 7.0.
8.2 See http://www.fortran-2000.com/ArnaudRecipes/sharedlib.html for further information.