- How do I diagnose and/or debug an
MPI problem where the code hangs?
- Do we have installation for C++ binding
for mpi++ (on AIX)?
- What causes the .mpirun[process num]
files to appear (on AIX)?
- Is there any limitation on how many
processes my MPI job can fork on a system?
- I am using MPI_Gather and receiving the
error "trying to receive a message when there are no connections." Why?
- What do these errors mean? "MPIRUN
chose the wrong device ch_shmem; program needs device ump2", "ump2main.c:227
Internal Error, The magic is missing." (on GPS machine)
- I am receiving a run-time error: "mpi
Invalid communicator error." Why?
- What is the maximum number of MPI
tasks a job may have? (on Purple, uP, UM, UV machines)
- I don't understand this error message:
"MPI: INTERNAL ERROR catalog was closed, or catalog was not initialized." (on
IBM machines)
- I don't understand this error message:
MPID Die - ump2main.c:453 "ump_init failure." (on SC cluster)
- Fortran compiler "mpiifc"
does not include a link to "mpi-io" libraries. (on ALC, MCR)
- I get a segmentation fault when I compile
with 64-bit MPI. (on Purple, uP, UM, UV)
- What does this warning message mean,
and how can I eliminate it: "weak symbol multiply defined"?
- Are there any web pages or other
documents that might provide "stack size information and strategy" for
the users?
- What does this error mean? "semget
failed for setnum = 0 Abort." (on ILX)
- How can I determine which version
of MPICH is running on a machine?
- What is simultaneous multithreading?
- What are the expected performance
gains for SMT-capable systems?
- I have heard people refer to "the
hypervisor" on the Purple and uP machines. What is the hypervisor? Do all
LLNL machines have hypervisors?
Q: How do I diagnose and/or debug an MPI problem
where the code hangs?
A: Try these steps:
- Try to run with one process; is the error still present? If not, try running
with two processes; is the error still present?
- Try running the job under the debugger (TotalView). When you think the job
is hung, interact with the debugger to determine where the hang is occurring
(i.e., what part of your code or MPI is involved). For example, are you in a
loop sending messages (infinitely), or are you hung because you are waiting for
an event that doesn't appear to happen, such as a message receive?
- Check your MPI environment variables (env | grep
MP_). Are these the right settings for your MPI use?
If none of these things help, give the LC-Hotline more details on what your
code is doing, such as:
- Are you using the vendor's MPI or MPICH?
- Size of messages and number of messages being sent?
- Does this same job run successfully if you run it as a batch job? interactive
job? Are you setting different environment variables with a batch version than
with the interactive version?
- Does your job read input interactively? Is there a message number prefixing
the error message, and what is it? (e.g., 032-xxxx or some such form).
Q: Do we have installation for C++ binding for
mpi++ (on AIX)?
A: The IBM MPI supports C++. There is no mpi++, but there are C++ interfaces
that are provided for C++ codes that make MPI calls. At the present time, IBM
supports only the C compatibility for C++, not the C++ interfaces that were added
for MPI-2. You need to #include <mpi.h>.
You also need to load with the MPI library, which is automatic when you use the
mpCC (C++) MPI script for compiling/linking with the IBM MPI library.
Q: What causes the .mpirun[process num] files to
appear (on AIX)?
A: The .mpirun[process number] files are created when mpirun is used to launch
IBM MPI jobs. You should be using poe instead; mpirun is for launching MPICH
MPI jobs, not a general-purpose parallel job launcher.
Q: Is there any limitation on how many processes
my MPI job can fork on a system?
A: There are no limits other than those imposed by the batch system.
Q: I am using MPI_Gather and receiving the error "trying
to receive a message when there are no connections." Why?
A: You called MPI_Barrier on MPI_COMM_WORLD, but not all processes
executed this call, so the collective could not establish communication with every process.
Q: What do these errors mean? "MPIRUN chose
the wrong device ch_shmem; program needs device ump2", "ump2main.c:227
Internal Error, The magic is missing." (on GPS machine)
A: This happens when a user is mixing and matching the Compaq MPI with
MPICH. Check to see if you have an explicit -lmpi being
loaded. This would (inadvertently) add the Compaq MPI library, and probably before
the -lmpich that would be loaded automatically
by the mpicc script. MPI
Libraries/Building Executables explains that there are two different MPIs
available on GPS. One is the Compaq MPI, whose executables must be run with dmpirun.
The other is MPICH, whose executables must be built and run with the standard
MPICH scripts (mpicc and mpirun). If you want the MPICH version, you should
recompile and load with the mpicc (or mpiCC or mpif90) script, and then run
your executable with mpirun. (By the way, the default MPICH
on GPS is a shared-memory version; there is also a P4 version.) Please note that
the Compaq MPI provides optimizations for the Alphas that MPICH cannot, but you
must run the executable with dmpirun to get the correct initialization of the
MPI environment.
Q: I am receiving a run-time error: "mpi Invalid
communicator error". Why?
A: This sort of error occurs when MPICH header files are mixed with
IBM's MPI libraries, or vice versa. As a first step, decide which version of MPI
you wish to use on White. The IBM MPI is the recommended version. If you are
using MPICH, it is recommended you use the MPICH scripts (mpicc, mpirun, etc.)
that provide the correct -I and -L paths
and the correct libraries. It is possible that you are using explicit -I or -L options
that are no longer valid; this could result in locating the incorrect header
files.
Q: What is the maximum number of MPI tasks a job
may have? (on Purple, uP, UM, UV machines)
A: Up to 4096 User Space tasks. Up to 2048 IP tasks.
Q: I don't understand this error message: "MPI:
INTERNAL ERROR catalog was closed, or catalog was not initialized". (on
IBM machines)
A: Compile with the -binitfini:poe_remote_main linker flag
that is required for POE applications. This will give informative error messages
that indicate any linking problems.
Q: I don't understand this error message: MPID
Die - ump2main.c:453 "ump_init failure" (on SC cluster)
A: This means that not enough shared memory is available. Run mpiclean,
then execute your parallel code again. If the problem persists, try running on
another node of the cluster. If the problem still persists, call the LC-Hotline
and ask them to have the shared-memory cleaned up.
Q: Fortran compiler "mpiifc" does not
include a link to "mpi-io" libraries. (on ALC, MCR)
A: This is a known issue. We are still waiting for the vendor (Quadrics)
to provide a Fortran library libfmpi.a with the mpiio routines included. In the
meantime, users can modify their code to link with the C library libmpi.a which
includes the mpiio routines.
Q: I get a segmentation fault when I compile with
64-bit MPI. (on Purple, uP, UM, UV)
A: To compile with 64-bit MPI:
- Compile with flags -q64 -qwarn64. This
will tell you about all the illegal conversions from int to pointer and back.
- Add -brtl -L... after -q64 for
the link line only (the -brtl can screw
up normal compiles to object files).
- Do not use the flags -bmaxdata or -bmaxstack in
64-bit mode compilations. In 32-bit mode, these give you more memory. In 64-bit
mode, they restrict your memory usage. In 64-bit mode, the default is unlimited.
- Set environment variable: setenv OBJECT_MODE 64
- You cannot mix 32-bit and 64-bit items, so make sure your entire code has
been compiled with these options. If you are loading any of your own libraries,
they too must be compiled with the 64-bit options.
- Do not explicitly include -lxlf90 -lm -lc in
your link line. These should not be necessary, and it is possible for this to
cause problems (usually link problems). We recommend taking out these unless
there is a good reason to have them.
- Caveat: Most 64-bit codes that get a segmentation fault on White are not
prototyping the malloc routine. Your best bet would be to run TotalView on the
executable and see if the segmentation fault happens in C code. I suspect that
a C library (which you link with) is calling malloc. Look for all C routines
that deal with memory allocation and add # include <stdlib.h> in
them. In 32-bit mode, malloc/calloc/etc. works properly even if stdlib.h is not
included. In 64-bit mode, not having the prototype causes the pointer to get
corrupted, resulting in seg faults. On other 64-bit platforms such as IRIX and
Tru64, they eventually modified their compilers to automatically prevent this
type of error for malloc/calloc/etc. (sort of an automatic prototyping) because
of all the problems caused. Make sure a prototype for array_alloc() that returns
a pointer is visible from everywhere it is being used. Any function that returns
a pointer but doesn't have a prototype visible will cause problems. The compiler
will do the wrong thing every time otherwise.
Q: What does this warning message mean, and how
can I eliminate it: "weak symbol multiply defined"?
A: The -w option to mpiCC is passed to
g++, and it suppresses compilation warning messages only. To suppress the load
(multiply-defined symbol) messages, you should try -Wl,-s.
This will pass the -s to the loader (ld). If
the -Wl,-s option is not satisfactory, you could
also try the g++ option to turn off weak symbol support, -fno-weak.
Q: Are there any web pages or other documents
that might provide "stack size information and strategy" for the users?
A: Read Jeff Fier's excellent summary of thread
stack usage.
Q: What does this error mean? "semget failed
for setnum = 0 Abort" (on ILX)
A: This error means that not enough shared memory is available. There
are memory segments left over from a crash that need to be cleaned up. Users
should run mpiclean to clean up any memory segments that may have been left on
the node. If the problem persists, contact the LC-Hotline.
Q: How can I determine which version of MPICH
is running on a machine?
A: To determine which version of MPICH is running on a machine, you may
use the -compile_info or -link_info options
to any of the MPICH compilation scripts, such as mpicc. For example, typing mpicc
-compile_info identifies the default version of MPICH used on the ILX
machine as 1.2.4. On all elan-based clusters (ALC, MCR, Pengra, Emperor, Adelie,
and Lilac), we run the same version of MPI. This MPI (provided by Quadrics) is
version 1.24-8 and is based on MPICH 1.2.4.
Q: What is simultaneous multithreading?
A: Simultaneous multithreading (SMT) is not hard to understand. In traditional
designs, the entire collection of functional units in the CPU belong to one process
at a time. A process can therefore be executing instructions in some or all of
the functional units of the processor and nobody else can be using it at that
instant. With a feature that we called hardware multithreading in the RS64 line
of processors a few years ago, we provided additional hardware resources that
allowed two processes to have their state, essentially, on chip. When a process
had a cache miss that would normally stall it, it would switch to the other process,
the other thread, with a three-cycle pause. So this was still only one process
executing at a time, but could switch back and forth between two of them very,
very rapidly.
In SMT we have widened the data path somewhat to allow a thread indicator
on each instruction. So, we actually can fetch from two different instruction
streams and have instructions from two different instruction streams issuing
simultaneously to the different functional units on the chip.
We currently support two threads on the system, and it is a very general-purpose
mechanism. You can have instructions from different threads in different pipeline
stages of the same functional unit just following each other through. And it
provides the ability to use the hardware, the processor functional units much
more efficiently.
[Adapted from text provided by John McCalpin, IBM]
Q: What are the expected performance gains for
SMT-capable systems?
A: It varies from negative in some cases to gains of up to 60% in others.
It is not at all unusual to see a 20 to 30% speedup on applications without
doing anything special.
[Adapted from text provided by John McCalpin, IBM]
Q: I have heard people refer to "the Hypervisor" on
the Purple and uP machines. What is the Hypervisor? Do all LLNL machines have
Hypervisors?
A: The Hypervisor is software present on IBM POWER5-based machines. Traditionally,
the operating system's job is to provide an interface between the user and the
hardware and to provide protection so that the user cannot access any part of
the hardware in an uncontrolled way. In current directions, especially in server
consolidation projects, one finds that you want to run multiple operating systems
on the same piece of hardware, especially with all of the operating system exploits
and security problems that are happening.
So, what IBM has done is added a new
operating system, essentially, called the Hypervisor, that sits between the operating
systems and the hardware. The Hypervisor is modestly complicated, but it's enough
smaller than an operating system that one can have a lot more confidence about
its reliability. And now when the operating system wants to interact with the
hardware, it has to do it only through the Hypervisor or with the permission
of the Hypervisor. So that you can have, for example, multiple Linux kernels
running on the same hardware, and even if there is a security problem in Linux
and someone compromises the kernel, that kernel is still prevented from interfering
with any of the other partitions on the machine.
[Adapted from text provided
by John McCalpin, IBM]