NAME
CORE - File or directory produced by faulting process(es).
SYNOPSIS
#include <sys/core.h>
DESCRIPTION
The operating system writes a core file when a process in an
application terminates with an error. A process that ter-
minates with an error is called a faulting process. The fol-
lowing support is provided for core files:
o Core files are created for parallel and non-parallel
applications.
o Environment variables control where and how core files are
generated.
o The coreinfo command displays information about core
files.
The most common errors that generate core files for applica-
tions are memory violations, illegal instructions, bus
errors, and user-generated quit signals. The following sig-
nals may result in core files being written: SIGQUIT,
SIGILL, SIGTRAP, SIGABRT, SIGFPE, SIGBUS, SIGSEGV, SIGSYS,
SIGIOT, and SIGEMT. See the signal(4) manual page for more
information.
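
The signal list above can be written as a small C predicate. This is only a sketch: the function name may_dump_core is hypothetical, and the preprocessor guards are an assumption added for portability, because not every system defines SIGEMT, and SIGIOT is often the same number as SIGABRT.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if signo is one of the signals
 * that this page lists as able to produce a core file. */
bool may_dump_core(int signo)
{
    switch (signo) {
    case SIGQUIT:
    case SIGILL:
    case SIGTRAP:
    case SIGABRT:
    case SIGFPE:
    case SIGBUS:
    case SIGSEGV:
    case SIGSYS:
#if defined(SIGIOT) && SIGIOT != SIGABRT
    case SIGIOT:    /* only a separate case where it differs from SIGABRT */
#endif
#ifdef SIGEMT
    case SIGEMT:    /* not defined on every system */
#endif
        return true;
    default:
        return false;
    }
}
```

A signal handler or a test harness could call may_dump_core(signo) to decide whether to expect a core file after a process terminates.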
If a faulting process is in a regular non-parallel applica-
tion, the operating system writes a core file named core. If
a faulting process is in a parallel application, the follow-
ing happens by default:
o The entire application is terminated.
o A directory named core is created.
o In the core directory, the operating system creates a core
file for the first faulting process. The core file gets a
name with the following form: core.pid, where pid is the
process ID (PID) for the process.
o In the core directory, the operating system writes the
file named allocinfo containing global information about
the parallel application.
If a core file or directory exists at the time of a fault,
it is silently removed (if permissions allow) and a new one
is created. A message is displayed only if there is insuffi-
cient file space to write the core file or if the core
directory was created but cannot be written into due to
access permission.
If the core action environment variables are set such that
more than one faulting process dumps core, the bootmagic
string PARACORE_MAX_PROCESSES limits how many processes dump
core.
SPECIFYING THE LOCATION OF CORE FILES
By default, the core file or directory is created in the
current working directory for the application. The environ-
ment variable CORE_PATH can be used to change this default
location.
CORE_PATH
Directory pathname where the core file or direc-
tory is created. The default is the working direc-
tory in which you execute an application.
Since it is possible to change directories or to change an
environment variable from within an application, the follow-
ing rules apply as to their effect on the placement of core:
o For non-parallel applications, any change of directory or
environment variable (for example, chdir(), or putenv())
that occurs prior to the fault can affect the location of
core.
o For parallel applications (compiled with -nx), any change
of directory or environment variable within the applica-
tion does not affect the location of core.
o For host/node applications (compiled with -lnx), any
change of directory or environment variable that is per-
formed by the host prior to a fault in any process can
affect the location of core.
The following example specifies that a directory
/usr/develop/corefiles is where core files or directories
should be placed:
% setenv CORE_PATH /usr/develop/corefiles
NOTE
Writing core files to a PFS directory is not allowed.
If CORE_PATH specifies a directory in a PFS file sys-
tem, a core file or core-file directory will NOT be
created when an application faults.
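
The default naming and location rules described so far can be sketched in C as follows. The helper name expected_core_path is hypothetical. It reads CORE_PATH from the environment of the calling process, so it reflects the non-parallel rule; for parallel applications, in-application changes to the environment or directory do not apply, as noted above. Treating an empty CORE_PATH as unset is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes into buf the path where the core file
 * for process pid is expected by default.
 *   non-parallel application: <dir>/core
 *   parallel application:     <dir>/core/core.<pid>
 * <dir> is CORE_PATH if set (assumption: and non-empty), otherwise the
 * current working directory.
 * Returns 0 on success, -1 if the path does not fit in buf or the
 * working directory cannot be determined. */
int expected_core_path(char *buf, size_t len, int parallel, pid_t pid)
{
    char cwd[4096];
    const char *dir = getenv("CORE_PATH");
    int n;

    if (dir == NULL || dir[0] == '\0') {
        if (getcwd(cwd, sizeof cwd) == NULL)
            return -1;
        dir = cwd;
    }
    if (parallel)
        n = snprintf(buf, len, "%s/core/core.%ld", dir, (long)pid);
    else
        n = snprintf(buf, len, "%s/core", dir);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With CORE_PATH set to /usr/develop/corefiles, a parallel process with PID 42 would yield /usr/develop/corefiles/core/core.42.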
DEFINING CORE FILE TYPES, FREQUENCIES, AND ACTIONS
More than one process in a parallel application may ter-
minate with an error. This happens because there is a period
of time between when the first faulting process terminates
and the other processes in the application terminate. You
can use the following environment variables to specify how
and when core files are created:
CORE_ACTION_FIRST
Specifies how the operating system handles the
first faulting process in a parallel or
non-parallel application.
CORE_ACTION_FAULT
Specifies how the operating system handles fault-
ing processes other than the first faulting pro-
cess in a parallel application.
CORE_ACTION_OTHER
Specifies how the operating system handles
non-faulting processes in a parallel application.
You can specify the following values for the environment
variables CORE_ACTION_FIRST, CORE_ACTION_FAULT, and
CORE_ACTION_OTHER:
FULL Creates a full core file that includes the entire
data region. Default action for first faulting
process (both parallel and non-parallel applica-
tions).
TRACE Creates a partial core that includes the register
and stack information only.
KILL Stops the application without creating core files.
Default action for faulting (except the first
faulting process) and non-faulting processes.
You can also specify the following value for the environment
variable CORE_ACTION_OTHER:
CONT Continue executing. Do not stop or create core
files.
The following example specifies how core files or core-file
directories are created:
% setenv CORE_ACTION_FIRST FULL
% setenv CORE_ACTION_FAULT TRACE
% setenv CORE_ACTION_OTHER CONT
This example specifies creating a full core file for the
first faulting process in an application, a core file with
trace information only for other faulting processes, and no
core files for non-faulting processes, which continue
executing.
The default values for the environment variables maximize
the debug information available and minimize the file space
needed for core files. The environment variables allow you
to adjust the amount of core dumped based on your knowledge
of the application size (number of processes used) and the
file space available.
Use the following guidelines when setting the environment
variables for core files or directories:
o The CONT action can be specified only for
CORE_ACTION_OTHER. If CONT is specified for either
CORE_ACTION_FIRST or CORE_ACTION_FAULT, the error will not
be caught until an application faults. When the applica-
tion faults, a warning is displayed on the console and the
default action for each core action environment variable
is assumed.
o If CORE_ACTION_FIRST, CORE_ACTION_FAULT, and
CORE_ACTION_OTHER are all set to not create core files
(that is, KILL or CONT), then no core file or core direc-
tory is created and an existing one is untouched.
o For parallel applications, combinations where
CORE_ACTION_FIRST is set to KILL but CORE_ACTION_FAULT and
CORE_ACTION_OTHER are not set to KILL may result in a core
directory being created with an allocinfo file and nothing
else. For example, if CORE_ACTION_FIRST and
CORE_ACTION_OTHER are set to KILL and CORE_ACTION_FAULT is
set to FULL when a single process in an application
faults, a core directory is created with nothing but an
allocinfo file.
o For parallel applications executing on a system that has
the PARACORE_MAX_PROCESSES bootmagic string set to 1 (the
default), only CORE_ACTION_FIRST is effective. The system
ignores CORE_ACTION_FAULT and CORE_ACTION_OTHER. Under
these conditions, the first faulting process is the only
process that can dump core.
o For parallel applications executing on a system that has
the PARACORE_MAX_PROCESSES bootmagic string set to a value
greater than one, the system applies CORE_ACTION_FAULT and
CORE_ACTION_OTHER only when the application size is less
than or equal to the bootmagic string's setting.
o For parallel applications executing on a system that has
the PARACORE_MAX_PROCESSES bootmagic string set to -1
(unlimited), the system applies CORE_ACTION_FIRST,
CORE_ACTION_FAULT and CORE_ACTION_OTHER as described in
the first four bulleted text items.
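
The guidelines above can be modelled in C roughly as follows. This is a sketch, not the operating system's implementation: all type and function names are hypothetical. Two behaviors are assumptions because this page does not specify them: an unset or unrecognized value falls back to that variable's default, and a PARACORE_MAX_PROCESSES value other than -1, 1, or greater than 1 is treated like 1.

```c
#include <stdbool.h>
#include <string.h>

enum core_action { ACT_FULL, ACT_TRACE, ACT_KILL, ACT_CONT, ACT_INVALID };

struct core_actions { enum core_action first, fault, other; };

/* Map a value string to an action. NULL (variable unset) and unknown
 * strings are reported as ACT_INVALID. */
static enum core_action parse_action(const char *s)
{
    if (s == NULL)               return ACT_INVALID;
    if (strcmp(s, "FULL") == 0)  return ACT_FULL;
    if (strcmp(s, "TRACE") == 0) return ACT_TRACE;
    if (strcmp(s, "KILL") == 0)  return ACT_KILL;
    if (strcmp(s, "CONT") == 0)  return ACT_CONT;
    return ACT_INVALID;
}

/* Resolve the three CORE_ACTION_* values to effective actions. */
struct core_actions resolve_actions(const char *first, const char *fault,
                                    const char *other)
{
    const struct core_actions defaults = { ACT_FULL, ACT_KILL, ACT_KILL };
    struct core_actions a;

    a.first = parse_action(first);
    a.fault = parse_action(fault);
    a.other = parse_action(other);

    /* CONT is valid only for CORE_ACTION_OTHER; if it appears elsewhere,
     * the default is assumed for every variable. */
    if (a.first == ACT_CONT || a.fault == ACT_CONT)
        return defaults;

    /* Assumption: unset or unrecognized values use the default. */
    if (a.first == ACT_INVALID) a.first = defaults.first;
    if (a.fault == ACT_INVALID) a.fault = defaults.fault;
    if (a.other == ACT_INVALID) a.other = defaults.other;
    return a;
}

/* True if CORE_ACTION_FAULT and CORE_ACTION_OTHER are applied for an
 * application of app_size processes, given PARACORE_MAX_PROCESSES. */
bool fault_and_other_apply(long max_processes, long app_size)
{
    if (max_processes == -1)          /* unlimited */
        return true;
    if (max_processes > 1)
        return app_size <= max_processes;
    return false;                     /* 1 (default): only FIRST is effective */
}
```

For example, resolve_actions("CONT", "TRACE", "CONT") yields the defaults FULL, KILL, KILL, because CONT is not permitted for CORE_ACTION_FIRST.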
MAXIMUM SIZE FOR CORE FILES
The maximum size of each individual core file is limited by
the setrlimit() function. Core files that exceed the limit
are not created.
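
As a minimal sketch, a process can lower its own core file size limit before doing any work. The example assumes the limit is the standard RLIMIT_CORE resource, as in POSIX; this page does not name the resource. The helper name limit_core_size is hypothetical.

```c
#include <sys/resource.h>

/* Hypothetical helper: sets the soft core-size limit of the calling
 * process to max_bytes, clamped to the current hard limit.
 * Returns 0 on success, -1 on error. */
int limit_core_size(rlim_t max_bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    /* An unprivileged process cannot raise the soft limit above the
     * hard limit, so clamp the request. */
    if (rl.rlim_max != RLIM_INFINITY && max_bytes > rl.rlim_max)
        max_bytes = rl.rlim_max;
    rl.rlim_cur = max_bytes;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling limit_core_size(0) early in main() asks for a zero limit, which on POSIX systems suppresses core files for that process.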
EXAMINING CORE FILES
You can use the coreinfo command and the IPD tool to examine
core files. The coreinfo command displays summary informa-
tion about processes that have dumped core. The IPD tool
displays stack tracebacks and data stored in core files.
See the coreinfo(1) and ipd manual pages for more informa-
tion about this command and tool.
CORE FILE FORMAT
The format and a brief description of the content of a core
file are described below. Refer to the indicated include
files for structure details.
Core header
Identifies the core file, the process that gen-
erated the file (relative to the partition), and
the offset from the start of the core file to the
beginning of each of the other sections in the
core file. The following structure is defined in
the include file core.h.
struct core_hdr {
int c_magic; /* core file magic number */
unsigned short c_swap; /* byte swap field */
short c_version; /* core file version */
int c_type; /* core file type */
struct timeval c_timdat; /* time & date core file created */
int c_signo; /* signal that killed process */
int c_sigcode; /* signal subcode */
long c_numnodes; /* number of nodes in the partition */
long c_node; /* logical node number of the dumping process */
long c_ptype; /* ptype of the dumping process */
off_t c_procinfo; /* offset of process info */
off_t c_applinfo; /* offset of application info (APPLINFO_T) */
long c_nregions; /* number of region descriptors */
off_t c_firstreg; /* offset of first region desc in core file */
long c_nthreads; /* number of active threads */
off_t c_firstthread; /* offset of first thread_info in core file */
long c_activethread; /* index of last active (faulting) thread */
};
Process information
Identifies the process (relative to other
processes), the executable, and its arguments. The
program name is stored as it is given to the exec
system call, thus, the root and current directory
information is needed so that a full path name can
be constructed. Offsets are relative to the start
of the core file. The following structure is
defined in the include file core.h.
struct core_proc_info {
pid_t c_pid; /* process id */
pid_t c_ppid; /* parent process id */
pid_t c_pgid; /* process group leader id */
long c_prglen; /* length of program path */
off_t c_prgname; /* offset of relative path name of program */
long c_rootlen; /* length of exec root directory path */
off_t c_rootname; /* offset of exec root directory path */
long c_cwdlen; /* length of exec current directory path */
off_t c_cwdname; /* offset of exec current directory path */
};
Application information
Identifies the execution control characteristics.
All other application information and partition
information is contained in a separate file called
allocinfo written in the core directory. The fol-
lowing structure is defined in the include file
mcmsg/mcmsg_appl.h.
typedef struct applinfo {
unsigned long app; /* application id */
unsigned long process_lock; /* lock process data in memory */
unsigned long pkt_size; /* message packet size */
unsigned long memory_buffer; /* total message buffer to allocate */
unsigned long memory_export; /* total buffer for other nodes */
unsigned long memory_each; /* buffer available for each node */
unsigned long send_threshold; /* send multiple packet threshold */
unsigned long send_count; /* pkts to send when send_threshold */
unsigned long give_threshold; /* send give message threshold */
unsigned long noc; /* number of correspondents */
unsigned long rows; /* number of rows in application */
unsigned long columns; /* number of columns in application */
unsigned long unused[12];
} APPLINFO_T;
Region descriptors
Describes all regions for a process. The offset
value is null if the contents of the region were
not written to the region section of the core
file. The following structure is defined in the
include file core.h.
struct core_region_desc {
off_t r_offset; /* offset of VM region in core file */
vm_address_t r_vaddr; /* virtual address of region start */
vm_size_t r_size; /* region size (bytes) */
vm_prot_t r_prot; /* VM protection (e.g. VM_PROT_READ) */
};
Thread information
Contains state (register) information for each
Mach thread in the process. The structure consists
of a list of all the registers and is defined in
mach/machine/thread_status.h.
Region contents
Contains memory image of regions described in the
region descriptors section.
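
The relationship between the core header and the region descriptors can be illustrated with a short C routine that counts the regions whose contents were actually dumped. This is a sketch under several assumptions: the descriptors are stored contiguously starting at c_firstreg, the file was written with the reader's byte order (the c_swap field is not examined), and the stand-in structure below replaces the vm_* types with plain integer types. On the real system, use the definitions from core.h instead. The function name is hypothetical.

```c
#include <stdio.h>
#include <sys/types.h>

/* Stand-in for struct core_region_desc from core.h. The vm_* types are
 * replaced by plain integer types, which is an assumption. */
struct region_desc {
    off_t         r_offset;  /* offset of VM region in core file; 0 if not written */
    unsigned long r_vaddr;   /* virtual address of region start */
    unsigned long r_size;    /* region size (bytes) */
    int           r_prot;    /* VM protection */
};

/* Counts the regions whose contents were written to the core file.
 * firstreg and nregions correspond to c_firstreg and c_nregions in the
 * core header. Returns the count, or -1 on a seek or read error. */
long count_dumped_regions(FILE *fp, off_t firstreg, long nregions)
{
    struct region_desc rd;
    long i, dumped = 0;

    if (fseeko(fp, firstreg, SEEK_SET) != 0)
        return -1;
    for (i = 0; i < nregions; i++) {
        if (fread(&rd, sizeof rd, 1, fp) != 1)
            return -1;
        if (rd.r_offset != 0)    /* null offset: contents were not dumped */
            dumped++;
    }
    return dumped;
}
```

A TRACE core would be expected to report fewer dumped regions than a FULL core of the same process, since only register and stack information is kept.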
ALLOCINFO FILE FORMAT
Information for the entire application that is not locally
available to a node's server is written to the file allo-
cinfo within the core directory. The following structures
show the format and content of the allocinfo file. These
structures are defined in the include file allocinfo.h.
struct allocinfo_hdr {
int a_magic; /* allocinfo file magic number */
unsigned short a_swap; /* byte swap field */
short a_version; /* allocinfo file version */
nx_part_info_t a_partition; /* NX partition info */
nx_app_info_t a_application; /* NX application info */
};
typedef struct {
uid_t uid; /* User Id */
gid_t gid; /* Group Id */
int access; /* Access Permissions */
int sched; /* NX_STD or NX_GANG */
unsigned long rq; /* Rollin Quantum */
int epl; /* Effective priority limit */
int nodes; /* Number of nodes in the partition */
/* NOTES: mesh_x and mesh_y are only set if the mesh is a contiguous */
/* rectangle. Otherwise they are -1. */
int mesh_x; /* X dimension of partition */
int mesh_y; /* Y dimension of partition */
/* The enclose_mesh_x and enclose_mesh_y are the
* minimum rectangular dimensions that will enclose
* the partition. These dimensions may enclose
* nodes that are not part of the specified partition.
*/
int enclose_mesh_x;
int enclose_mesh_y;
int flags_or_size; /* Internal Use only */
int part_id; /* Internal Use only */
int free; /* Internal Use only */
int reserved[7];
} nx_part_info_t;
typedef struct {
int size; /* Number of nodes in application */
int nrows; /* X dimension of application, 1 if
* nodes are not contiguous
*/
int ncols; /* Y dimension of application, set to size
* if nodes are not contiguous
*/
int priority; /* Priority of application */
unsigned long rolled_in; /* Milliseconds this appl rolled in */
unsigned long elapsed; /* Milliseconds this appl rolled in */
uid_t uid; /* UID of user running application */
gid_t nx_acctid; /* NX account id (MACS)
* of application */
struct tm start_time; /* Time stamp of when
* application started
*/
} nx_app_info_t;
LIMITATIONS AND WORKAROUNDS
There is no means of limiting the size of a core directory
as a whole.
Once an application faults and core file creation begins, it
cannot be interrupted.
SEE ALSO
commands: coreinfo, bootmagic, pspart, nx_pspart
OSF/1 Programmer's Reference: setrlimit(2), signal(4)