1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798991001011021031041051061071081091101111121131141151161171181191201211221231241251261271281291301311321331341351361371381391401411421431441451461471481491501511521531541551561571581591601611621631641651661671681691701711721731741751761771781791801811821831841851861871881891901911921931941951961971981992002012022032042052062072082092102112122132142152162172182192202212222232242252262272282292302312322332342352362372382392402412422432442452462472482492502512522532542552562572582592602612622632642652662672682692702712722732742752762772782792802812822832842852862872882892902912922932942952962972982993003013023033043053063073083093103113123133143153163173183193203213223233243253263273283293303313323333343353363373383393403413423433443453463473483493503513523533543553563573583593603613623633643653663673683693703713723733743753763773783793803813823833843853863873883893903913923933943953963973983994004014024034044054064074084094104114124134144154164174184194204214224234244254264274284294304314324334344354364374384394404414424434444454464474484494504514524534544554564574584594604614624634644654664674684694704714724734744754764774784794804814824834844854864874884894904914924934944954964974984995005015025035045055065075085095105115125135145155165175185195205215225235245255265275285295305315325335345355365375385395405415425435445455465475485495505515525535545555565575585595605615625635645655665675685695705715725735745755765775785795805815825835845855865875885895905915925935945955965975985996006016026036046056066076086096106116126136146156166176186196206216226236246256266276286296306316326336346356366376386396406416426436446456466476486496506516526536546556566576586596606616626636646656666676686696706716726736746756766776786796806816826836846856866876886896906916926936946956966976986997007017027037047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000100110021003100410051006100710081009101010111012101310141015101610171018101910201021102210231024102510261027102810291030103110321033103410351036103710381039104010411042104310441045104610471048104910501051105210531054105510561057105810591060106110621063106410651066106710681069107010711072107310741075107610771078107910801081108210831084108510861087108810891090109110921093109410951096109710981099110011011102110311041105110611071108110911101111111211131114111511161117111811191120112111221123112411251126112711281129113011311132113311341135113611371138113911401141114211431144114511461147114811491150115111521153115411551156115711581159116011611162116311641165116611671168116911701171117211731174117511761177117811791180118111821183118411851186118711881189119011911192119311941195119611971198119912001201120212031204120512061207120812091210121112121213121412151216121712181219122012211222122312241225122612271228122912301231123212331234123512361237123812391240124112421243124412451246124712481249125012511252125312541255125612571258125912601261126212631264126512661267126812691270127112721273127412751276127712781279128012811282128312841285128612871288128912901291129212931294129512961297129812991300130113021303130413051306130713081309131013111312131313141315131613171318131913201321132213231324132513261327132813291330133113321333133413351336133713381339134013411342134313441345134613471348134913501351135213531354135513561357135813591360136113621363136413651366136713681369137013711372137313741375137613771378137913801381138213831384138513861387138813891390139113921393139413951396139713981399140014011402 |
- .\" Copyright (C) 1994-2018 Altair Engineering, Inc.
- .\" For more information, contact Altair at www.altair.com.
- .\"
- .\" This file is part of the PBS Professional ("PBS Pro") software.
- .\"
- .\" Open Source License Information:
- .\"
- .\" PBS Pro is free software. You can redistribute it and/or modify it under the
- .\" terms of the GNU Affero General Public License as published by the Free
- .\" Software Foundation, either version 3 of the License, or (at your option) any
- .\" later version.
- .\"
- .\" PBS Pro is distributed in the hope that it will be useful, but WITHOUT ANY
- .\" WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
- .\" FOR A PARTICULAR PURPOSE.
- .\" See the GNU Affero General Public License for more details.
- .\"
- .\" You should have received a copy of the GNU Affero General Public License
- .\" along with this program. If not, see <http://www.gnu.org/licenses/>.
- .\"
- .\" Commercial License Information:
- .\"
- .\" For a copy of the commercial license terms and conditions,
- .\" go to: (http://www.pbspro.com/UserArea/agreement.html)
- .\" or contact the Altair Legal Department.
- .\"
- .\" Altair’s dual-license business model allows companies, individuals, and
- .\" organizations to create proprietary derivative works of PBS Pro and
- .\" distribute them - whether embedded or bundled with other software -
- .\" under a commercial license agreement.
- .\"
- .\" Use of Altair’s trademarks, including but not limited to "PBS™",
- .\" "PBS Professional®", and "PBS Pro™" and Altair’s logos is subject to Altair's
- .\" trademark licensing policies.
- .\"
- .TH pbs_mom 8B "8 November 2017" Local "PBS Professional"
- .SH NAME
- .B pbs_mom
- - run the PBS job monitoring and execution daemon
- .SH SYNOPSIS
- .B pbs_mom
- [-a <alarm timeout>]
- [-C <checkpoint directory>]
- .RS 8
- [-c <config file>]
- [-d <MoM home directory>]
- [-L <logfile>]
- .br
- [-M <MoM service port>]
- [-N]
- [-n <nice value>]
- [-p|-r]
- .br
- [-R <inter-MoM communication port>]
- [-S <server port>]
- .br
- [-s script_options]
- .RE
- .B pbs_mom
- --version
- .SH DESCRIPTION
- The
- .B pbs_mom
- command starts the PBS job monitoring and execution daemon, called
- MoM.
- The standard MoM starts jobs on the execution host, monitors and reports
- resource usage, enforces resource usage limits, and notifies the
- server when the job is finished. The MoM also runs any prologue
- scripts before the job runs, and runs any epilogue scripts after the
- job runs.
- The MoM performs any communication with job tasks and with other MoMs.
- The MoM on the first vnode on which a job is running manages
- communication with the MoMs on the remaining vnodes on which the job
- runs.
- The MoM manages one or more vnodes. PBS may treat a host as
- a set of virtual nodes, in which case one MoM
- manages all of the host's vnodes. See the
- .B PBS Professional Administrator's Guide.
- .B Logging
- .br
- The MoM's log file is in PBS_HOME/mom_logs. The MoM writes an
- error message in its log file when it encounters any error. If it
- cannot write to its log file, it writes to standard error. The
- MoM writes events to its log file.
- The MoM writes its PBS
- version and build information to the logfile whenever it starts up or
- the logfile is rolled to a new file.
- .B Required Permission
- .br
- The executable for
- .B pbs_mom
- is in PBS_EXEC/sbin, and can be run only by root on Linux, and Admin
- on Windows.
- .B Cpusets
- .br
- A cpusetted machine can have a "boot cpuset" defined by the
- administrator. A boot cpuset contains one or more CPUs and memory
- boards and is used to restrict the default placement of system
- processes, including login. If defined, the boot cpuset contains
- CPU 0.
- Run parallel jobs exclusively within a cpuset for repeatability of
- performance. HPE SGI states, "Using cpusets on an HPE SGI system improves
- cache locality and memory access times and can substantially improve
- an application's performance and runtime repeatability."
- The CPUSET_CPU_EXCLUSIVE flag prevents CPU 0 from being used by
- the MoM in the creation of job cpusets. This flag is set by default,
- so this is the default behavior.
- To find out which cpuset is assigned to a running job, use
- .B qstat
- -f
- to see the
- .I cpuset
- field in the job's
- .I altid
- attribute.
- .B HPE SGI Machine Running Supported Versions of Performance Software -
- Message Passing Interface
- .br
- The cpusets created for jobs are marked cpu-exclusive.
- MoM does not use any CPU which was in use at startup.
- A PBS job can run across multiple machines that run supported versions
- of Performance Software - Message Passing Interface.
- PBS can run using HPE SGI's MPI (MPT) over InfiniBand. See the
- .B PBS Professional Administrator's Guide.
- .LP
- .B Effect on Jobs of Starting MoM
- .br
- When MoM is started or restarted, her default behavior is to leave
- any running processes running, but to tell the PBS server to requeue
- the jobs she manages. MoM tracks the process ID of jobs across
- restarts.
- In order to have all jobs killed and requeued, use the
- .I r
- option when starting or restarting MoM.
- In order to leave any running processes running, and not to requeue
- any jobs, use the
- .I p
- option when starting or restarting MoM.
- .SH OPTIONS
- .IP "-a <alarm timeout>" 10
- Number of seconds before alarm timeout.
- Whenever a resource request is processed, an alarm is set for the
- given amount of time. If the request has not completed before
- .I alarm timeout,
- the OS generates an alarm signal and sends it to MoM.
- Default: 10 seconds. Format: integer.
- .IP "-C <checkpoint directory>" 10
- Specifies the path of the directory where MoM creates job-specific
- subdirectories used to hold each job's restart files. MoM passes this
- path to checkpoint and restart scripts. Overrides other checkpoint
- path specification methods. Any directory specified with the
- .I -C
- option must be owned, readable, writable, and executable by root only
- .I (rwx,---,---, or 0700),
- to protect the security of the checkpoint files. See the
- .I -d
- option. Format: string.
- .br
- Default: PBS_HOME/spool/checkpoint.
- .IP "-c <config file>" 10
- MoM will read this alternate default configuration file upon starting.
- If this is a relative file name it will be relative to
- PBS_HOME/mom_priv. If the specified file cannot be opened,
- .B pbs_mom
- will abort. See the
- .I -d
- option.
- MoM's normal operation, when the -c option is not given, is to attempt
- to open the default configuration file PBS_HOME/mom_priv/config.
- If this file is not present,
- .B pbs_mom
- will log the fact and continue.
- .IP "-d <MoM home directory>" 10
- Specifies the path of the
- .I directory
- to be used in place of PBS_HOME by
- .B pbs_mom.
- The default directory is given by $PBS_HOME. Format: string.
- .IP "-L <logfile>" 10
- Specifies an absolute path and filename for the log file.
- The default is a file named for the current date in PBS_HOME/mom_logs/.
- See the
- .I -d
- option. Format: string.
- .IP "-M <MoM port>" 10
- Specifies the number of the port on which MoM will
- listen for server requests and instructions. Overrides
- PBS_MOM_SERVICE_PORT setting in pbs.conf and environment variable.
- Default: 15002.
- Format: integer port number.
- .IP "-n <nice value>" 10
- Specifies the priority for the
- .B pbs_mom
- daemon. Format: integer.
- .IP "-N" 10
- Specifies that when starting, MoM should not detach from the
- current session.
- .IP "-p" 10
- Specifies that when starting, MoM should allow any running jobs
- to continue running, and not have them requeued. This option
- can be used for single-host jobs only; multi-host jobs cannot
- be preserved.
- Cannot be used with the
- .I -r
- option.
- MoM is not the parent of these jobs.
- .RS
- .IP "HPE SGI systems running Performance Software - Message Passing Interface" 5
- The cpuset-enabled
- .B pbs_mom
- will, if given the
- .I -p
- flag, use the existing CPU and memory allocations for the /PBSPro
- cpusets.
- The default behavior is to remove these cpusets.
- Should this fail, MoM will exit, asking to be restarted with the
- .I -p
- flag.
- .LP
- .RE
- .IP "-r" 10
- Specifies that when starting, MoM should requeue any rerunnable jobs and
- kill any non-rerunnable jobs that
- she was tracking, and mark the
- jobs as terminated. Cannot be used with the
- .I -p
- option.
- MoM is not the parent of these jobs.
- It is not recommended to use the
- .I -r
- option after a reboot, because process IDs of new, legitimate tasks
- may match those MoM was previously tracking. If they match and MoM is
- started with the
- .I -r
- option, MoM will kill the new tasks.
- .IP "-R <inter-MoM communication port>" 10
- Specifies the number of the port on which MoM will listen for pings,
- resource information requests, communication from other MoMs, etc.
- Overrides PBS_MANAGER_SERVICE_PORT setting in pbs.conf and environment variable.
- Default: 15003. Format: integer port number.
- .IP "-S <server port>" 10
- Specifies the port number on which
- .B pbs_mom
- initially contacts the server. Default: 15001. Format: integer port number.
- .IP "-s <script options>" 5
- This option provides an interface that allows the administrator to
- add, delete, and display MoM's configuration files. See
- .B CONFIGURATION FILES.
- The
- .I script options
- are used this way:
- .RS
- .IP "-s insert <scriptname> <inputfile>" 5
- Reads
- .I inputfile
- and inserts its contents in a new site-defined
- .B pbs_mom
- configuration file with the filename
- .I scriptname.
- If a
- site-defined configuration file with the name
- .I scriptname
- already exists,
- the operation fails, a diagnostic is presented, and
- .B pbs_mom
- exits with a nonzero status. Scripts whose names begin with
- the prefix "PBS" are reserved. An attempt to add a script
- whose name begins with "PBS" will fail.
- .B pbs_mom will print a diagnostic message and exit
- with a nonzero status. Example:
- .B pbs_mom -s insert <scriptname> <inputfile>
- .IP "-s remove <scriptname>" 5
- The configuration file named
- .I scriptname
- is removed
- if it exists. If the given name does not exist or if an
- attempt is made to remove a script with the reserved "PBS"
- prefix, the operation fails, a diagnostic is presented, and
- .B pbs_mom
- exits with a nonzero status. Example:
- .B pbs_mom -s remove <scriptname>
- .IP "-s show <scriptname>" 5
- Causes the contents of the named script to be printed to
- standard output. If
- .I scriptname
- does not exist, the
- operation fails, a diagnostic is presented, and
- .B pbs_mom
- exits with a nonzero status. Example:
- .B pbs_mom -s show <scriptname>
- .IP "-s list" 5
- Causes
- .B pbs_mom
- to list the set of PBS-prefixed and site-defined configuration
- files in the order in which they are executed. Example:
- .B pbs_mom -s list
- .LP
- .B WINDOWS:
- .RS 5
- Under Windows, the
- .I -N
- option must be used, so that
- .B pbs_mom
- will start up as a standalone
- program. For example:
- .B pbs_mom -N -s insert <scriptname> <inputfile>
- or
- .B pbs_mom -N -s list
- .RE
- .RE
- .LP
- .IP "--version" 10
- The
- .B pbs_mom
- command returns its PBS version information and exits.
- This option can only be used alone.
- .SH CONFIGURATION FILES
- MoM's configuration information can be contained in configuration
- files of three types:
- .I default, PBS-prefixed,
- and
- .I site-defined.
- The
- default configuration file is usually PBS_HOME/mom_priv/config. The
- "PBS" prefix is reserved for files created by PBS. Site-defined
- configuration files are those created by the site administrator.
- MoM reads the configuration files at startup and reinitialization.
- The files are processed in this order:
- .br
- The default configuration file
- .br
- PBS-prefixed configuration files
- .br
- Site-defined configuration files
- .br
- The contents of a file read later override the contents of a file read earlier.
- For example, to change the cpuset flags, create a script "update_flags"
- containing only
- .RS 4
- .B cpuset_create_flags <new flags>
- .RE
- then use the
- .I -s insert
- option:
- .RS 4
- .B pbs_mom -s insert update_script update_flags
- .RE
- This adds the configuration file "update_script".
- Configuration files can be added, deleted and displayed using
- the
- .I -s
- option.
- MoM's configuration files can use either the syntax shown
- below under
- .B Default Syntax and Contents
- or the syntax for describing
- .I vnodes
- shown in
- .B Vnode Syntax.
- .B Location
- .br
- The default configuration file is in PBS_HOME/mom_priv/.
- PBS places PBS-prefixed and site-defined configuration files
- in an area that is private to each installed instance of PBS.
- This area is relative to the default PBS_HOME. Note that the
- .I -d
- option changes where MoM looks for PBS_HOME.
- The
- .I -c
- option will change which default configuration file MoM reads.
- Site-defined configuration files can be moved from one installed
- instance of PBS to another. Do not move PBS-prefixed configuration
- files. To move a set of site-defined configuration files from one
- installed instance of PBS to another:
- .IP "1" 5
- Use the
- .I -s list
- directive with the "source" instance of PBS to enumerate the
- site-defined files.
- .IP "2" 5
- Use the
- .I -s show
- directive with each site-defined file of the "source" instance of PBS
- to save a copy of that file.
- .IP "3" 5
- Use the
- .I -s insert
- directive with each file at the "target" instance of PBS
- to create a copy of each site-defined configuration file.
- .LP
- .B Vnode Configuration File Syntax and Contents
- .br
- Configuration files with the following syntax describe vnodes and
- the resources available on them. They do not contain initialization
- values for MoM.
- See the
- .B PBS Professional Administrator's Guide
- for a definition of
- .I vnodes.
- PBS-prefixed configuration files use the following syntax. Other
- configuration files can use the following syntax.
- Any configuration file containing vnode-specific assignments must
- begin with this line:
- .RS 4
- .B $configversion 2
- .RE
- The format a file containing vnode information is:
- .RS 4
- .I <ID> : <ATTRNAME> = <ATTRVAL>
- .RE
- where
- .RS 4
- .IP "<ID>" 12
- sequence of characters not including a colon (":")
- .IP "<ATTRNAME>" 12
- sequence of characters beginning with alphabetics or numerics, which
- can contain underscore ("_") and dash ("-")
- .IP "<ATTRVAL>" 12
- sequence of characters not including an equal sign ("=")
- .LP
- The colon and equal sign may be surrounded by spaces.
- .RE
- A vnode's
- .I ID
- is an identifier that will be unique across all
- vnodes known to a given
- .B pbs_server
- and will be be stable across
- reinitializations or invocations of
- .B pbs_mom.
- ID stability is
- of importance when a vnode's CPUs or memory might be expected
- to change over time and PBS is expected to adapt to such changes
- by resuming suspended jobs on the same vnodes to which they
- were originally assigned. Vnodes for which this is not a
- consideration may simply use IDs of the form "0", "1", etc.
- concatenated with some identifier that ensures uniqueness across
- the vnodes served by the
- .B pbs_server.
- A
- .I natural vnode
- does not correspond to any actual hardware. It is used to define
- any placement set information that is invariant for a given host,
- such as pnames.
- It is defined as
- follows:
- .br
- .IP "" 5
- The name of the natural vnode is, by convention,
- the MoM contact name, which is usually the hostname.
- The MoM contact name is the vnode's MoM attribute. See the
- .B pbs_node_attributes(7B) man page.
- .IP "" 5
- An attribute, "pnames", with value set to the list of
- resource names that define the placement sets' types for
- this machine.
- .IP "" 5
- An attribute, "sharing" is set to the value "force_shared"
- .LP
- The
- .I natural vnode
- is used to define any placement set information that is invariant for
- a given host (e.g. the placement set resource names themselves).
- The order of the
- .I pnames
- attribute follows placement set organization. If
- name X appears to the left of name Y in this attribute's value, an
- entity of type X may be assumed to be smaller (that is, be
- capable of containing fewer vnodes) than one of type Y. No such
- guarantee is made for specific instances of the types.
- For example, on an HPE SGI machine named "HostA", with two vnodes, a natural
- vnode, four processors and two cbricks, the description would
- look like this:
- .br
- HostA: pnames = cbrick
- .br
- HostA: sharing = force_shared
- .br
- HostA[001c02#0]: sharing = default_excl
- .br
- HostA[001c02#0]: resources_available.ncpus = 2
- .br
- HostA[001c02#0]: resources_available.cbrick = cbrick-0
- .br
- HostA[001c02#0]: resources_available.mem = 1968448kb
- .br
- HostA[001c04#0]: sharing = default_excl
- .br
- HostA[001c04#0]: resources_available.ncpus = 2
- .br
- HostA[001c04#0]: resources_available.cbrick = cbrick-1
- .br
- HostA[001c04#0]: resources_available.mem = 1961328kb
- .br
- The natural vnode is described in the first two lines.
- The first vnode uses cbrick-0, and the second one uses cbrick-1.
- .B Default Syntax and Contents
- .br
- Configuration files with this syntax list local resources and
- initialization values for MoM. Local resources are either static,
- listed by name and value, or externally-provided, listed by name and
- command path. See the
- .I -c
- option.
- Each configuration item is listed on a single line, with its parts
- separated by white space. Comments begin with a hashmark ("#").
- The default configuration file must be secure. It must be owned by a user ID
- and group ID both less than 10 and must not be world-writable.
- .B Externally-provided Resources
- .br
- Externally-provided resources use a shell escape to run a command.
- These resources are described with a name and value,
- where the first character of the value is an exclamation mark ("!").
- The remainder of the value is the path and command to execute.
- Parameters in the command beginning with a percent sign ("%") can
- be replaced when the command is executed.
- For example, this line in a configuration file describes a
- resource named "escape":
- .RS 14
- escape !echo 0xx %yyy
- .RE
- .IP
- If a query for the "escape" resource is sent with no parameter replacements,
- the command executed would be "echo 0xx %yyy". If one parameter replacement is sent,
- "escape[xxx=hi there]", the command executed would be "echo hi there %yyy".
- If two parameter replacements are sent, "escape[xxx=hi][yyy=there]", the command
- executed would be "echo hi there". If a parameter replacement is sent with
- no matching token in the command line, "escape[zzz=snafu]", an error
- is reported.
- .LP
- .B Windows Notes
- .br
- If the argument to a MoM option is a pathname containing a space,
- enclose it in double quotes as in the following:
- hostn !"\\Program Files\\PBS Pro\\exec\\bin\\hostn" host
- When you edit any PBS configuration file, make sure that you put a
- newline at the end of the file. The Notepad application does not
- automatically add a newline at the end of a file; you must explicitly
- add the newline.
- .B Replacing Actions
- .br
- .IP "$action <default action> <timeout> <new action>" 5
- Replaces the
- .I default action
- for an event with the site-specified
- .I new action.
- .I timeout
- is the time allowed for
- .I new action
- to run. See
- .B The PBS Professional Administrator's Guide.
- The
- .I default action
- can be one of:
- .RS
- .IP "checkpoint" 5
- Run
- .I new action
- in place of the periodic job checkpoint, after which the job
- continues to run.
- .IP "checkpoint_abort" 5
- Run
- .I new action
- to checkpoint the job, after which the job must be terminated by the script.
- .IP "multinodebusy <timeout> requeue" 5
- Used with cycle harvesting and multi-vnode jobs.
- Changes default behavior when a vnode becomes busy. Instead of
- allowing the job to run, the job is requeued.
- .I timeout
- is ignored. The only
- .I new action
- is
- .I requeue.
- .IP "restart" 5
- Runs
- .I new action
- in place of
- .I restart.
- .IP "terminate" 5
- Runs
- .I new action
- in place of SIGTERM or SIGKILL when MoM terminates a job.
- .RE
- .SH Initialization Values
- Initialization value directives have names beginning with a
- dollar sign ("$").
- See
- .B The PBS Professional Administrator's Guide.
- .IP "$alps_client <path>" 5
- Cray only. MoM runs this command to get the ALPS inventory. Must
- be full path to command.
- .br
- Format: path to command
- .br
- Default: None
- .IP "$alps_release_timeout <timeout>" 5
- Cray only. Specifies the amount of time that PBS tries to release an
- ALPS reservation before giving up. After this amount of time has
- passed, PBS stops trying to release the ALPS reservation, the job
- exits, and the job's rsources are released. PBS sends a HUP to the
- MoM so that she re-reads the ALPS inventory to get the current
- available ALPS resources.
- .br
- We recommend that the value for this parameter be greater than the value for
- .I suspectbegin.
- .br
- Format: Seconds, specified as positive integer.
- .br
- Default: 600 (10 minutes)
- .IP "$checkpoint_path <path>" 5
- MoM passes this path to checkpoint and restart scripts.
- This path can be absolute or relative to PBS_HOME/mom_priv.
- Overrides default. Overridden by
- .I pbs_mom -C
- option and by
- .I PBS_CHECKPOINT_PATH
- environment variable.
- .IP "$clienthost <hostname>" 5
- .I hostname
- is added to the list of hosts which are allowed
- to connect to MoM as long as they are using a privileged port.
- For example,
- this allows the hosts "fred" and "wilma"
- to connect to MoM:
- .br
- "$clienthost fred"
- .br
- "$clienthost wilma"
- .br
- The following hostnames are added to
- .I $clienthost
- automatically: the
- server, the localhost, and if configured, the secondary server. The
- server sends each MoM a list of the hosts in the nodes file, and these
- are added internally to
- .I $clienthost.
- None of these hostnames need to
- be listed in the configuration file.
- Two hostnames are always allowed to connect to
- .B pbs_mom,
- "localhost" and the name returned to MoM
- by the system call gethostname(). These
- hostnames do not need to be listed in the configuration file.
- The hosts listed
- as "clienthosts" make up a "sisterhood" of machines. Any one of the
- sisterhood will accept connections from within the
- sisterhood. The sisterhood must all use the same port number.
- .IP "$cpuset_error_action <action>" 5
- When using a cpuset-enabled MoM, specifies the action taken when
- a cpuset creation error occurs. Can take one of the following values:
- .RS 5
- .IP continue 3
- The error is logged and the job is killed and requeued.
- .IP offline 3
- The vnodes on this host for this job are marked
- .I offline,
- and the job is requeued.
- .LP
- .br
- Format: String
- .br
- Allowable values: "continue", "offline"
- .br
- Default: "offline"
- .RE
- .IP "$cputmult <factor>" 5
- This sets a
- .I factor
- used to adjust CPU time used by each job. This allows adjustment of time
- charged and limits enforced where jobs run on a system with
- different CPU performance. If MoM's system is faster than the
- reference system, set
- .I factor
- to a decimal value greater than 1.0. For example:
- .RS 9
- $cputmult 1.5
- .RE
- .IP
- If MoM's system is slower, set
- .I factor
- to a value between 1.0 and 0.0. For example:
- .RS 9
- $cputmult 0.75
- .RE
- .IP
- .IP "$dce_refresh_delta <delta>" 5
- Defines the number of seconds between successive refreshings of a job's
- DCE login context.
- For example:
- .RS 9
- $dce_refresh_delta 18000
- .RE
- .IP
- .IP "$enforce <limit>" 5
- MoM will enforce the given
- .I limit.
- Some
- .I limits
- have associated values. Syntax:
- .br
- .I $enforce <variable name> <value>
- .br
- See
- .B The PBS Professional Administrator's Guide.
- .RS
- .IP "$enforce mem" 5
- MoM will enforce each job's memory limit.
- .IP "$enforce cpuaverage" 5
- MoM will enforce ncpus when the average CPU usage over a job's
- lifetime usage is greater than the job's limit.
- .RS
- .IP "$enforce average_trialperiod <seconds>" 5
- Modifies
- .I cpuaverage.
- Minimum number of
- .I seconds
- of job walltime before enforcement begins. Default: 120.
- Integer.
- .IP "$enforce average_percent_over <percentage>" 5
- Modifies
- .I cpuaverage.
- Gives
- .I percentage
- by which a job may exceed its ncpus limit. Default: 50.
- Integer.
- .IP "$enforce average_cpufactor <factor>" 5
- Modifies
- .I cpuaverage.
- The ncpus limit is multiplied by
- .I factor
- to produce actual
- limit. Default: 1.025. Float.
- .RE
- .IP "$enforce cpuburst" 5
- MoM will enforce the ncpus limit when CPU burst usage exceeds
- the job's limit.
- .RS
- .IP "$enforce delta_percent_over <percentage>" 5
- Modifies
- .I cpuburst.
- Gives
- .I percentage
- over limit to be allowed. Default: 50. Integer.
- .IP "$enforce delta_cpufactor <factor>" 5
- Modifies
- .I cpuburst.
- The ncpus limit is multiplied by
- .I factor
- to produce actual limit. Default: 1.5. Float.
- .IP "$enforce delta_weightup <factor>" 5
- Modifies
- .I cpuburst.
- Weighting factor for smoothing burst usage when average is increasing. Default: 0.4.
- Float.
- .IP "$enforce delta_weightdown <factor>" 5
- Modifies
- .I cpuburst.
- Weighting factor
- for smoothing burst usage when average is decreasing. Default: 0.4.
- Float.
- .RE
- .RE
- .IP "$ideal_load <load>" 5
- Defines the
- .I load
- below which the vnode is not considered to be busy.
- Used with
- the
- .I $max_load
- directive.
- No default. Float.
- .RS
- .IP "Example:" 5
- $ideal_load 1.8
- .LP
- .br
- Use of $ideal_load adds a static resource to the vnode called "ideal_load",
- which is only internally visible.
- .LP
- .RE
- .IP "$jobdir_root <stage directory root>
- Directory under which PBS creates job-specific staging and execution directories.
- PBS creates a job's staging and execution directory when the job's
- .I sandbox
- attribute is set to PRIVATE. If
- .I $jobdir_root
- is unset, it defaults to the job owner's home directory.
- In this case the user's home directory must exist.
- If
- .I stage_directory_root
- does not exist when MoM starts up, MoM will abort. If
- .I stage directory root
- does not exist when MoM tries to run a job, MoM will kill the job.
- Path must be owned by root, and permissions must be 1777. On Windows,
- this directory should have Full Control Permission for the local
- Administrators group.
- .RS
- .IP "Example:" 5
- $jobdir_root /scratch/foo
- .RE
- .IP "$kbd_idle <idle wait> <min use> <poll interval>" 5
- Declares that the vnode will be used for batch jobs during periods when
- the keyboard and mouse are not in use.
- The vnode must be idle for a minimum of
- .I idle wait
- seconds before being considered available for batch jobs.
- No default. Integer.
- The vnode must be in use for a minimum of
- .I min use
- seconds before it becomes unavailable for batch jobs. Default: 10. Integer.
- Mom checks for activity every
- .I poll interval
- seconds. Default: 1. Integer.
- .RS
- .IP "Example:" 5
- $kbd_idle 1800 10 5
- .RE
- .IP "$logevent <mask>" 5
- Sets the
- .I mask
- that determines which event types are logged by
- .B pbs_mom.
- To include all debug events, use 0xffffffff.
- .nf
- Log events:
- Name Hex Value Message Category
- ---------------------------------------------------
- ERROR 0001 Internal errors
- SYSTEM 0002 System errors
- ADMIN 0004 Administrative events
- JOB 0008 Job-related events
- JOB_USAGE 0010 Job accounting info
- SECURITY 0020 Security violations
- SCHED 0040 Scheduler events
- DEBUG 0080 Common debug messages
- DEBUG2 0100 Uncommon debug messages
- RESV 0200 Reservation-related info
- DEBUG3 0400 Rare debug messages
- DEBUG4 0800 Limit-related messages
- .fi
- .IP "$max_check_poll <seconds>" 5
- Maximum time between polling cycles, in seconds. Minimum recommended
- value: 30 seconds. See the
- .B PBS Professional Administrator's Guide
- for usage.
- The interval between each poll starts at
- .I $min_check_poll
- and increases with each cycle until it reaches
- .I $max_check_poll,
- after which it remains the same. The amount by which the cycle increases is 1/20 of
- the difference between
- .I $max_check_poll
- and
- .I $min_check_poll.
- .br
- Format: Integer
- .br
- Minimum value: 1 second
- .br
- Default value: 120 seconds
- .IP "$max_load <load> [suspend]" 5
- Defines the load above which the vnode is considered to be busy.
- Used with
- the
- .I $ideal_load
- directive. No new jobs are started on a
- .I busy
- vnode.
- The optional
- .I suspend
- directive tells PBS to suspend jobs running on
- the vnode if the load average exceeds the
- .I $ max_load
- number, regardless of the source of the load (PBS and/or logged-in users).
- Without this directive, PBS will not suspend jobs due to load.
- We recommend setting
- .I load
- to a value that is slightly higher than the number of CPUs,
- for example 0.25 +
- .I ncpus.
- .br
- Default: number of CPUs on machine
- .br
- Format: Float
- .br
- Example:
- .RS 8
- $max_load 3.5
- .RE
- .IP
- .IP "$min_check_poll <seconds>" 5
- Minimum time between polling cycles, in seconds. Must be
- greater than zero and less than
- .I $max_check_poll.
- Minimum recommended value: 10 seconds.
- .br
- Format: Integer.
- .br
- Minimum value: 1 second
- .br
- Default value: 10 seconds
- .IP "$prologalarm <timeout>" 5
- Defines the maximum number of seconds the prologue and epilogue
- may run before timing out. Default: 30 seconds. Integer.
- Example:
- .RS 8
- $prologalarm 30
- .RE
- .IP
- .IP "$reject_root_scripts <True | False>" 5
- When set to
- .I True,
- MoM won't aquire any new hook scripts, and MoM won't run job scripts that would execute
- as root or Admin. However, MoM will run previously-aquired hooks that run as root.
- .br
- Format: Boolean
- .br
- Default: False
- .IP "$restart_background <True | False>" 5
- Controls how MoM runs a restart script after checkpointing a job.
- When this option is set to
- .I True,
- MoM forks a child which runs the restart script. The child returns
- when all restarts for all the local tasks of the job are done. MoM
- does not block on the restart. When this option is set to
- .I False,
- MoM runs the restart script and waits for the result.
- .br
- Format: Boolean
- .br
- Default: False
- .IP "$restart_transmogrify <True | False>" 5
- Controls how MoM runs a restart script after checkpointing a job.
- When this option is set to
- .I True,
- MoM runs the restart script, replacing the session ID of the original
- task's top process with the session ID of the script.
- When this option is set to
- .I False,
- MoM runs the restart script and waits for the result. The restart
- script must restore the original session ID for all the processes of
- each task so that MoM can continue to track the job.
- When this option is set to
- .I False
- and the restart uses an external command, the configuration parameter
- .I restart_background
- is ignored and treated as if it were set to
- .I True,
- preventing MoM from blocking on the restart.
- .br
- Format: Boolean
- .br
- Default: False
- .br
- .IP "$restrict_user <True | False>" 5
- Controls whether users not submitting jobs have access to this
- machine. If
- .I value
- is
- .I True,
- restrictions are applied. See
- .I $restrict_user_exceptions
- and
- .I $restrict_user_maxsysid.
- Not supported on Windows.
- .br
- Format: Boolean
- .br
- Default: False
- .IP "$restrict_user_exceptions <user_list>" 5
- Comma-separated list of users who are exempt from access
- restrictions applied by
- .I $restrict_user.
- Leading spaces within each entry are allowed.
- Maximum of 10 names.
- .IP "$restrict_user_maxsysid <value>" 5
- Any user with a numeric user ID less than or equal to
- .I value
- is exempt from restrictions applied by
- $restrict_user.
- If
- .I $restrict_user
- is
- .I True
- and no
- .I value
- exists for
- .I $restrict_user_maxsysid,
- PBS looks in /etc/login.defs, if it exists, for the
- .I value.
- Otherwise the default is used.
- Integer. Default: 999
- .IP "$restricted <hostname>" 5
- The
- .I hostname
- is added to the list of hosts which are allowed to connect to MoM
- without being required to use a privileged port.
- Hostnames can be
- wildcarded. For example, to allow queries from any host from the
- domain "xyz.com":
- .RS 9
- $restricted *.xyz.com
- .RE
- .IP
- Queries from the hosts in the restricted list are only allowed
- access to information internal to this host, such as load
- average, memory available, etc. They may not run shell commands.
- .IP "$suspendsig <suspend signal> [resume signal]" 5
- Alternate signal
- .I suspend signal
- is used to suspend jobs instead of SIGSTOP. Optional
- .I resume signal
- is used to resume jobs instead of SIGCONT.
- .IP "$tmpdir <directory>" 5
- Location where each job's scratch directory will be created.
- PBS creats a temporary directory for use by the job, not by PBS.
- PBS creates the directory before the the job is run and removes
- the directory and its contents when the job is finished. It is
- scratch space for use by the job. Permission must be 1777 on
- Linux, writable by
- .I Everyone
- on Windows.
- Example:
- .RS 9
- $tmpdir /memfs
- .RE
- .IP
- Default on Linux: /var/tmp
- .br
- Default on Windows: value of the
- .I TMP
- environment variable
- .IP "$usecp <hostname:source prefix> <destination prefix>" 5
- MoM uses /bin/cp to deliver output files when
- the destination is a network mounted file system, or when
- the source and destination are both on the local host, or when
- the
- .I source prefix
- can be replaced with the
- .I destination prefix
- on
- .I hostname.
- Both
- .I source prefix
- and
- .I destination prefix
- are absolute pathnames of directories, not files.
- Overrides
- .I PBS_RCP
- and
- .I PBS_SCP.
- Use trailing
- slashes on both source and destination. For example:
- .RS 9
- $usecp HostA:/users/work/myproj/ /sharedwork/proj_results/
- .RE
- .IP
- .IP "$vnodedef_additive" 5
- Specifies whether MoM considers a vnode that appeared previously
- either in the inventory or in a vnode definition file, but that does
- not appear now, to be in her list of vnodes.
- .br
- When
- .I $vnodedef_additive
- is True, MoM treats missing vnodes as if they
- are still present, and continues to report them as if they are
- present. This means that the server does not mark missing vnodes as
- .I stale.
- .br
- When
- .I $vnodedef_additive
- is False, MoM does not list missing vnodes,
- the server's information is brought up to date with the inventory and
- vnode definition files, and the server marks missing vnodes as
- .I stale.
- .br
- Visible in configuration file on Cray only.
- .br
- Format: Boolean
- .br
- Default for MoM on Cray login node: False
- .IP "$wallmult <factor>" 5
- Each job's walltime usage is multiplied by
- .I factor.
- For example:
- .RS 9
- $wallmult 1.5
- .RE
- .IP
- .RE
- .B Cray-only Initialization Values
- .br
- .IP "pbs_accounting_workload_mgmt <value>" 5
- Controls whether CSA accounting is enabled. Name does not start
- with dollar sign. If set to "1", "on", or "true",
- CSA accounting is enabled. If set to "0", "off", or
- "false", accounting is disabled. Default: "true"; enabled.
- .RE
- .B HPE SGI-only Initialization Values
- .br
- .IP "cpuset_create_flags <flags>" 5
- Lists the flags for when MoM does a cpusetCreate(3) for each job.
- .I flags
- is an or-ed list of flags.
- The flags are:
- .RS
- .IP "HPE SGI formerly known as ICE, with Performance Software - Message Passing Interface" 5
- CPUSET_CPU_EXCLUSIVE
- .br
- 0
- .br
- Default: 0
- .LP
- .RE
- .IP "cpuset_destroy_delay <delay>" 5
- MoM waits up to
- .I delay
- seconds before destroying a cpuset of
- a just-completed job, but not longer than necessary.
- This gives the operating system more time to clean up leftover processes
- after they have been killed.
- Example:
- .RS 9
- cpuset_destroy_delay 10
- .RE
- .IP
- Default for HPE SGI machines: 0
- .br
- Format: Integer
- .IP "memreserved <megabytes>" 5
- .B Deprecated.
- The amount of per-vnode memory reserved for system overhead.
- This much memory is deducted from the value of
- .I resources_available.mem
- for each vnode managed by this MoM.
- .br
- Example:
- .RS 9
- memreserved 16
- .RE
- .IP
- Default: 0MB.
- .br
- .RE
- .B Static Resources
- .br
- Static resources local to the vnode are described
- one resource to a line,
- with a name and value separated by white space.
- For example, tape drives of different types could be specified by:
- .RS 15
- .nf
- .B tape3480 \ \ 4
- .B tape3420 \ \ 2
- .B tapedat \ \ \ \ 1
- .B tape8mm \ \ \ \ 1
- .fi
- .RE
- .RE
- .SH FILES AND DIRECTORIES
- .IP $PBS_HOME/mom_priv 10
- Default directory for default configuration files.
- .IP $PBS_HOME/mom_priv/config 10
- MoM's default configuration file.
- .IP $PBS_HOME/mom_logs 10
- Default directory for log files written by MoM.
- .IP $PBS_HOME/mom_priv/prologue 10
- File containing administrative script to be run before job execution.
- .IP $PBS_HOME/mom_priv/epilogue 10
- File containing administrative script to be run after job execution.
- .SH SIGNAL HANDLING
- .B pbs_mom
- handles the following signals:
- .IP SIGHUP 10
- The
- .B pbs_mom
- daemon rereads its configuration files, closes and reopens the log
- file, and reinitializes resource structures.
- .IP SIGALRM 10
- MoM writes a log file entry. See the
- .I -a alarm_timeout
- option.
- .IP SIGINT 10
- The
- .B pbs_mom
- daemon exits, leaving all running jobs still running.
- See the
- .I -p
- option.
- .IP SIGKILL 10
- This signal is not caught. The
- .B pbs_mom
- daemon exits immediately.
- .IP "SIGTERM, SIGXCPU, SIGXFSZ, SIGCPULIM, SIGSHUTDN" 10
- The
- .B pbs_mom
- daemon terminates all running children and exits.
- .IP "SIGPIPE, SIGUSR1, SIGUSR2, SIGINFO" 10
- These are ignored.
- .LP
- All other signals have their default behavior installed.
- .SH EXIT STATUS
- .IP "Greater than zero" 5
- If the
- .B pbs_mom
- daemon fails to start
- .br
- If the
- .I -s insert
- option is used with an existing
- .I scriptname
- .br
- If the administrator attempts to add a script whose name
- begins with "PBS"
- .br
- If the administrator attempts to use the
- .I -s remove
- option on a nonexistent configuration file, or on a configuration
- file whose name begins with "PBS"
- .br
- If the administrator attempts to use the
- .I -s show
- option on a nonexistent script
- .SH SEE ALSO
- The
- .B PBS Professional Administrator's Guide,
- pbs_server(8B),
- pbs_sched(8B),
- qstat(1B)
|