Technical Documentation

  1. Abstract
  2. About Moab
  3. Login
  4. Building a job script
  5. Submitting a job with PBS options
  6. Monitoring jobs
  7. Fairshare

 

Abstract

Moab is a Workload Manager product from Adaptive Computing. In August 2013, Moab was selected as the common batch scheduler for UIC’s new Extreme High Performance Cluster. This tutorial presents the essentials of using Moab on the Extreme cluster: how to build batch job scripts, how to submit them with PBS options, and how to monitor your jobs.

About Moab

  • Moab is a Workload Manager product of Adaptive Computing, Inc. (www.adaptivecomputing.com), formerly known as Cluster Resources, Inc.
  • A Workload Manager can be loosely called a batch scheduler, but there can be important differences, depending upon what “flavor” of batch scheduler you’re talking about.
  • Resource Manager:
    • Manages batch jobs for a single cluster
    • Includes a job launch facility as well as a simple FIFO job queue
    • Includes user commands for interacting with jobs
    • Has an interface into the cluster’s high-speed interconnect
    • Examples: SLURM, LoadLeveler, TORQUE …
  • Workload Manager:
    • A scheduler that ties a number of resource managers together into one domain
    • Allows a job to run on any cluster in the domain regardless of which cluster it was actually submitted from
    • Implements the policies that govern job priority (e.g., fair-share) and job limits, and consolidates resource collection and accounting
    • Provides a single and consistent user interface despite differences in the underlying native resource managers
    • Examples: LCRM, PBS, Moab, LSF …

Moab vs. TORQUE

 

Login

 

If you have been granted access, log in to the cluster using your UIC netID and ACCC common password.
Use an SSH client to connect to login-1.extreme.uic.edu. On Unix, Linux, and OS X systems, run “ssh login-1.extreme.uic.edu -l <netid>” (where <netid> is your netID). If you run Windows and do not have SSH and SCP clients, you can download PuTTY and PSCP for free (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html).
Cluster scheduling is handled by the Moab Workload Manager, version 7.2.5. Jobs are submitted with the “qsub” command, described later in this tutorial.
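
For example, a user with the (hypothetical) netID jdoe would connect from a Unix-like system as follows:

    % ssh login-1.extreme.uic.edu -l jdoe

After entering the ACCC common password, you are placed in a shell on the login node, from which job scripts are created and submitted.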

If you need assistance, please submit a ticket by emailing extreme@uic.edu.

 

Building a job script

  • Just like other batch systems, users submit jobs to Moab for scheduling by means of a job script
  • A job script is a plain text file that you create with your favorite editor.
  • Job scripts can include any/all of the following:
    • Commands, directives and syntax specific to a given batch system
    • Shell scripting
    • References to environment variables
    • Names of executable(s) to run
    • Comment lines and white space
  • Moab supports the following types of batch job scripts:
    • LSF (Platform Computing)
    • PBS (Altair)
    • LoadLeveler (IBM)
    • TORQUE (Adaptive Computing – very similar to PBS)
    • SSS XML (Scalable Systems Software Job Object Specification)
  • Recommendation: Always include your preferred shell as the first line in your batch script. Otherwise, you may inherit the default /bin/sh shell. For example:
    #!/bin/csh
    #!/bin/tcsh
    #!/bin/ksh
  • Batch scheduler syntax is parsed upon job submission. Shell scripting is parsed at runtime. Therefore, it is entirely possible to successfully submit a job to Moab that has shell script errors which won’t cause problems until later when the job actually runs.
  • Submitting binary executables directly (without a script) is not advised.
  • A simple Moab job control script appears below. The various #PBS options are discussed in the Submitting a job with PBS options section.
#!/bin/csh
##### These lines are for Moab
#PBS -l nodes=16
#PBS -l partition=cab
#PBS -l walltime=2:00:00
#PBS -q pbatch
#PBS -m be
#PBS -V
#PBS -o /p/lscratchb/joeuser/par_solve/myjob.out

##### These are shell commands
date
cd /p/lscratchb/joeuser/par_solve
srun -n128 a.out
echo 'Done'
  • NOTE: All #PBS lines must come before shell script commands.
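  • For example, in the sketch below (an illustrative fragment only, not a complete job), the second directive follows a shell command, so it is treated as an ordinary comment and ignored by the scheduler:
#!/bin/csh
##### Honored: appears before any shell command
#PBS -l nodes=16

##### These are shell commands
date

##### Ignored: appears after a shell command, so it is read as a plain comment
#PBS -l walltime=2:00:00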

 

Submitting a job with PBS options

  • The qsub command is used to submit your job script to Moab. Upon successful submission, Moab returns the job’s ID and spools it for execution. For example:
    % qsub myjobscript
    
    226783
    
    % qsub  -q pdebug myjobscript
    
    227243

     

  • The qsub command has a number of options that can be used either in your job script or on the command line. Some of the more common/useful options are shown below, followed by an example script that combines several of them – see the Moab documentation for a full discussion.
  • PBS options

    Script form  /  Command-line form
    Description/Notes

    #PBS -a  /  -a
      Declares the time after which the job is eligible for execution.
      Syntax (brackets delimit optional items, with the default being the current date/time):
      [CC][YY][MM][DD]hhmm[.SS]

    #PBS -A account  /  -A account
      Defines the account associated with the job.

    #PBS -d path  /  -d path
      Specifies the directory in which the job should begin executing.

    #PBS -e filename  /  -e filename
      Defines the file name to be used for stderr.

    #PBS -h  /  -h
      Puts a user hold on the job at submission time.

    #PBS -j oe  /  -j oe
      Combines stdout and stderr into the same output file. This is the default. If you want to give the combined stdout/stderr file a specific name, also include the -o filename flag.

    #PBS -l string  /  -l string
      Defines the resources that are required by the job. See the discussion below for this important flag.

    #PBS -m option(s)  /  -m option(s)
      Defines the set of conditions (a=abort, b=begin, e=end) under which the server will send a mail message about the job to the user.

    #PBS -N name  /  -N name
      Gives a user-specified name to the job. Note that job names do not appear in all Moab job info displays, and they do not determine how your job’s stdout/stderr files are named.

    #PBS -o filename  /  -o filename
      Defines the file name to be used for stdout.

    #PBS -p priority  /  -p priority
      Assigns a user priority value to a job.

    #PBS -q queue  or  #PBS -q queue@host  /  -q queue
      Runs the job in the specified queue (pdebug, pbatch, etc.). A host may also be specified if it is not the local host.

    #PBS -r y  /  -r y
      Automatically reruns the job if there is a system failure. The default behavior is NOT to automatically rerun a job in such cases.

    #PBS -v list  /  -v list
      Exports the specified (comma-separated) list of environment variables to the job.

    #PBS -V  /  -V
      Declares that all environment variables in the qsub environment are exported to the batch job.

    #PBS -W  /  -W
      This option has been deprecated and should be ignored.
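
  • To illustrate, the sketch below combines several of the options above into one job script. The job name, account, queue, resource amounts, and output file name are placeholders for illustration only; substitute values appropriate to your own work:
#!/bin/csh
##### Moab/PBS directives (placeholder values)
#PBS -N my_analysis
#PBS -A myaccount
#PBS -q pbatch
#PBS -l nodes=4
#PBS -l walltime=1:00:00
#PBS -m be
#PBS -j oe
#PBS -o myjob.out
#PBS -V

##### Shell commands
date
cd /path/to/my/working/directory
./a.out
echo 'Done'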

 Usage Notes:

  • You can submit jobs for any machine within your Moab grid, but not for machines outside your grid. See the Moab documentation on grid configurations for details.
  • After you submit your job script, changes to the contents of the script file will have no effect on your job because Moab has already spooled it to system file space.
  • Users may submit and queue as many jobs as they like, up to a configuration-defined limit. The number of jobs a user may actually have running at once is usually lower. These limits may vary between machines.
  • If a command-line qsub option conflicts with the same option in your script file, the script option generally overrides what was specified on the command line.
  • In your job script, the #PBS token is case insensitive, however the parameters it specifies are case sensitive.
  • By default, your job starts in the directory from which it was submitted. If you need to be in another directory, have your job script cd to that directory or use the -d flag, as in the example below.
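    For example, to start the job in a specific directory (the path shown is a placeholder):
    % qsub -d /scratch/myproject myjobscript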

 Discussion on the -l Option:

  • The -l option is one of the most important qsub options. It is used to specify a number of resource requirements for your job.
  • Syntax for the -l option is very specific, and the relevant Moab documentation can be difficult to find. Currently, the “Resource Manager Extensions” section of the Moab Administrator’s Guide is the best place to look.
  • Examples:
    Example(s) and Description

    -l depend=jobid
      Dependency upon completion of another job. jobid is the Moab job ID of the job that must complete first (see the example at the end of this section).

    -l feature=lustre
    -l feature=32GB
      Requirement for a specific node feature. Use the mdiag -t command to see what features are available on a node.

    -l gres=filesystem
    -l gres=filesystem,filesystem
    -l gres=ignore
      Job requires the specified parallel Lustre file system(s). Valid labels are the names of LC Lustre parallel file systems, such as lscratchrza, lscratchb, lscratch1 .... The purpose of this option is to prevent jobs from being scheduled if the specified file system is unavailable. The default is to require all mounted lscratch file systems. The ignore descriptor can be used for jobs that do not require a parallel file system, enabling them to be scheduled even if there are parallel file system problems.

    -l nodes=256
      Number of nodes. The default is one.

    -l partition=cab
      Run the job on a specific cluster in a Moab grid.
    -l partition=zin|juno
      Run the job on either of the named clusters in a Moab grid.
    -l partition=hera:cab
      Run the job on either of the named clusters in a Moab grid.
    -l partition=ALL
      Run the job on any cluster in a Moab grid.

    -l procs=256
      Number of processes. This option can be used instead of the nodes= option, and Moab will automatically figure out how many nodes to use.

    -l qos=standby
      Quality of service (standby, expedite).

    -l resfailpolicy=ignore
      Try to keep the job running if a node fails.
    -l resfailpolicy=requeue
      Requeue the job automatically if a node fails.

    -l signal=14@120
    -l signal=SIGHUP@2:00
      Signaling: specifies the pre-termination signal to be sent to a job at the desired time before expiration of the job’s wall clock limit. The default time is 60 seconds.

    -l ttc=8
      Stands for “total task count”. See the Moab documentation for details.

    -l walltime=600
    -l walltime=12:00:00
      Wall clock time. The default units are seconds; HH:MM:SS format is also accepted.

     

  • If more than one resource needs to be specified, the best thing to do is to use a separate #PBS -l line for each resource. For example:
    #PBS -l nodes=64
    #PBS -l qos=standby
    #PBS -l walltime=2:00:00
    #PBS -l partition=cab
  • Alternatively, you can include all resources on a single #PBS -l line, separated with commas and NO white space. For example:
    #PBS -l nodes=64,qos=standby,walltime=2:00:00,partition=cab

    WARNING: White space causes problems in -l option specifications, which Moab may or may not report. For example, white space after any comma will cause the rest of the line to be ignored without an error message, while white space on either side of the equal sign will cause the job to be rejected with an error message.
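
  • Example: submitting dependent jobs with -l depend (referenced in the table above). Assuming, as in the qsub examples earlier in this section, that qsub prints only the numeric job ID, you can capture it and pass it to a dependent submission (script names and job IDs shown are illustrative):
    % qsub first_step
    226783
    % qsub -l depend=226783 second_step
    226784

    Or, capturing the job ID in a csh variable:
    % set jobid = `qsub first_step`
    % qsub -l depend=$jobid second_step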

Monitoring jobs

There are several job monitoring commands; the most useful are described below.

 showq:

  • Jobs for all partitions in the Moab grid will be displayed – not just those for the machine you issue the command from. See the Moab documentation on grid configurations for details.
  • The showq command has several options. A few that may prove useful include:
    • -r shows only running jobs plus additional information such as partition, qos, account and start time.
    • -i shows only idle jobs plus additional information such as priority, qos, account and class.
    • -b shows only blocked jobs
    • -p partition shows only those jobs on a specified partition. Can be combined with -r, -i and -b to further narrow the scope of the display.
    • -c shows recently completed jobs.
  • Possible output fields:
        JOBID = unique job identifier
        S = state, where R means Running, etc.
        PAR = partition
        EFFIC = cpu efficiency of the job. Not applicable at LC.
        XFACTOR = expansion factor (QueueTime + WallClockLimit) / WallClockLimit
        Q = quality of service, abbreviated to 2 characters; no=normal, st=standby, etc.
        USERNAME = owner of job
        ACCNT = account
        MHOST = master host running primary task for job
        NODES = number of nodes used
        REMAINING = time left to run
        STARTTIME = when job started
        CCODE = user defined exit/completion code
        WALLTIME = wall clock time used by job
        COMPLETIONTIME = when the job finished

     

  • See the Moab documentation for other options and further explanation of the output fields.
  • Examples below.
    % showq -r
    active jobs------------------------
    JOBID          S  PAR  EFFIC  XFACTOR  Q  USERNAME    ACCNT            MHOST NODES   REMAINING            STARTTIME
    
    258442         R  azt   0.00      1.0 no    rrfong  micphys          aztec85     1  8:07:07:29  Wed Sep 19 10:33:17
    258274         R  azt   0.00      1.0 no    rrfong  micphys          aztec19     1  8:07:07:27  Wed Sep 19 10:33:15
    258271         R  azt   0.00      1.0 no    ewwong    medcm          aztec70     1  8:06:33:21  Wed Sep 19 09:59:09
    [ some output deleted here ]
    
    256918         R  cab   0.00      1.4 no       wwu   elmctl           cab102    32     3:21:28  Tue Sep 18 22:52:58
    257192         R  cab   0.00      1.2 no       wwu   elmctl           cab101    32     3:21:28  Tue Sep 18 22:52:58
    256242         R  cab   0.00      1.7 no    e3pin1     pls2           cab112    20    00:44:42  Tue Sep 18 20:41:12
    
    58 active jobs        18480 of 19664 processors in use by local jobs (93.98%)
                          1167 of 1229 nodes active      (94.96%)
    
    Total jobs:  58

    % showq -i -p cab
    
    eligible jobs----------------------
    JOBID            PRIORITY  XFACTOR  Q  USERNAME    ACCNT  NODES     WCLIMIT     CLASS      SYSTEMQUEUETIME
    
    256553*          -1183652     47.1 no    reesom  asccasc     25    00:30:00    pdebug   Tue Sep 18 12:26:18
    257966*          -1197006      1.1 no      wjrt     pls2    193    16:00:00    pbatch   Wed Sep 19 09:20:20
    258275*          -1201240      1.1 no   mm55ale     pls2     32    16:00:00    pbatch   Wed Sep 19 10:00:25
    
    [ some output deleted here ]
    
    256677*          -1453594      2.3 no  miw2wros michigan     96    16:00:00    pbatch   Tue Sep 18 14:13:16
    256678*          -1453594      2.3 no  miw2wros michigan     96    16:00:00    pbatch   Tue Sep 18 14:13:39
    256682           -1453594      2.3 no  miw2wros michigan     96    16:00:00    pbatch   Tue Sep 18 14:13:55
    
    51 eligible jobs   
    
    Total jobs:  51

    % showq -c
    
    completed jobs---------------------
    JOBID           S CCODE         PAR  EFFIC  XFACTOR  Q  USERNAME    ACCNT            MHOST NODES    WALLTIME       COMPLETIONTIME
    
    258461          R 0               - ------      0.0 no   b44ham2    wprod                -     1    00:00:00                     -
    256774          R CNCLD           - ------      0.0 no    wwng23    wprod                -     1    00:00:00                     -
    256606          R 0               - ------      0.0 no   depr4ro      eng                -     1    00:00:00              
    
    [ some output deleted here ]
    
    254349          C 0             azt   0.00     15.3 no  schai332 latticgc          aztec82     1    00:52:54   Wed Sep 19 11:30:12
    254358          C 0             azt   0.00     15.3 no  schai332 latticgc          aztec15     1    00:48:57   Wed Sep 19 11:30:58
    258156          C 0             cab   0.00      1.3 no    lane39      eng            cab89    16    00:54:00   Wed Sep 19 11:31:10
    
    15582 completed jobs   (purgetime: 4:00:00:00)  
    
    Total jobs:  15582

 checkjob:

  • Displays detailed job state information and diagnostic output for a selected job.
  • The checkjob command is probably the most useful user command for troubleshooting your job, especially if used with the -v flag. Sometimes, additional diagnostic information can be viewed by using multiple “v”s: -vv or -v -v.
  • This command can also be used for completed jobs.
  • See the Moab documentation for other options and an explanation of the output fields.
  • Examples below.
    % checkjob -v 257903
    job 257903
    
    AName: inch03
    State: Running 
    Creds:  user:dwwn13  group:dwwn13  account:bdivp  class:pbatch  qos:normal
    WallTime:   3:06:18 of 12:00:00
    BecameEligible: Wed Sep 19 08:27:28
    SubmitTime: Wed Sep 19 08:26:42
      (Time Queued  Total: 00:00:48  Eligible: 00:00:27)
    
    StartTime: Wed Sep 19 08:27:30
    Job Templates: fs
    TemplateSets:  fs.set
    NodeMatchPolicy: EXACTNODE
    Total Requested Tasks: 240
    Total Requested Nodes: 16
    
    Req[0]  TaskCount: 256  Partition: cab  
    Dedicated Resources Per Task: PROCS: 1  lscratchb: 1  lscratchc: 1  lscratchd: 1
    NodeAccess: SINGLEJOB
    NodeCount:  16
    
    Allocated Nodes:
    cab[42,197,200,235,274,497,515,575,711,720,762,906,1128,1155,1210,1251]*16
    
    SystemID:   omoab
    SystemJID:  257903
    Task Distribution: cab1155,cab575,cab42,cab906,cab1128,cab497,cab1251,cab720,cab197,cab762,cab274,...
    
    IWD:            /p/lscratchc/n1user/dwwn13_runs/nasa/extend/symm1/inch03
    SubmitDir:      /p/lscratchc/n1user/dwwn13_runs/nasa/extend/symm1/inch03
    UMask:          0077 
    Executable:     /var/opt/moab/spool/moab.job.dAlTNj
    
    OutputFile:     inch03.o%j (-)
    NOTE:  stdout/stderr will be merged
    StartCount:     1
    User Specified Partition List:   cab
    System Available Partition List: cab,hera,aztec
    Partition List: cab
    SrcRM:          internal  DstRM: cab  DstRMJID: 257903
    Flags:          PREEMPTOR,GLOBALQUEUE
    Attr:           fs.set
    Variables:      RHOME=/g/g11/dunn13
    StartPriority:  0
    Task Range:     1 -> 99999
    PE:             256.00
    Reservation '257903' (-3:07:21 -> 8:52:39  Duration: 12:00:00)

    % checkjob 258506 (shows a completed job)
    job 258506
    
    AName: mxterm
    State: Completed 
    Completion Code: 0  Time: Wed Sep 19 11:35:18
    Creds:  user:d33f  group:d33f  account:lc  class:pdebug  qos:normal
    WallTime:   00:00:07 of 00:05:00
    SubmitTime: Wed Sep 19 11:34:26
      (Time Queued  Total: 00:00:45  Eligible: 00:00:00)
    
    Job Templates: fs
    Total Requested Tasks: 1
    
    Req[0]  TaskCount: 1  Partition: aztec  
    Dedicated Resources Per Task: PROCS: 1  lscratchb: 1  lscratchc: 1  lscratchd: 1
    
    Allocated Nodes:
    [aztec7:1]
    
    SystemID:   omoab
    SystemJID:  258506
    
    IWD:            /g/g0/d33f
    Executable:     /var/opt/moab/spool/moab.job.bE1Clh
    
    Execution Partition:  aztec
    Flags:          PREEMPTOR,GLOBALQUEUE,PROCSPECIFIED
    Variables:      RHOME=/g/g0/donf
    StartPriority:  0

    % checkjob 303 (shows a job with a problem)
    job 303
    
    State: Idle 
    Creds:  user:blaise  group:blaise  account:cs  class:pbatch  qos:standby
    WallTime:   00:00:00 of 00:30:00
    SubmitTime: Tue Mar 13 09:57:55
      (Time Queued  Total: 00:01:02  Eligible: 00:00:00)
    
    Total Requested Tasks: 2
    
    Req[0]  TaskCount: 2  Partition: ALL  
    Memory >= 1M  Disk >= 1M  Swap >= 0
    Opsys:   ---  Arch: ---  Features: ---
    NodesRequested:  2
    
    IWD:            $HOME/moab
    Executable:     /var/opt/moab/spool/moab.job.cUOwWF
    Partition Mask: [hera]
    Flags:          PREEMPTEE,IGNIDLEJOBRSV,GLOBALQUEUE
    Attr:           PREEMPTEE
    StartPriority:  2188189
    Holds:          Batch:NoResources  
     NOTE:  job cannot run  (job has hold in place)
    BLOCK MSG: job hold active - Batch (recorded at last scheduling iteration)

 mdiag -j:

  • Provides a one-line display of information for each job.
  • Jobs for all partitions in the Moab grid will be displayed – not just those for the machine you issue the command from. See the Moab documentation on grid configurations for details.
  • The mdiag -j -v syntax provides additional information.
  • Examples below.
    % mdiag -j 75025
    JobID                 State Proc     WCLimit     User  Opsys  Class Features
    
    75025               Running   32     365days  st78itz      - pdebug -

    % mdiag -j
    
    JobID                 State Proc     WCLimit     User  Opsys  Class Features
    
    635                 Running 8512    10:00:00   chards      - pbatch -
    637                    Idle 1064    10:00:00   chards      - pbatch -
    657                    Idle    1  1:00:00:00   kkkert      - pbatch -
    664                    Idle    1    00:30:00   r55ndo      - pbatch -
    76343               Running   16     2:00:00     t6ee      - pdebug -
    76346               Running  128     2:00:00   x55ser      - pdebug -

 

Fairshare

UIC operates a shared cluster model: individual investors buy in to specific nodes, but when those nodes are idle, other researchers and investors can use them, which makes for more effective utilization of the cluster. We operate under a fairshare model that protects each investor’s share of the machine. Based on the number of nodes purchased, users and groups receive a proportionate priority to use the cluster; for example, a group that purchased 10% of the nodes has a fairshare target of roughly 10% of cluster usage, and its jobs gain priority while the group runs below that target and lose priority while it runs above it. To better understand fairshare, please see the diagram below.

Fairshare Scenario Diagram
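
To see how fairshare currently affects you, the mdiag command introduced above also has a fairshare mode. A minimal sketch, assuming the standard Moab mdiag -f option is available on the login node:

    % mdiag -f

This typically lists fairshare targets and recent usage for users, groups, and accounts, which feed into the job priority described above.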

*- Documentation adapted from https://computing.llnl.gov/tutorials/moab/