
Condor-G Metascheduling

Condor-G combines the inter-domain resource management protocols of the Globus Toolkit with the intra-domain resource and job management methods of Condor to manage Grid jobs. Like Globus, it lets you run a job on the grid from a single site with a single submit script. In addition, Condor-G provides advanced job submission and monitoring capabilities: the Condor-G job manager automatically handles file transfers and job I/O while using the Globus Toolkit for job launching. The Condor-G distribution also provides a useful tool called DAGMan for defining job dependencies.

Prepare to Use Condor-G

Condor-G is installed on many XSEDE login systems and you can include it in your environment by running module load condor-g. On some systems, you may need to use SoftEnv instead of Modules to configure your environment. On these systems, execute soft add +condor-g.
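For example, run whichever of the following applies to your login system:

% module load condor-g    # on Modules-based systems
% soft add +condor-g      # on SoftEnv-based systems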

Condor-G uses Globus to run your jobs on remote systems, so you need a user proxy certificate. The recommended way to obtain one is to run the myproxy-logon command, which contacts the XSEDE MyProxy server and generates a certificate for you. For more information on single sign-on, see the Accessing Resources documentation.
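A typical invocation looks like the following sketch; the -l flag and the grid-proxy-info check are standard Globus Toolkit usage, but your site may also require -s <myproxy server> to name the XSEDE MyProxy host explicitly:

% myproxy-logon -l <xsede_username>    # prompts for your MyProxy passphrase
% grid-proxy-info                      # verify the proxy and its remaining lifetime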

Create a Single-Process Job

To run a job via Condor-G, you must first create a Condor submit script that describes the job. A script that runs a single-process job is shown below.

# Lines beginning with # symbol are comments

# Submissions to XSEDE must be through the globus universe
universe = globus
# Name of executable to run. Needs full path; ~ does not work.
executable = /home/ncsa/jdoe/single/a.out

# Command-line argument list
arguments = 100 210

# false means that executable is already on remote machine
# true means to copy the executable from the local machine
# to the remote
transfer_executable = false

# Where to submit the job - the fork jobmanager on Lonestar at TACC
globusscheduler = gatekeeper.lonestar.xsede.org/jobmanager

# Set up names for standard output and error and log files
output = condor1.out
error = condor1.err
log = condor1.log

# To charge your job to a specific allocation,
# specify the identifier of the allocation as the project below
globusrsl = &(project=<projectid>)

# The following line is required. It is the command that
# actually submits this to the Condor-G queue.
queue 

This script assumes that the executable and any necessary input files are already on the remote machine.
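If the executable instead lives on your local machine, Condor-G can stage it for you; a minimal sketch, assuming the remote jobmanager supports Condor-G's GASS-based file staging (the input file names are illustrative):

# Copy the executable from the local machine to the remote machine
transfer_executable = true
executable = /home/ncsa/jdoe/single/a.out

# Stage local input files to the remote job directory (illustrative names)
transfer_input_files = input1.dat, input2.dat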

Note that the script specifies the Globus universe. The Globus universe in Condor provides the standard Condor interface to users who wish to start Globus jobs from Condor. Each job queued in the submit file is translated into a Globus RSL string, which is passed as the argument to the globusrun program.
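For illustration, the single-process job above corresponds roughly to a globusrun invocation such as the following sketch (the RSL that Condor-G actually generates contains additional attributes):

% globusrun -r gatekeeper.lonestar.xsede.org/jobmanager \
    '&(executable=/home/ncsa/jdoe/single/a.out)(arguments=100 210)'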

The globusrsl attribute specifies arguments, in the Globus RSL language, that are passed directly to Globus; see the Globus RSL documentation for the full set of attributes that can be specified. In the script above, a project is specified. The projectid corresponds to an XSEDE allocation identifier, which is sometimes called a charge number or charge account. To find it through the User Portal, look on the MyXSEDE tab under Allocations/Usage and click the "Show Project Details" link to expose the value.
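Several RSL relations can be combined in a single globusrsl value; for example (the TG-style project identifier is illustrative, and maxWallTime is the GRAM attribute for the wall-clock limit in minutes):

globusrsl = &(project=TG-ABC123456)(maxWallTime=60)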

Create an MPI Job

The Condor submit script for an MPI job is quite similar to the previous script.

# Lines beginning with # symbol are comments

# Submissions to XSEDE must be through the globus universe
universe = globus

# Name of executable to run. Needs full path; ~ does not work.
executable = /home/ncsa/jdoe/mpi/a.out

# Command line arguments
arguments = 100 210

# false means that executable is already on remote machine
# true means to copy the executable from the local machine
# to the remote
transfer_executable = false

# Where to submit the job - the SGE jobmanager on Lonestar at TACC
globusscheduler=gatekeeper.lonestar.xsede.org/jobmanager-sge

# Set up names for standard output and error and log files
output = condor1.out
error = condor1.err
log = condor1.log

# The following line is what makes it an mpi job
globusrsl = &(jobType=mpi)(count=4)

# The following line is required. It is the command that
# actually submits this to the Condor-G queue.
queue

The main differences are that the globusrsl attribute specifies that the executable is an MPI program (via the (jobType=mpi) expression) and that four MPI processes should be started (via the (count=4) expression), and that the globusscheduler attribute specifies the SGE jobmanager. That is, the job is submitted to the SGE cluster scheduler on the remote system for execution on compute nodes.
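Note that only the last globusrsl line in a submit file takes effect, so to charge an MPI job to a specific allocation, combine the project with the MPI attributes in one expression:

globusrsl = &(project=<projectid>)(jobType=mpi)(count=4)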

The output and log files will be located in the directory from which you submit the job. The log file contains information from Condor-G about the job. Check there first for error messages.
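For example, the following are useful first diagnostics (the job ID is illustrative; condor_q -analyze explains why a job is not running):

% tail condor1.log        # most recent events recorded for the job
% condor_q -analyze 4.0   # explain why a job is idle or held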

Submit a Condor-G Job

Pick one of the example scripts and save it in a file named condor1. Then submit it to the Condor-G queue:

% condor_submit condor1
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 4.

Check Job Status

% condor_q

-- Submitter: ncsa-box1.ncsa.uiuc.edu : <141.142.65.2:1535> : ncsa-box1.ncsa.uiuc.edu

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 jdoe 9/6 09:06 0+00:00:00 I 0 0.0 a.out

1 jobs; 1 idle, 0 running, 0 held

Cancel a Job

To cancel a job, run the condor_rm command on the job that you want to cancel. Get the job ID from the condor_q output. The following example cancels the job submitted above.

% condor_rm 4.0
Job 4.0 marked for removal

% condor_q

-- Submitter: ncsa-box1.ncsa.uiuc.edu : <141.142.65.2:1535> : ncsa-box1.ncsa.uiuc.edu

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 jdoe 9/6 09:06 0+00:00:00 X 0 0.0 a.out

0 jobs; 0 idle, 0 running, 0 held 

Wait a couple of minutes, then check again:

% condor_q

-- Submitter: ncsa-box1.ncsa.uiuc.edu : <141.142.65.2:1535> : ncsa-box1.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

Condor DAGMan

A DAGMan (Directed Acyclic Graph Manager) script manages a set of tasks, each a single Condor job, and enforces order-of-execution dependencies among them.

Example DAGMan Script

In this example, there are three Condor-G submit scripts: staging.condor, setup.condor, and exec.condor. These jobs must run in this order. The following DAGMan script expresses the dependencies.

Job stage staging.condor
Job setup setup.condor
Job run exec.condor

parent stage child setup
parent setup child run

To submit a DAG file named my_job.dag:

condor_submit_dag my_job.dag
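DAGMan itself runs as a Condor job (condor_dagman) and records its progress in a log named after the DAG file, so you can monitor the DAG like this:

% condor_q                      # shows the condor_dagman job plus any running node jobs
% tail my_job.dag.dagman.out    # DAGMan's progress log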

Condor-G with Matchmaking

This service allows Condor-G to match jobs to XSEDE resources according to criteria based on both job and cluster preferences. Read the Condor-G with Matchmaking User Guide to learn how to use this service. It has step-by-step instructions to help you write scripts that dynamically select multiple XSEDE resources on which to run.