Execute Pegasus Jobs on Expanse

sdsc
hpc
Author

Andrea Zonca

Published

September 9, 2025

What is Pegasus?

Pegasus is a workflow management system that helps scientists and engineers execute complex computational workflows. It maps a user’s abstract workflow onto available distributed resources, manages data, and handles execution failures, making it easier to run scientific applications on high-throughput computing (HTC) systems like HTCondor.

Pegasus supports different data staging mechanisms, primarily sharedfs and condorio. The sharedfs mode is used when the head node and all worker nodes share a common file system, allowing jobs to directly access data. However, there is currently a bug that prevents sharedfs from working as expected. In contrast, the condorio mode is designed for environments where worker nodes do not share a file system, relying on HTCondor’s built-in file transfer capabilities for all data I/O. Since Pegasus 5.0, condorio is the default. Our current setup on Expanse, utilizing HTCondor Annex, operates in condorio mode, leveraging Condor’s efficient data transfer for distributed execution.
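As a rough illustration of how the staging mode is selected, the sketch below writes a minimal pegasus.properties file that sets pegasus.data.configuration. The helper name write_pegasus_properties is hypothetical, and nonsharedfs is a third mode documented by Pegasus that this post does not cover. The example workflow used later generates its own configuration, so this only shows where the setting lives.

```shell
# Hypothetical helper: write a minimal pegasus.properties that pins the
# data staging mode. Only the three documented modes are accepted.
write_pegasus_properties() {
    local target_dir="$1"
    local mode="${2:-condorio}"
    case "$mode" in
        sharedfs|nonsharedfs|condorio) ;;
        *) echo "unknown data staging mode: $mode" >&2; return 1 ;;
    esac
    mkdir -p "$target_dir"
    # pegasus.data.configuration is the property that selects the mode
    printf 'pegasus.data.configuration = %s\n' "$mode" > "$target_dir/pegasus.properties"
}
```

Calling write_pegasus_properties . condorio would create the file in the current directory. Rejecting unknown values up front avoids submitting a workflow with a mistyped mode.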

Accessing Pegasus on ACCESS

You can access a hosted version of Pegasus through ACCESS. You will need an existing ACCESS account.

  1. Go to https://support.access-ci.org/tools/pegasus.
  2. Click on “Local shell access” to get a terminal.

Setting up HTCondor Annex on Expanse

We will follow the documentation for HTCondor Annex, specifically the steps outlined in https://access-ci.atlassian.net/wiki/spaces/ACCESSdocumentation/pages/564887666/HTCondor+Annex.

1. Generate SSH Key

First, generate an SSH key specifically for the annex:

ssh-keygen -f ~/.ssh/annex

2. Configure SSH

Add the following configuration to your ~/.ssh/config file. This tells SSH to use the newly generated key for Expanse.

# Replace MYUSERNAME with your Expanse username
Host expanse.sdsc.edu *.expanse.sdsc.edu
   User MYUSERNAME
   IdentityFile ~/.ssh/annex

Permissions: Ensure your ~/.ssh/config file is readable and writable only by your user, otherwise SSH may refuse to use it:

chmod 600 ~/.ssh/config
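The configuration and permission steps can be scripted so that running them twice does not duplicate the Host block. This is a minimal sketch; the function name add_expanse_host_block is hypothetical, and it assumes the block is identified by a line starting with "Host expanse.sdsc.edu".

```shell
# Hypothetical helper: append the Expanse Host block to an SSH config file
# only if it is not already there, then restrict the file to its owner.
add_expanse_host_block() {
    local config_file="$1"
    local username="$2"
    mkdir -p "$(dirname "$config_file")"
    touch "$config_file"
    if ! grep -q '^Host expanse\.sdsc\.edu' "$config_file"; then
        # "~" is not expanded inside a here-document, so it is written literally
        cat >> "$config_file" <<EOF
Host expanse.sdsc.edu *.expanse.sdsc.edu
   User $username
   IdentityFile ~/.ssh/annex
EOF
    fi
    chmod 600 "$config_file"
}
```

For example, add_expanse_host_block ~/.ssh/config MYUSERNAME would apply it to your real configuration, with MYUSERNAME replaced by your Expanse username.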

3. Copy SSH Key to Expanse

Copy your public SSH key to Expanse. You will be prompted for your password and MFA code.

ssh-copy-id -i ~/.ssh/annex.pub MYUSERNAME@expanse.sdsc.edu

4. Create a Sample HTCondor Job

Before creating an annex, HTCondor requires a job to execute. Create a file named many_hostname.sub with the following content:

executable      = /bin/hostname
output          = out.$(Cluster).$(Process)
error           = err.$(Cluster).$(Process)
log             = log.$(Cluster)

# 1 core per task so the partitionable slot can split into many tasks
request_cpus    = 1
request_memory  = 512MB
request_disk    = 100MB

# Keep these jobs on the annex
+MayUseAWS      = False
# Change "zonca" to the annex name you will pass to "htcondor annex create"
requirements    = (AnnexName == "zonca")

queue 128
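Because the annex name in the requirements line has to agree with the name used when the annex is created, it can help to generate the submit file from a single variable. The sketch below does that; write_hostname_submit is a hypothetical helper, and its defaults (128 jobs, many_hostname.sub) simply mirror the file above.

```shell
# Hypothetical helper: write the sample submit file for a given annex name.
write_hostname_submit() {
    local annex_name="$1"
    local job_count="${2:-128}"
    local submit_file="${3:-many_hostname.sub}"
    # \$ keeps HTCondor macros such as $(Cluster) out of shell expansion
    cat > "$submit_file" <<EOF
executable      = /bin/hostname
output          = out.\$(Cluster).\$(Process)
error           = err.\$(Cluster).\$(Process)
log             = log.\$(Cluster)

request_cpus    = 1
request_memory  = 512MB
request_disk    = 100MB

# Keep these jobs on the annex
+MayUseAWS      = False
requirements    = (AnnexName == "$annex_name")

queue $job_count
EOF
}
```

Running write_hostname_submit "$USER" produces a file whose AnnexName matches an annex created with $USER as its name.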

5. Submit the Sample Job

Submit the job using condor_submit:

condor_submit many_hostname.sub

6. Create the HTCondor Annex

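Now you can create the HTCondor annex. Set PROJECT_ID to the ID of your allocation on Expanse. The command below uses $USER as the annex name; that name must match the AnnexName in the requirements line of many_hostname.sub, otherwise the sample jobs will not match the annex.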

export PROJECT_ID=YOUR_ALLOCATION_ID # Set the ID of your allocation on Expanse
htcondor annex create --nodes 1 --lifetime 3600 --project $PROJECT_ID $USER compute@expanse
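The lifetime value of 3600 corresponds to one hour, assuming --lifetime is expressed in seconds. As a small sketch, the hypothetical helper below converts hours to seconds and prints the resulting command without running it, so it can be reviewed first.

```shell
# Hypothetical helper: print (not run) an annex creation command with the
# lifetime given in hours. Assumes --lifetime is expressed in seconds.
annex_create_command() {
    local nodes="$1"
    local hours="$2"
    local project="$3"
    local annex_name="$4"
    local lifetime=$((hours * 3600))
    echo "htcondor annex create --nodes $nodes --lifetime $lifetime --project $project $annex_name compute@expanse"
}

# Example: a two-hour, single-node annex named "myannex" (placeholder values)
annex_create_command 1 2 YOUR_ALLOCATION_ID myannex
```

The example call prints a command with --lifetime 7200. Printing rather than executing keeps the helper safe to experiment with before any allocation time is spent.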

Extending or Adding Resources to the HTCondor Annex

To extend the lifetime or add more nodes to an existing HTCondor annex, use the htcondor annex add command:

htcondor annex add --project $PROJECT_ID --nodes 1 --lifetime 3600 $USER compute@expanse

Monitoring and Output

To check the status of your HTCondor annex, use:

htcondor annex status $USER

During execution, you can also log in to Expanse and monitor the job using squeue:

squeue -u $USER

Once the jobs have completed, your submission directory will contain one out.* file per queued job (128 in this example). Each file contains the hostname of the machine where that job ran. Since we requested only 1 node for the annex, all out.* files will likely contain the same hostname.
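To confirm where the jobs ran without opening each file, the output files can be tallied by hostname. This is a minimal sketch; summarize_job_hosts is a hypothetical helper that assumes every out.* file holds exactly one hostname.

```shell
# Hypothetical helper: count how many jobs ran on each host by reading
# every out.* file in a directory (one hostname per file).
summarize_job_hosts() {
    local output_dir="${1:-.}"
    # uniq -c prefixes each distinct hostname with its number of occurrences
    cat "$output_dir"/out.* | sort | uniq -c | sort -rn
}
```

Run summarize_job_hosts in the submission directory. With a single-node annex, the expectation is one line whose count equals the number of completed jobs.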

Running Jobs with Pegasus

Once the HTCondor Annex is running, you can launch Pegasus workflows. Special thanks to Karan Vahi for creating the example Expanse workflow.

  1. Clone the Pegasus example repository. In the Pegasus access machine terminal, clone the example workflow repository:

     git clone https://github.com/pegasus-isi/expanse-single-job-wf
     cd expanse-single-job-wf

  2. Launch the workflow. After making sure that the annex is running, launch the workflow with:

     python expanse_hostname.py

     This command creates all the necessary workflow configuration YAML files in the current folder and submits the job to be executed on Expanse. The workflow consists of a single job running hostname. The output of this job is transferred back and can be accessed in output/hostname.out.
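Since the result is transferred back asynchronously, one simple way to wait for it is to poll for the output file. The sketch below is only an illustration; wait_for_file is a hypothetical helper, the one-second polling interval is arbitrary, and it is not a replacement for Pegasus's own monitoring tools.

```shell
# Hypothetical helper: poll until a file exists and is non-empty, then print
# it. Gives up with a non-zero status after the timeout (in seconds).
wait_for_file() {
    local file_path="$1"
    local timeout_seconds="${2:-600}"
    local waited=0
    while [ ! -s "$file_path" ]; do
        if [ "$waited" -ge "$timeout_seconds" ]; then
            echo "timed out waiting for $file_path" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    cat "$file_path"
}
```

From the workflow directory, wait_for_file output/hostname.out 900 would wait up to 15 minutes and then print the hostname of the Expanse node that ran the job.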