What is Pegasus?
Pegasus is a workflow management system that helps scientists and engineers execute complex computational workflows. It maps a user’s abstract workflow onto available distributed resources, manages data, and handles execution failures, making it easier to run scientific applications on high-throughput computing (HTC) systems like HTCondor.
Pegasus supports different data staging mechanisms, primarily sharedfs
and condorio
. The sharedfs
mode is used when the head node and all worker nodes share a common file system, allowing jobs to directly access data. However, there is currently a bug that prevents sharedfs
from working as expected. In contrast, the condorio
mode is designed for environments where worker nodes do not share a file system, relying on HTCondor’s built-in file transfer capabilities for all data I/O. Since Pegasus 5.0, condorio
is the default. Our current setup on Expanse, utilizing HTCondor Annex, operates in condorio
mode, leveraging Condor’s efficient data transfer for distributed execution.
Accessing Pegasus on ACCESS
You can access a hosted version of Pegasus through ACCESS. You will need an existing ACCESS account.
- Go to https://support.access-ci.org/tools/pegasus.
- Click on “Local shell access” to get a terminal.
Setting up HTCondor Annex on Expanse
We will follow the documentation for HTCondor Annex, specifically the steps outlined in https://access-ci.atlassian.net/wiki/spaces/ACCESSdocumentation/pages/564887666/HTCondor+Annex.
1. Generate SSH Key
First, generate an SSH key specifically for the annex:
ssh-keygen -f ~/.ssh/annex
2. Configure SSH
Add the following configuration to your ~/.ssh/config
file. This tells SSH to use the newly generated key for Expanse.
Host expanse.sdsc.edu *.expanse.sdsc.edu
User MYUSERNAME # Replace MYUSERNAME with your Expanse username
IdentityFile ~/.ssh/annex
Permissions: Ensure your ~/.ssh/config
file has the correct permissions (read-only for your user) to prevent errors:
chmod 600 ~/.ssh/config
3. Copy SSH Key to Expanse
Copy your public SSH key to Expanse. You will be prompted for your password and MFA code.
ssh-copy-id -i ~/.ssh/annex.pub MYUSERNAME@expanse.sdsc.edu
4. Create a Sample HTCondor Job
Before creating an annex, HTCondor requires a job to execute. Create a file named many_hostname.sub
with the following content:
executable = /bin/hostname
output = out.$(Cluster).$(Process)
error = err.$(Cluster).$(Process)
log = log.$(Cluster)
# 1 core per task so the partitionable slot can split into many tasks
request_cpus = 1
request_memory = 512MB
request_disk = 100MB
# Keep these jobs on the annex
+MayUseAWS = False
requirements = (AnnexName == "zonca") # You can change "zonca" to your desired annex name
queue 128
5. Submit the Sample Job
Submit the job using condor_submit
:
condor_submit many_hostname.sub
6. Create the HTCondor Annex
Now you can create the HTCondor annex. Remember to replace MYUSERNAME
with your Expanse username and set your PROJECT_ID
.
export PROJECT_ID=YOUR_ALLOCATION_ID # Set the ID of your allocation on Expanse
htcondor annex create --nodes 1 --lifetime 3600 --project $PROJECT_ID $USER compute@expanse
Extending or Adding Resources to the HTCondor Annex
To extend the lifetime or add more nodes to an existing HTCondor annex, use the htcondor annex add
command:
htcondor annex add --project $PROJECT_ID --nodes 1 --lifetime 3600 $USER compute@expanse
Monitoring and Output
To check the status of your HTCondor annex, use:
htcondor annex status $USER
During execution, you can also log in to Expanse and monitor the job using squeue
:
squeue -u $USER
Once the job is completed, it will create many out.*
files in your submission directory. Each of these files will contain the hostname of the machine where that specific job ran. Since we requested only 1 node for the annex in this example, all out.*
files will likely contain the same hostname.
Running Jobs with Pegasus
Once the HTCondor Annex is running, you can launch Pegasus workflows. Special thanks to Karan Vahi for creating the example Expanse workflow.
Clone the Pegasus example repository: In the Pegasus access machine terminal, clone the example workflow repository:
bash git clone https://github.com/pegasus-isi/expanse-single-job-wf cd expanse-single-job-wf
Launch the workflow: After making sure that the annex is running, launch the workflow with:
bash python expanse_hostname.py
This command will create all the necessary workflow configuration YAML files in the current folder and submit the job to be executed on Expanse. The workflow consists of a single job runninghostname
. The output of this job will be transferred back and can be accessed inoutput/hostname.out
.