


Flash scratch storage - what happens if job runs out of walltime?

Hi all,

I currently have a trial allocation on Expanse and I am trying to get my software (ORCA) configured as well as possible before I run some benchmarks.

Often I am running restartable jobs that run out of walltime and need to be resubmitted. I would like to use the SSD flash scratch storage to hold my running jobs, i.e. the storage accessed under "/scratch/$USER/job/$SLURM_JOB_ID". The guide specifies that this is only accessible during a job run. After the job is done, whether it finishes successfully or fails, execution moves on to the next command in my Slurm batch script, which copies all of the data back to my home directory. But if the job runs out of walltime while ORCA is still running, that copy command never executes. I was wondering what I can do to make sure I am not wasting the SUs for that job by losing all of the data.
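
For context, my submission script currently looks roughly like the sketch below (the module name, input file name, and resource requests are placeholders rather than my exact setup):

    #!/bin/bash
    #SBATCH --partition=compute
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=48:00:00

    # Placeholder module name; my real environment setup differs.
    module load orca

    JOBDIR="/scratch/$USER/job/$SLURM_JOB_ID"

    # Stage the input onto the node-local flash scratch.
    cp "$SLURM_SUBMIT_DIR"/job.inp "$JOBDIR"/
    cd "$JOBDIR"

    # ORCA runs here; if the job hits its walltime, nothing after this line executes.
    "$(which orca)" job.inp > job.out

    # Copy everything back to my home/submit directory. This is the step that
    # gets skipped when the job is killed for exceeding walltime.
    cp -r "$JOBDIR"/* "$SLURM_SUBMIT_DIR"/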

Other clusters have an "orphan" folder where this lost data is sometimes held for a time. Do XSEDE clusters, or Expanse in particular, have an option like this?

Alternatively, I have used epilog scripts before with TORQUE, and I know Slurm also supports epilog scripts. Will an epilog script have access to that scratch folder, and if so, can I use one to copy the data back so that the copy always happens, whether or not the job runs out of walltime?
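
In case it helps frame the question, the other idea I have been considering (instead of an epilog) is Slurm's --signal option plus a shell trap, so that the batch script receives a signal shortly before the time limit and can do the copy itself. A rough sketch is below; the 300-second grace period and the choice of SIGUSR1 are guesses on my part, not anything I found in the Expanse documentation. Would something like this be the recommended approach?

    #!/bin/bash
    #SBATCH --time=48:00:00
    # Ask Slurm to send SIGUSR1 to the batch shell about 300 s before the time
    # limit; the "B:" prefix delivers the signal to the batch script itself rather
    # than only to the job steps. (300 s is just a guess at a reasonable margin.)
    #SBATCH --signal=B:USR1@300

    JOBDIR="/scratch/$USER/job/$SLURM_JOB_ID"

    # On SIGUSR1, copy whatever is currently in flash scratch back before the job
    # is killed, even though ORCA may still be writing files at that point.
    copy_back() {
        cp -r "$JOBDIR"/* "$SLURM_SUBMIT_DIR"/
        exit 0
    }
    trap copy_back USR1

    cd "$JOBDIR"

    # Run ORCA in the background so the shell can handle the signal, then wait.
    "$(which orca)" job.inp > job.out &
    wait

    # Normal-completion copy-back.
    copy_back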