


Help with `srun nvidia-smi` for power measurements

I'm currently writing a script to get power measurements from the NVIDIA V100 GPU on Expanse-GPU (SDSC). Of the three possible methods (`nvidia-smi`, NVML, and CUPTI), `nvidia-smi` seems to have the lowest barrier to entry because it ships with the NVIDIA drivers.
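
For concreteness, this is the kind of one-shot query I mean (the field names come from `nvidia-smi --help-query-gpu`):

```bash
# One-shot reading of the board power draw and its configured limit, in watts.
nvidia-smi --query-gpu=name,power.draw,power.limit --format=csv
```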

Currently I have the following three concerns:

1. How to deal with Slurm.
On Expanse-GPU, after getting an interactive node on the `gpu-shared` partition (or any other Expanse-GPU partition, for that matter), I can only run `srun nvidia-smi`, not `nvidia-smi` directly. This suggests that either Expanse has a three-tiered node hierarchy (login, batch, compute) like Summit, or the environment isn't set up properly outside of `srun`. Which of these two scenarios is true? (A diagnostic I plan to run is sketched after this list.)

2. How to spawn and kill `nvidia-smi` properly.
This is a technical question: I'm not sure how to reliably start and stop a background `nvidia-smi` sampler from within a Slurm job script, and I am only mildly familiar with `pidof` and `kill`. (A draft of such a script is sketched after this list.)

3. How to set the polling interval appropriately.
This is more of a conceptual question. In quantum physics, a measurement changes the system, and something similar applies here, since polling the GPU with `nvidia-smi` is itself work for the system. If the polling frequency is too high, we might incur a performance cost; if it is too low, the data will be inaccurate, and the program being analyzed might even finish entirely within a single polling interval. How can I determine an appropriate polling interval? (A sanity check I have in mind is sketched after this list.)
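
For (1), a diagnostic I plan to run inside the interactive allocation, to see whether `srun` is landing on a different node or just fixing up the environment (plain shell, nothing Expanse-specific assumed):

```bash
# Where does the shell itself run, and can it see the GPU?
hostname
echo "$CUDA_VISIBLE_DEVICES"
nvidia-smi

# Where does an srun-launched step run?
srun hostname
srun nvidia-smi
```

If the two `hostname` outputs differ, `srun` is launching onto a separate compute node (the Summit-like scenario); if they match, the direct failure is presumably an environment or cgroup issue on the same node.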
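
For (2), the kind of job script I have in mind. This is a minimal sketch: the `#SBATCH` directives and `./my_gpu_program` are placeholders to be adjusted for Expanse's actual account and partition settings, and it assumes Expanse's Slurm is new enough that `srun --overlap` lets the sampler step share the GPU with the main step.

```bash
#!/bin/bash
#SBATCH --partition=gpu-shared
#SBATCH --gpus=1
#SBATCH --time=00:30:00

# Start nvidia-smi in the background, logging timestamped power draw once per second.
srun --overlap nvidia-smi --query-gpu=timestamp,power.draw \
     --format=csv,noheader -l 1 > power_log.csv &
SMI_PID=$!          # $! is the PID of the background srun, so no pidof needed

# Run the workload being measured.
srun --overlap ./my_gpu_program

# Stop the sampler once the workload finishes.
kill "$SMI_PID"
wait "$SMI_PID" 2>/dev/null
```

Capturing `$!` right after launching the background job seems to avoid the `pidof`/`kill` guesswork, since `pidof nvidia-smi` could match someone else's process on a shared node.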
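
For (3), the sanity check I have in mind: oversample once at a deliberately fine interval, then look at how fast `power.draw` actually varies and how many samples one run of the program covers. A sketch, reusing the kill-the-sampler pattern from the script above and assuming the installed `nvidia-smi` supports `--loop-ms`:

```bash
# Oversample at 100 ms to see how quickly power.draw really changes.
srun --overlap nvidia-smi --query-gpu=timestamp,power.draw \
     --format=csv,noheader --loop-ms=100 > fine_log.csv &

# ... run the workload, then kill the sampler as in the script above ...

# Count samples and average power; too few samples per run means the
# interval is too coarse relative to the program's runtime.
awk -F', ' '{gsub(/ W/, "", $2); sum += $2; n++}
            END {print n " samples, average " sum/n " W"}' fine_log.csv
```

The rough rule I'm considering: the interval should be much shorter than the program's runtime (so a run spans many samples) and no finer than the rate at which the power reading itself updates, since beyond that, polling just adds overhead without new information.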

I have also submitted ticket #21526 to the Expanse help desk.