SubmitIt Offloading

SubmitIt is a lower-level library than Dask that you can also use to offload parts of your notebook to the SLURM queue. Rather than managing a cluster, you submit Python functions directly to the SLURM queue, which gives you more control over each job. For more information, see the SubmitIt PyPI page.

import submitit

# Define where we'd like submitit to place our logs
executor = submitit.AutoExecutor(folder='~/submitit_logs')

# Define the parameters of our slurm job
# Just like Dask's job_extra_directives, slurm_additional_parameters lets us specify things that submitit doesn't support directly
executor.update_parameters(timeout_min=30, mem_gb=128, cpus_per_task=16, slurm_partition="BigCats", slurm_additional_parameters={"gres": "gpu:1"})

We can submit our function to the cluster with the executor.submit method. This returns a future, and calling future.result() waits for the job to finish and returns the function's return value, just like when we were working with Dask. Because we are offloading to the SLURM queue, print statements will not be visible in the notebook, just as with a Dask SLURMCluster; they end up in the job's log files instead. However, the full stack trace is still shown when an error or assertion is raised within the function.

def client_test(input1, input2, error=False, test=False):
    # Force an error
    if error:
        assert 0 == 1
    
    # This print runs on the compute node, so it will not show up in the notebook
    if test:
        print("When running in a local cluster you can see print statements!")

    return input1, input2

future = executor.submit(client_test, "input1", "input2", test=True)
future.result()

('input1', 'input2')
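Because the pattern above only relies on submit() and result(), several independent jobs can be queued before waiting on any of them. The following is a minimal sketch rather than part of SubmitIt: `submit_all` and `echo` are hypothetical names, and the standard library's ThreadPoolExecutor is used as a local stand-in on the assumption that the submitit executor configured above can be passed in its place.

```python
from concurrent.futures import ThreadPoolExecutor


def submit_all(executor, fn, arg_tuples):
    """Submit fn once per argument tuple, then collect results in order.

    Hypothetical helper: it only assumes executor.submit(...) returns an
    object with a .result() method.
    """
    # Queue every job first so none of them waits on an earlier result...
    futures = [executor.submit(fn, *args) for args in arg_tuples]
    # ...then block on each result in submission order.
    return [future.result() for future in futures]


def echo(input1, input2):
    # Stand-in for the real work; mirrors client_test's return value.
    return input1, input2


# Local stand-in for the SLURM-backed executor, so the sketch runs anywhere.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = submit_all(pool, echo, [("a", "b"), ("c", "d")])

print(results)
```

Submitting everything before calling result() matters: calling result() immediately after each submit would run the jobs one at a time.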

future = executor.submit(client_test, "input1", "input2", error=True)
future.result()

FailedJobError: Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
  File "/apps/mambaforge/envs/dsks_2024.06/lib/python3.10/site-packages/submitit/core/submission.py", line 55, in process_job
    result = delayed.result()
  File "/apps/mambaforge/envs/dsks_2024.06/lib/python3.10/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/tmp/ipykernel_1235436/858968069.py", line 4, in client_test
AssertionError

----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
  - /home/mhar0048/submitit_logs/6952_0_log.err
  - /home/mhar0048/submitit_logs/6952_0_log.out
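The paths at the end of the error message suggest that each job writes `<job id>_<task>_log.out` and `<job id>_<task>_log.err` into the log folder. Assuming that naming holds for your SubmitIt version, print output that never reached the notebook can be read back with a small helper along these lines. `read_job_logs` is a hypothetical name, not a SubmitIt function, and the demo writes a fake log file to a temporary folder so that it runs without a cluster.

```python
import tempfile
from pathlib import Path


def read_job_logs(folder, job_id, task=0):
    """Return (stdout, stderr) text for a job; None where a log file is missing.

    Assumes the "<job_id>_<task>_log.out" / ".err" naming seen in the
    error message above.
    """
    base = Path(folder).expanduser()
    logs = []
    for suffix in ("out", "err"):
        path = base / f"{job_id}_{task}_log.{suffix}"
        logs.append(path.read_text() if path.exists() else None)
    return tuple(logs)


# Demo against a temporary folder holding a fake stdout log.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "6952_0_log.out").write_text("hello from the job\n")
    stdout, stderr = read_job_logs(tmp, 6952)

print(stdout)
```

On the cluster this would be called with the real log folder and the job's ID, for example read_job_logs('~/submitit_logs', future.job_id). The error message also points to job.stdout() and job.stderr() as a way to get the same text.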

Note that because we are interacting directly with the queue, there is no cluster to clean up or shut down when using SubmitIt.

If we have more complex requirements, we can be more precise about what we request, for example a particular GPU type or QoS, by passing the corresponding SLURM options through slurm_additional_parameters.

executor.update_parameters(timeout_min=30, mem_gb=128, cpus_per_task=16, slurm_partition="BigCats", slurm_additional_parameters={"gres": "gpu:3g.20gb:1"})
executor.submit(client_test, "input1", "input2", test=True).result()

/apps/mambaforge/envs/dsks_2024.06/lib/python3.10/site-packages/submitit/auto/auto.py:23: UserWarning: Setting 'additional_parameters' is deprecated. Use 'slurm_additional_parameters' instead.
  warnings.warn(f"Setting '{arg}' is deprecated. Use '{new_arg}' instead.")

('input1', 'input2')
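The gres strings used above follow SLURM's generic-resource syntax, "gpu:<count>" or "gpu:<type>:<count>". As a rough sketch, a small helper can assemble the dictionary passed to slurm_additional_parameters. `gpu_request` is a hypothetical name, the QoS value is a placeholder, and it is an assumption that a "qos" key is forwarded to SLURM in the same way as "gres".

```python
def gpu_request(count=1, gpu_type=None, qos=None):
    """Build a dict intended for slurm_additional_parameters.

    Hypothetical helper; valid gpu_type and qos values are site-specific.
    """
    # SLURM generic-resource syntax: "gpu:<count>" or "gpu:<type>:<count>".
    gres = f"gpu:{gpu_type}:{count}" if gpu_type else f"gpu:{count}"
    params = {"gres": gres}
    if qos is not None:
        # Assumption: passed through like any other extra sbatch option.
        params["qos"] = qos
    return params


print(gpu_request())
print(gpu_request(gpu_type="3g.20gb", qos="example_qos"))
```

The result would then be passed as executor.update_parameters(slurm_additional_parameters=gpu_request(gpu_type="3g.20gb")), which keeps the gres formatting in one place if several notebooks request different GPU types.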

Comparison with Dask

As you can see, we've implemented the same use case with both Dask and SubmitIt. This raises the question: which should you use for your research?

Both packages have pros and cons. On the whole, Dask is better suited to work that can be broken into many small tasks, such as preprocessing your data. SubmitIt, on the other hand, is better suited to offloading one larger job at a time, such as a training run.

Of the two, Dask is the more mature package, with more flexibility and more complete documentation. If all you need is a simple way to offload a function, though, it is often far more complex than necessary.