Troubleshooting

Why do I still have to wait in queue?

Directly attaching GPU compute to notebooks leads to very low utilisation, since the GPU sits idle while you’re debugging your code. By using a queue, we can serve more researchers at once with greater efficiency.

If I still have to wait in queue, how is this better than using a traditional HPC environment?

MLeRP is designed around the idea that a job should be about the size of a cell in a Jupyter notebook. With that in mind, the queue has been optimised for short jobs: the maximum wall time of a job in the MLeRP cluster is 30 minutes. Executing a cell won’t always be immediate, but your job should start up quickly. If you need to run something longer than this, checkpoint your code and split it into multiple jobs, as in the sketch below. If your code is mature enough that it makes more sense to run a multi-hour or multi-day job, consider using a traditional HPC service. Contact us at mlerphelp@monash.edu if your jobs aren’t starting promptly, or if your use case isn’t covered by our platform and you think it should be.
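
As a rough sketch of that checkpointing pattern (the file path and the do_work function are placeholders for your own storage location and code), each job loads the previous state, does a short chunk of work, and saves the state again so the next job can continue:

    import os
    import pickle

    CHECKPOINT = "checkpoint.pkl"  # placeholder path, e.g. somewhere in your userdata directory

    def do_work(state):
        # Placeholder for one short chunk of your real computation.
        state["step"] += 1
        return state

    # Resume from the previous job's checkpoint if there is one.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0}

    state = do_work(state)

    # Save the state so the next job can continue where this one stopped.
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)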

Why Dask?

We looked at a few different options for the MLeRP environment, including SageMaker, Spark and Ray. Ultimately we settled on Dask as the primary tool for interfacing with the queue because:

  • SLURM jobs can be submitted with SLURMCluster whenever HPC is needed (see the sketch after this list)
  • Debugging can be done locally first with LocalCluster, with minimal code changes
  • It has a familiar syntax, as it’s designed as a light wrapper around common libraries like NumPy, Pandas and Scikit-Learn
  • High-level applications can be implemented easily with Scikit-Learn, XGBoost, Skorch and SciKeras
  • Dask can submit any Python function to the SLURM queue, allowing the flexibility for bespoke low-level applications
  • Functions are evaluated lazily, which allows for asynchronous code
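
As a minimal sketch of how the first two points fit together (the partition name, resource requests and walltime are illustrative placeholders, not MLeRP defaults, and older dask_jobqueue versions spell job_extra_directives as job_extra), you debug against a LocalCluster and then swap in a SLURMCluster when you need the GPU queue:

    from dask.distributed import Client, LocalCluster
    from dask_jobqueue import SLURMCluster

    # Debug locally first; the rest of your notebook doesn't change.
    # cluster = LocalCluster()

    # Swap in a SLURMCluster when you need the queue.
    cluster = SLURMCluster(
        queue="gpu",                            # placeholder partition name
        cores=4,
        memory="16GB",
        walltime="00:30:00",                    # within the 30 minute cap
        job_extra_directives=["--gres=gpu:1"],  # ask SLURM for a GPU
    )
    cluster.scale(jobs=1)  # request one worker job from the queue

    client = Client(cluster)  # the rest of the notebook talks to this client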

Will I need to change my code to work with MLeRP?

Yes. Dask unfortunately will not ‘just work’; you will need to make some code changes to use it to interface with the cluster. You will also need to get a sense of how to request resources from the cluster with SLURM.

That said, we believe that submitting jobs through a Python notebook environment will feel natural to researchers who know the Python data science ecosystem, since Dask is designed as a light wrapper around common libraries like NumPy, Pandas and Scikit-Learn.

You will also be able to work with the cluster without needing to convert your experimental notebook code into a script or maintain the environment with modules, as you would with a traditional cluster.
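
To give a feel for the scale of the change, here is a hedged sketch: instead of calling a function directly in a cell, you submit it to the Dask client and collect the result. The mean_of_squares function is just an illustrative stand-in for your own code, and cluster is the one configured earlier.

    import numpy as np
    from dask.distributed import Client

    client = Client(cluster)  # the cluster you configured earlier

    def mean_of_squares(x):
        # Any ordinary Python function can be shipped to a worker.
        return float(np.mean(x ** 2))

    data = np.random.rand(1_000_000)

    # Before: result = mean_of_squares(data)
    # After:  submit the call to the cluster and wait for the result.
    future = client.submit(mean_of_squares, data)
    result = future.result()  # blocks until the worker finishes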

Why aren’t my print statements showing up in my jobs?

Print statements that are executed on remote machines won’t show up in your notebook. If you are using print statements for debugging, consider using a LocalCluster where they will behave as expected.

If you need to record information while code is executing remotely, either pass the information back to the notebook when the function returns so it can be printed there, or log the output to a file.
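
A sketch of both options, reusing the client from earlier (the functions, messages and log file name are placeholders):

    import logging

    def remote_task(x):
        # Option 1: collect messages and return them alongside the result.
        messages = [f"started with x={x}"]
        result = x * 2
        messages.append(f"finished with result={result}")
        return result, messages

    future = client.submit(remote_task, 21)
    result, messages = future.result()
    for line in messages:
        print(line)  # printed in the notebook, not on the worker

    def remote_task_logged(x):
        # Option 2: log to a file on shared storage instead.
        logging.basicConfig(filename="remote_task.log", level=logging.INFO)
        logging.info("started with x=%s", x)
        return x * 2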

For more information about using SLURMCluster, visit Dask’s documentation.

Should I use cluster.scale or cluster.adapt?

We recommend that you use the adapt method while you’re actively developing your code so that you don’t need to worry about cleaning up after yourself. The scale method can be used when you’re ready to run longer tests with higher utilisation.
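
The two calls look like this (the worker counts are placeholders):

    # While developing: workers are requested as needed and released when idle,
    # so you don't hold queue resources while you think or debug.
    cluster.adapt(minimum=0, maximum=4)

    # For longer, higher-utilisation runs: hold a fixed number of worker jobs.
    cluster.scale(jobs=4)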

How do I install my favourite python package?

If you want to control the Python environment, we recommend that you install and maintain a Miniconda environment in your userdata directory.

What is a daemonic process and why can’t I run one?

A daemon is a process that runs in the background. Dask prefers to control all processes so that it can manage them more gracefully if they fail. If you need to take control of the multiprocessing yourself, you can turn this off with LocalCluster(processes=False) and SLURMCluster(nanny=False).
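
A minimal sketch of both, where the SLURMCluster resource requests are placeholders:

    from dask.distributed import LocalCluster
    from dask_jobqueue import SLURMCluster

    # Workers run without a nanny/daemon wrapper, so code that starts its
    # own subprocesses (e.g. a multiprocessing pool) is allowed.
    local_cluster = LocalCluster(processes=False)
    slurm_cluster = SLURMCluster(cores=4, memory="16GB", nanny=False)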