Release Notes #6

Author

Mitchell Hargreaves

Published

February 21, 2024

Hello MLeRP users,

Since we last had an update, we’ve been hard at work refactoring some of the software we use to maintain the cluster, preparing it for any growth to come.

Our main focus has been on the way our provisioning code handles quotas. We had reports from users with group allocations that the code used to meter disk usage wasn’t behaving as expected. To resolve this, we have moved away from our previous ‘user’ and ‘group’ quota system and implemented ‘project’ quotas.

In this new implementation, ‘user’ and ‘group’ allocations are both metered by the disk usage within their allocated directory, rather than by counting usage based on file ownership. We hope this will be less confusing and will address the reports we’ve been getting of some files being counted towards multiple allocations. As before, you will still be able to check the disk usage of all your allocations from your ‘Account Info’ app. If you continue to experience anomalies with your allocated quota, please let us know so we can work through things with you.

Speaking of the ‘Account Info’ page, we’ve added a new feature there that allows you to download your SSH credentials. This will generate and download an SSH key as well as an SSH certificate that you can use to connect to our cluster with any tool of your choice. These credentials have a limited lifespan and will be cycled periodically, requiring you to log back in and redownload them to refresh your access. The main way we expect users to use this feature is to enable Remote Development through IDEs like VS Code. Previously, the only ways to do this were to use our command-line tool, SSOSSH, or to generate your own keys using our terminal app. We hope this will help remove the friction associated with using your favourite tools with our service. If this experimental feature proves stable and useful, we’ll update our documentation to recommend it as the primary method for remotely accessing our services.

We’ve started a rework of our documentation to address some of the common stumbling blocks we’ve noticed users regularly asking us about. This will include new tutorials covering topics like ‘checkpointing’, as well as refining our existing Dask tutorials, which some have found counterintuitive. If there’s a specific section of the documentation that you believe needs attention and you have ideas on how we can clarify it, please let us know.
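Until the checkpointing tutorial lands, the core idea is simply to save your progress periodically so that a job can resume where it left off after an interruption or a wall time limit. Here’s a minimal, framework-agnostic sketch; the file name, loop and save interval are purely illustrative:

```python
import os
import pickle

CHECKPOINT = "checkpoint.pkl"  # illustrative path

# Resume from the last checkpoint if one exists, otherwise start fresh.
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)
else:
    state = {"step": 0, "total": 0}

for step in range(state["step"], 1000):
    state["total"] += step        # stand-in for the real work
    state["step"] = step + 1

    # Save progress every 100 steps so an interrupted job loses little work.
    if state["step"] % 100 == 0:
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)
```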

Our first update to the docs has been to rewrite our outdated Hardware page (formerly the Compute page) to better reflect the current architecture and usage advice. The biggest change you might notice is a description of each QoS that we offer, how they differ, and what they’re each optimised for… which includes the addition of our new Panther QoS! The Panther QoS is restricted to CPU-only jobs, which lets us be much more lenient with the wall time (7 days!). This makes it the perfect choice as the host for a Jupyter notebook that sends larger workloads out to Dask workers. This lets us reserve the Lion QoS for big workloads such as memory-intensive data visualisation or batch processing.
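As a rough sketch of that pattern, a notebook session running on the Panther QoS could spin up its Dask workers on a heavier QoS with dask-jobqueue. The resource numbers and the choice of Lion for the workers below are illustrative only; check the Hardware page for the real limits:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Launched from a notebook hosted on the Panther QoS (CPU only, long wall time).
# Worker jobs are submitted to SLURM with a heavier QoS for the actual compute.
cluster = SLURMCluster(
    cores=8,
    memory="32GB",
    walltime="04:00:00",
    # Older dask-jobqueue releases call this argument `job_extra` instead.
    job_extra_directives=["--qos=lion"],  # illustrative QoS for the workers
)
cluster.scale(jobs=4)  # ask SLURM for four worker jobs

client = Client(cluster)  # heavy computations now run on the workers
```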

Which leads us into our new experimental Strudel2 app - the Batch Job app! This application, which will be backed by the Lion QoS, will allow you to queue scripts that you’ve developed to be picked up by the SLURM scheduler without ever having to use the sbatch command. The application will let you select your compute requirements (number of CPUs, RAM, GPU count and flavour) from a web form and run a script from a path that you specify. At present we support Python and bash scripts, but in the future we also hope to support automatically converting the IPython notebooks you’ve been working on in our Jupyter Lab app through nbconvert. We’d also like to support a method for our power users to paste their SBATCH scripts into a text box in the form for full control. Once we’ve solidified the app’s functionality and ironed out any bugs, we will finalise its documentation along with the other documentation work we’re doing.
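For the curious, the planned notebook conversion would boil down to something like the following nbconvert usage, which turns a notebook into a plain Python script that could then be queued as a batch job (the file names are placeholders):

```python
import nbformat
from nbconvert import PythonExporter

# Read the notebook and export its code cells as a plain Python script.
nb = nbformat.read("my_analysis.ipynb", as_version=4)
source, _ = PythonExporter().from_notebook_node(nb)

with open("my_analysis.py", "w") as f:
    f.write(source)
```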

To support this app, we thought it was important to be able to view the results of your jobs from the Strudel2 web application. We’ve now added a View log button to all job records, running or completed. This feature will be available for all Strudel2 applications, not just the Batch Job app. To keep things a little tidier, Strudel2 logs will now be placed in a dedicated folder rather than filling up your home directory: ~/.strudel2/logs/. Note that this will not apply to any logs generated by Dask jobs, since they are not spawned by Strudel2. If you’d like to control where those land, you can specify a location with the log_directory parameter.
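For example, when creating a Dask cluster with dask-jobqueue you could point the worker job logs somewhere tidy yourself (the location below is just an example):

```python
from dask_jobqueue import SLURMCluster

# Redirect the SLURM job logs for the Dask workers to a dedicated folder
# instead of the directory the cluster was launched from.
cluster = SLURMCluster(
    cores=8,
    memory="32GB",
    log_directory="dask-logs",  # illustrative location
)
```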

Another big change is that we’ve now deprecated the MLeRP-QCIF region, leaving only the MLeRP-Monash region, which will serve all users going forward. If you are an inactive MLeRP-QCIF user, we have taken a copy of your data, which is now hosted in the Monash region. If you’d like to revisit the MLeRP service now that it is more mature, we can make your old QCIF data available to you as part of the provisioning process on request.

Finally, the last change that we want to tell you about is an administrative one. Going forward, we’re changing our help email to mlerphelp@monash.edu. This move away from our Gmail address is largely to address some issues we’ve had with user communications, such as account provisioning notifications, being caught by university spam filters. Any mail sent to our old address should be forwarded on to the new address, so we shouldn’t miss anything in the transition.

As always, we hope these changes will make the experience on our platform a smoother one as we continue to develop features to enable your research.

Regards,
Mitchell Hargreaves