Exploring machine learning frameworks for cloud and supercomputing
In the first blog post of this series we discussed the need for combining modern cloud-based machine learning frameworks with the batch processing pipelines used in high performance computing (HPC) environments. In the second blog post, we looked at the different components of such a combined machine learning framework, and discussed some of the potential challenges. In this final part of the blog post series we will look at existing frameworks for implementing the combined machine learning workflow for HPC environments.
A large proportion of the existing machine learning frameworks are designed to be run in the cloud. Some frameworks are tied to a specific commercial cloud provider, such as Amazon or Google, while others also enable you to run on your own cloud setup, using technologies like Kubernetes or Spark. Metaflow, developed by Netflix, is one example from the former category: it is tied to Amazon’s AWS platform, and is thus not suitable for our purposes. In the latter category, one could mention Kubeflow (Kubernetes), H2O (Kubernetes, Spark or virtual machine cluster), Seldon (Kubernetes), Polyaxon (Kubernetes, Spark or virtual machine cluster) and STACKn (Kubernetes). Also worth mentioning is Valohai, a commercial hosted MLOps service that can use OpenStack or any of the commercial cloud providers as the backend. Combining the Slurm batch scheduling system with Kubernetes container orchestration or the Kubeflow machine learning toolkit is also very relevant to our goal of integrating cloud and supercomputing environments.
In addition to the end-to-end cloud frameworks mentioned above, there are a few lower-level tools worth mentioning. First, MLflow is a popular framework that can be used to track experiments and to package machine learning projects and trained models for easier reuse. DVC is a similar project that builds on the Git version control system and also enables the tracking of datasets. Workflow engines, such as Airflow, Pachyderm or Snakemake, allow building automated pipelines with multiple steps and branching.
While not directly related to machine learning, we should also mention Open OnDemand at this point. It’s a web interface for HPC systems, where users can easily manage their files and launch jobs in the cluster. It can also be used for launching a Jupyter notebook session, for example on a GPU compute node.
In our previous blog post, we identified the main components of a machine learning framework. The framework should include an easy-to-use web browser UI for code development and easy access to existing repositories of models and datasets. The developed ML models and the datasets used should be versioned. There needs to be a way to run batch jobs and to visualise the progress and results of the batch job runs. The framework should also support deploying the developed models for inference.
We also proposed that the interactive part should run on Kubernetes, while the batch jobs would run in the HPC cluster. These would naturally also represent two different user experiences: fast-paced interactive notebook development, and hour-to-hour or day-to-day batch job management.
The interactive environment could be set up as a simple way to launch interactive sessions on a Kubernetes-based cloud environment. A Jupyter Notebook environment would be started, and the appropriate software framework (such as TensorFlow or PyTorch) could be selected from a list of available containers. The datasets to work on could be managed with DVC, with some kind of integration to the graphical user interface. The actual data might be stored in an object storage system such as CSC’s Allas.
Once the developed code has reached a sufficient level of maturity, it could be transferred to the HPC environment. Here the user would be presented with something like the Open OnDemand graphical user interface or easy-to-use command line tools. The code would be launched using the same (or compatible) containers as in the interactive side. Datasets, models and code would all be handled with the appropriate version control systems (for example, git with DVC). In addition, MLflow tracking could be used to track the metrics of different runs and provide a visualisation of the results. Ideally, this might be integrated into the Open OnDemand user interface.
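To make the batch side more concrete, here is a minimal sketch of how a launcher tool might render a Slurm batch script that runs the training code inside a container. The `TrainingJob` class, the `gpu` partition name, the image and script file names, and the choice of Singularity as the container runtime are all assumptions made for illustration; a real site would have its own partition names and container setup.

```python
from dataclasses import dataclass
import shlex


@dataclass
class TrainingJob:
    """Hypothetical description of one batch training run."""
    name: str
    image: str                 # container image file (assumed Singularity .sif)
    script: str                # training script from the versioned code repository
    partition: str = "gpu"     # partition names are site-specific (assumed here)
    gpus: int = 1
    time_limit: str = "02:00:00"


def render_sbatch(job: TrainingJob) -> str:
    """Return the text of a Slurm batch script for the given job."""
    run_cmd = ["singularity", "exec", "--nv", job.image, "python", job.script]
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --partition={job.partition}",
        f"#SBATCH --gres=gpu:{job.gpus}",
        f"#SBATCH --time={job.time_limit}",
        "",
        "# Fetch the dataset version recorded in the repository's DVC files.",
        "dvc pull",
        "",
        shlex.join(run_cmd),
    ]
    return "\n".join(lines) + "\n"


script_text = render_sbatch(
    TrainingJob(name="mnist-train", image="pytorch.sif", script="train.py")
)
print(script_text)
```

The rendered text could be written to a file and submitted with `sbatch`. Generating the script from a small job description keeps the container image and resource requests in one place, which a graphical front end could also fill in.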
One challenge, as mentioned in our previous blog post, is how to run and parameterise notebooks. One option would be to use the Papermill tool for this. Another option is to convert the notebooks to regular Python scripts before launching.
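Both options can be driven from the command line, so a launcher could simply assemble the commands. The sketch below builds one `papermill` invocation per parameter combination using its `-p name value` option, and a `jupyter nbconvert --to script` invocation for the conversion route. The notebook name, the parameter names and the output naming scheme are hypothetical.

```python
import os
from itertools import product


def papermill_commands(notebook, grid):
    """Build one papermill command (as an argument list) per parameter combination."""
    names = sorted(grid)
    stem = os.path.splitext(notebook)[0]
    commands = []
    for run_id, values in enumerate(product(*(grid[n] for n in names))):
        # Each run writes its own executed copy of the notebook.
        cmd = ["papermill", notebook, f"{stem}-run{run_id}.ipynb"]
        for name, value in zip(names, values):
            cmd += ["-p", name, str(value)]
        commands.append(cmd)
    return commands


def nbconvert_command(notebook):
    """Build the command that converts a notebook to a plain script."""
    return ["jupyter", "nbconvert", "--to", "script", notebook]


# Hypothetical notebook and parameter grid.
commands = papermill_commands(
    "train.ipynb", {"learning_rate": [0.01, 0.1], "epochs": [5]}
)
for cmd in commands:
    print(" ".join(cmd))
print(" ".join(nbconvert_command("train.ipynb")))
```

Each argument list could be passed to `subprocess.run` or written into a batch script. With Papermill every run leaves behind an executed notebook, which doubles as a record of that run; the conversion route gives up that record in exchange for a plain script that needs no notebook tooling on the compute node.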
Finally, inference deployment could also be managed easily with Kubernetes: services could be launched using a specific software environment from a Git repository via MLflow Projects, together with a particular version of the trained model from the MLflow Model Registry.
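As a rough illustration, the sketch below generates a Kubernetes Deployment manifest whose container serves one registered model version with `mlflow models serve`. The model name, image name and tracking server URL are hypothetical, and the image is assumed to already contain MLflow and the model's dependencies (hence `--env-manager local`).

```python
import json


def serving_deployment(model_name, model_version, image, tracking_uri,
                       port=8080, replicas=1):
    """Return a Deployment manifest (as a dict) serving one model version."""
    app = f"{model_name}-v{model_version}"
    # MLflow Model Registry URI of the form models:/<name>/<version>.
    model_uri = f"models:/{model_name}/{model_version}"
    container = {
        "name": "model-server",
        "image": image,  # assumed to ship MLflow and the model's dependencies
        "command": ["mlflow", "models", "serve"],
        "args": ["-m", model_uri, "--host", "0.0.0.0", "--port", str(port),
                 "--env-manager", "local"],
        "env": [{"name": "MLFLOW_TRACKING_URI", "value": tracking_uri}],
        "ports": [{"containerPort": port}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": app}},
            "template": {
                "metadata": {"labels": {"app": app}},
                "spec": {"containers": [container]},
            },
        },
    }


# Hypothetical model, image and tracking server.
manifest = serving_deployment(
    "churn-model", 3,
    image="registry.example.org/mlflow-serving:latest",
    tracking_uri="http://mlflow.example.org",
)
print(json.dumps(manifest, indent=2))
```

The JSON output could be applied with `kubectl apply -f`. Pinning the model version in the URI means that rolling out a new model is an ordinary Deployment update, which Kubernetes can perform gradually.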
In this discussion our emphasis has been on UI-driven usage. Most of the components also provide programming APIs, allowing the user to automate workflows and even to build new higher-level services on top of them. However, using the APIs of the individual components directly might not be optimal once they are integrated to function as a whole. The ideal solution would be to implement the frontend of the environment, that is, the web browser UI for code development and access, following the “API first” principle. This way the functionality behind the UI would first be available as a REST interface, and the web UI would be implemented on top of it. This API would provide a powerful way to automate and extend the system, while also hiding the details of the underlying components from the user of the API.
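The “API first” idea can be sketched in a few lines: the functionality lives in a service object, a thin REST-style layer exposes it, and any UI would call only that layer. Everything here is hypothetical, including the `JobService` class, the `/jobs` routes and the job fields; a real service would delegate to Slurm, MLflow and the other components rather than keep jobs in memory.

```python
import json
from dataclasses import dataclass, field
from itertools import count


@dataclass
class JobService:
    """Hypothetical backend facade hiding the underlying components."""
    jobs: dict = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def submit(self, spec: dict) -> dict:
        job_id = next(self._ids)
        # Server-controlled fields come last so the client cannot override them.
        job = {**spec, "id": job_id, "state": "PENDING"}
        self.jobs[job_id] = job
        return job

    def get(self, job_id: int):
        return self.jobs.get(job_id)


def handle(service, method, path, body=None):
    """Map a REST-style request to the service; returns (status, json_text)."""
    parts = [p for p in path.split("/") if p]
    if method == "POST" and parts == ["jobs"]:
        return 201, json.dumps(service.submit(json.loads(body or "{}")))
    if (method == "GET" and len(parts) == 2
            and parts[0] == "jobs" and parts[1].isdigit()):
        job = service.get(int(parts[1]))
        if job is None:
            return 404, json.dumps({"error": "job not found"})
        return 200, json.dumps(job)
    return 404, json.dumps({"error": "unknown route"})


service = JobService()
created_status, created_body = handle(service, "POST", "/jobs", '{"script": "train.py"}')
fetched_status, fetched_body = handle(service, "GET", "/jobs/1")
print(created_status, created_body)
print(fetched_status, fetched_body)
```

Because the web UI and any automation script would go through the same `handle` layer, the backend components could be swapped without changing either of them.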
We started this blog series to search for the ultimate open source framework to bridge cloud and supercomputing and to create a perfect environment for massive machine learning tasks. Some of the available frameworks tick an impressive number of the boxes needed for full-fledged machine learning work, but still fall short when it comes to integration with Slurm or other batch processing systems of the HPC world. Additionally, many open source products only include the core functionality and leave important plugins to their commercially licensed edition. However, we discovered that components already available as open source provide quite a comprehensive set of features, so building the ultimate open source tool would mostly be about writing the glue to combine the different parts. The most labour-intensive part would be providing a good user experience, as it would likely require writing new user-facing components to shield users from the diverse tooling that runs in the backend. Some of these challenges will be tackled while setting up the environment for the upcoming EuroHPC LUMI supercomputer, and we are also looking for additional collaborations to implement the complete vision of an integrated open machine learning environment we drafted here.
Juha Hulkkonen, CSC: The author is a data engineering and machine learning specialist in CSC’s data analytics group, working with machine learning and big data workflows.
Aleksi Kallio, CSC: The author is the manager of CSC’s data analytics group, coordinating development of machine learning and data engineering based services.
Markus Koskela, CSC: The author is a machine learning specialist in CSC’s data analytics group, working with various machine learning applications and computing environments.
Mats Sjöberg, CSC: The author is a machine learning specialist in CSC’s data analytics group, working with various machine learning applications and computing environments.
This blog post is part 3/3 of a blog series written by CSC experts. This blog was originally published at CSC’s website.
Image: Adobe Stock