Workflow Engines Overview¶
Everyone's computing needs are different, so we ensured that quacc is interoperable with a variety of modern workflow management tools. There are 300+ workflow management tools out there, so we can't possibly support them all. Instead, we have focused on a select few that adopt a similar decorator-based approach to defining workflows with substantial support for HPC systems.
Summary¶
Recommendations
Not sure which to choose? In general, we recommend starting with Parsl for most HPC users. For a more feature-rich workflow orchestration platform, we recommend trying Prefect or Covalent depending on your needs. Some additional opinions on the matter:
- Covalent: You want a visual dashboard and are prioritizing the use of distributed compute resources, especially cloud compute.
- Dask: You already are familiar with the Dask ecosystem and are happy to stick with it.
- Parsl: You want to run many workflows as fast as possible on one or more job scheduler-based HPC machines.
- Prefect: You want a visual dashboard with a robust workflow management platform and are familiar with the basic concepts of workflow orchestration.
- Redun: You are running calculations on AWS.
- Jobflow: You are affiliated with the Materials Project or are already using Jobflow and/or FireWorks.
Covalent is a user-friendly workflow management solution from the company Agnostiq.
Pros:
- Excellent visual dashboard for job monitoring
- Easy to use in distributed, heterogeneous compute environments
- Excellent documentation
- Automatic and simple database integration
- The compute nodes do not need to be able to connect to the internet
Cons:
- It requires a centralized server to be running continuously in order to manage the workflows unless using Covalent Cloud
- Support for job scheduler HPC environments is available but not as robust or performant as other solutions
- High-security HPC environments may be difficult to access via SSH with the centralized server approach
- Not as widely used as other workflow management solutions
Dask is a popular parallel computing library for Python. We use Dask Delayed for lazy function execution, Dask Distributed for distributed compute, and (optionally) Dask-Jobqueue for orchestrating the execution on HPC machines.
Pros:
- Extremely popular
- Has native support for running on HPC resources
- It does not involve a centralized server or network connectivity
- Supports adaptive scaling of compute resources
- The dashboard to monitor resource usage is very intuitive
Cons:
- If the Dask cluster dies, there is no mechanism to gracefully recover the workflow history
- Monitoring job progress is more challenging and less detailed than other solutions
- The documentation, while comprehensive, can be difficult to follow given the various Dask components
- Calculations cannot be submitted remotely or across disparate compute resources
Parsl is a workflow management solution out of Argonne National Laboratory, the University of Chicago, and the University of Illinois. It is well-adapted for running on virtually any HPC environment with a job scheduler.
Pros:
- Extremely configurable and deployable for virtually any HPC environment
- Quite simple to define the workflows and run them from a Jupyter Notebook
- Thorough documentation and active user community across academia
- Well-suited for pilot jobs and advanced queuing schemes
- Does not rely on maintaining a centralized server
Cons:
- The number of different terms can be slightly overwhelming to those less familiar with HPC
- Monitoring job progress is more challenging and less detailed than other solutions
- Debugging failed workflows can be difficult
- The pilot job model is often a new concept to many HPC users that takes some time to understand
Prefect is a workflow orchestration tool that is popular in the data engineering community. It has an excellent dashboard for monitoring workflows.
Pros:
- Quite popular among the data engineering community
- Excellent web-based dashboard for monitoring workflow progress
- The free version of Prefect Cloud is reasonably generous
- Can use advanced queueing schemes to manage workflow execution
- New features are being added regularly and rapidly
Cons:
- For those who are less HPC-savvy, some of the concepts can be quite technical
- If using Prefect Cloud, the compute nodes must have a network connection
- The dashboard, while useful for monitoring successes and failures, is not ideal for analyzing results
- The software is geared more towards data engineering than scientific computing, and that is reflected in the features and documentation
Redun is a flexible workflow management program developed by Insitro.
Pros:
- Extremely simple syntax for defining workflows
- Has strong support for task/result caching
- Useful CLI-based monitoring system
- Very strong AWS support
Cons:
- Currently lacks support for typical HPC job schedulers
- No user-friendly GUI for job monitoring
- Does not have a particularly active user community
- Not updated frequently
Jobflow is developed and maintained by the Materials Project team at Lawrence Berkeley National Laboratory and serves as a seamless interface to FireWorks or Jobflow Remote for dispatching and monitoring compute jobs.
Warning
Jobflow is not yet compatible with the @flow
or @subflow
decorators used in many quacc recipes and so should only be used if necessary. See this issue to track the progress of this enhancement.
Pros:
- Native support for a variety of databases
- Directly compatible with Atomate2
- Designed with materials science workflows in mind
- Actively supported by the Materials Project team
Cons:
- Is not fully compatible with all the features of
quacc
- Parsing the output of a workflow is not as intuitive as other solutions
- Defining dynamic workflows with Jobflow's
Response
object can be more complex than other solutions - FireWorks is not the most user-friendly, and Jobflow Remote is in active development