
Lightning talk - Managing cluster lifecycle with dask-ctl

Jacob Tomlinson

May 20, 2021

Transcript

  1. Tight lifecycle coupling (diagram: Cluster Manager → Scheduler → Workers). Today the Python process contains the cluster manager, which holds the only references to the cluster resources it created. This makes sense when the scheduler and workers are subprocesses on the same machine, but less so when they are remote, independent resources on a cloud or HPC platform.
  2. Tight lifecycle coupling. What happens to the cluster resources if the Python process that created them is killed? Sometimes the OS may clean them up. Sometimes they will time out or hit a wall time. Sometimes they will exist until you manually delete them and cost you money.
  3. Forcibly restarting your notebook kernel shouldn’t leave cluster resources on some cloud or other platform that need to be killed manually.
  4. Discovery. The big challenge in managing cluster lifecycle is being able to discover existing clusters and then reconstruct the cluster managers that represent them. With dask-ctl, cluster managers can register a discovery method via the dask_cluster_discovery entrypoint. To support dask-ctl, a cluster manager should implement this entrypoint: search for clusters (by listing jobs/pods/VMs/etc. and looking for Dask cluster resources) and return an iterable of cluster names together with the cluster manager classes that can recreate them (see the sketch below).
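A minimal sketch of what such a discovery method could look like, assuming an async-generator style that yields (cluster name, cluster manager class) pairs. The foo_platform API and FooCluster manager are hypothetical, and the exact signature dask-ctl expects may differ; treat this as illustrative rather than the library's definitive contract.

```python
# Hypothetical discovery function for an imaginary FooCluster manager.
# Registered under the dask_cluster_discovery entrypoint in the package
# metadata so dask-ctl can find it.
from typing import AsyncIterator, Callable, Tuple


async def discover() -> AsyncIterator[Tuple[str, Callable]]:
    """List Foo platform jobs and yield any that look like Dask clusters."""
    from foo_platform import list_jobs  # hypothetical platform API

    for job in await list_jobs():
        if job.tags.get("dask-cluster"):
            # Yield the cluster's name and the manager class that can recreate it.
            yield job.name, FooCluster
```

The function would then be advertised in the package metadata under the dask_cluster_discovery entry point group (for example, an entry like "foocluster = foo_cluster.discovery:discover"; the module path here is made up for illustration).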
  5. Reconstruction. Once clusters can be discovered, each cluster manager needs a way of reconstructing its representation from the cluster’s name/UUID. In dask-ctl we try to call a ClusterManager.from_name(name) class method to reconstruct the cluster object. To support dask-ctl, cluster managers also need to implement this method (see the sketch below).
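A rough sketch of how a from_name class method might be implemented, again using the hypothetical FooCluster manager and foo_platform API. The key idea is to look up the already-running scheduler by name and build a manager object around it instead of creating new resources.

```python
class FooCluster:
    """Hypothetical cluster manager for the imaginary Foo platform."""

    def __init__(self, name: str, scheduler_address: str):
        self.name = name
        self.scheduler_address = scheduler_address

    @classmethod
    def from_name(cls, name: str) -> "FooCluster":
        """Reconstruct the manager for an existing cluster from its name."""
        from foo_platform import get_job  # hypothetical platform API

        job = get_job(name)  # find the already-running scheduler job
        return cls(name=name, scheduler_address=job.scheduler_address)
```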
  6. Lifecycle. Once we can discover clusters and reconstruct the cluster managers that represent them, we can always perform lifecycle management operations, including: • Getting logs • Connecting clients • Scaling • Deleting. (Lifecycle diagram: Create → Scale → Compute → Delete.)
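For example, once a cluster object has been reconstructed the usual Dask cluster operations are available again. The get_cluster helper below is an assumption about dask-ctl's Python API; if your version does not provide it, call from_name on the relevant cluster manager class directly. The cluster name is made up for illustration.

```python
from dask.distributed import Client
from dask_ctl import get_cluster  # assumed helper wrapping ClusterManager.from_name

cluster = get_cluster("foocluster-1234")  # hypothetical cluster name

print(cluster.get_logs())  # getting logs
client = Client(cluster)   # connecting a client
cluster.scale(10)          # scaling to 10 workers

# ... run some computation via `client` ...

client.close()
cluster.close()            # deleting the cluster resources
```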
  7. Next steps. • Implement dask-ctl support in as many cluster managers as possible. • Add support for cluster discovery to the Dask JupyterLab extension so that discovered clusters are listed in the sidebar. • Stabilize things and move from the dask-contrib org to dask.