Distributed computing with Dask #2032

@mrocklin

Hello, I am an author of Dask, a library for parallel and distributed computing in Python. I am curious whether there is interest within this community in collaborating on distributing XGBoost on Dask, either for parallel training or for ETL.

There are probably two components of Dask that are relevant for this project:

  1. A generic system for parallel and distributed computing, built on arbitrary dynamic task scheduling. The relevant APIs here are probably dask.delayed and concurrent.futures.
  2. A parallel and distributed subset of the Pandas API, dask.dataframe, useful for feature engineering and data pre-processing. This doesn't implement the entire Pandas API, but comes decently close.
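
To make the futures-style task submission from item 1 concrete, here is a minimal sketch using only the standard library's concurrent.futures. It is not Dask or XGBoost code: `train_on_partition` and `combine` are hypothetical stand-ins, and the toy data is made up for illustration. Dask's distributed client offers a similar submit/result pattern, so the overall shape would probably carry over, but how an actual integration would look is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def train_on_partition(partition):
    """Hypothetical stand-in for fitting a model on one data partition."""
    # Only a mean is computed here; a real worker would train on its shard.
    return sum(partition) / len(partition)


def combine(results):
    """Hypothetical reduction step that merges per-partition results."""
    return sum(results) / len(results)


# Toy data split into partitions, roughly as a partitioned dataframe would be.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() returns immediately with a future for each task.
    futures = [executor.submit(train_on_partition, p) for p in partitions]
    # result() blocks until the corresponding task has finished.
    partial_results = [f.result() for f in futures]

overall = combine(partial_results)
print(overall)
```

The per-partition work runs independently and only the small partial results are brought together at the end, which is the general pattern that either parallel training or ETL would need from a task scheduler.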

Is there interest in collaborating here?
