@JanTeichertKluge
Member

@JanTeichertKluge JanTeichertKluge commented Jun 17, 2025

Description

This pull request introduces updates to the doubleml library, focusing on refactoring the support for cluster data, improving modularity, and deprecating unused features. Key changes include the addition of cluster-related functionality, deprecation of time (t_col) and score/selection (s_col) variables, and updates to documentation and examples to reflect these changes.

Refactoring for Cluster Data:

  • Added support for cluster variables (cluster_cols) in the DoubleMLData class, including a new is_cluster_data flag to indicate cluster data usage.
  • Moved the methods and properties that handle cluster variables, such as _set_cluster_vars and cluster_vars, into DoubleMLData.
  • Deprecated the DoubleMLClusterData class, replacing it with DoubleMLData using is_cluster_data=True. Warnings are added to inform users about the planned removal in version 0.12.0.
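
The deprecation path described in this list can be pictured with a small, library-independent sketch. ToyData and ToyClusterData are hypothetical stand-ins, not the doubleml classes, and the warning category and message wording are assumptions; only the 0.12.0 removal target comes from the description above.

```python
import warnings


class ToyData:
    """Hypothetical stand-in for a data backend with optional cluster columns."""

    def __init__(self, columns, y_col, d_cols, cluster_cols=None):
        self.columns = list(columns)
        self.y_col = y_col
        self.d_cols = list(d_cols)
        # None means "no clustering"; otherwise store a list of column names.
        self.cluster_cols = None if cluster_cols is None else list(cluster_cols)


class ToyClusterData(ToyData):
    """Deprecated alias kept only for backwards compatibility (assumed design)."""

    def __init__(self, columns, y_col, d_cols, cluster_cols):
        warnings.warn(
            "ToyClusterData is deprecated and is planned for removal in 0.12.0; "
            "pass cluster_cols to ToyData instead.",
            FutureWarning,
            stacklevel=2,
        )
        super().__init__(columns, y_col, d_cols, cluster_cols=cluster_cols)


# Old-style construction still works but emits a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = ToyClusterData(["y", "d", "x1", "firm"], "y", ["d"], ["firm"])

# New-style construction goes through the single base class.
new_style = ToyData(["y", "d", "x1", "firm"], "y", ["d"], cluster_cols=["firm"])
```

Keeping the old class as a thin subclass means existing user code keeps running during the deprecation window while all cluster logic lives in one place.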

Refactoring for Model Specific Data Backends:

  • Removed t_col (time variable) and s_col (score/selection variable) from DoubleMLData and related methods. They are now only relevant for the model-specific data backends used in, e.g., DoubleMLDID or DoubleMLSSM.
  • Updated the _data_summary_str method and other internal logic to exclude references to these deprecated variables.
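
One way to read this split is that the generic backend knows nothing about t_col, and a model-specific subclass adds it. The following is a rough sketch under that assumption; BaseData and DIDLikeData are hypothetical names, and the summary format is invented for illustration rather than copied from doubleml.

```python
class BaseData:
    """Hypothetical generic backend: outcome, treatment and covariates only."""

    def __init__(self, y_col, d_cols, x_cols):
        self.y_col = y_col
        self.d_cols = list(d_cols)
        self.x_cols = list(x_cols)

    def _data_summary_str(self):
        return (
            f"Outcome variable: {self.y_col}\n"
            f"Treatment variable(s): {self.d_cols}\n"
            f"Covariates: {self.x_cols}\n"
        )


class DIDLikeData(BaseData):
    """Hypothetical model-specific backend that adds a time column."""

    def __init__(self, y_col, d_cols, x_cols, t_col):
        super().__init__(y_col, d_cols, x_cols)
        self.t_col = t_col

    def _data_summary_str(self):
        # Extend the generic summary instead of special-casing t_col in the base.
        return super()._data_summary_str() + f"Time variable: {self.t_col}\n"


base_summary = BaseData("y", ["d"], ["x1", "x2"])._data_summary_str()
did_summary = DIDLikeData("y", ["d"], ["x1", "x2"], t_col="t")._data_summary_str()
```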

Refactoring Sampling Procedure using Mixin Class:

  • Removed draw_sample_splitting and set_sample_splitting methods from all model / child classes.
  • Created a SampleSplittingMixins class that provides both methods.
  • Created specific _initialize_dml_model methods for DoubleML, DoubleMLQTE and DoubleMLAPOs.
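
The mixin idea can be sketched without the library. In the example below only the two method names come from the list above; the class names, the attributes n_obs, n_folds and n_rep, the seed argument, and the splitting logic itself are assumptions made for illustration.

```python
import random


class SampleSplittingMixin:
    """Hypothetical mixin; the host class must define n_obs, n_folds and n_rep."""

    def draw_sample_splitting(self, seed=None):
        rng = random.Random(seed)
        smpls = []
        for _ in range(self.n_rep):
            idx = list(range(self.n_obs))
            rng.shuffle(idx)
            # Deal shuffled indices into n_folds test sets of near-equal size.
            test_sets = [sorted(idx[k::self.n_folds]) for k in range(self.n_folds)]
            folds = []
            for test in test_sets:
                held_out = set(test)
                train = [i for i in range(self.n_obs) if i not in held_out]
                folds.append((train, test))
            smpls.append(folds)
        return self.set_sample_splitting(smpls)

    def set_sample_splitting(self, all_smpls):
        # Validate that each repetition's test sets partition the observations.
        for folds in all_smpls:
            covered = sorted(i for _, test in folds for i in test)
            if covered != list(range(self.n_obs)):
                raise ValueError("Test sets must partition the observations.")
        self._smpls = all_smpls
        return self


class ToyModel(SampleSplittingMixin):
    def __init__(self, n_obs, n_folds=5, n_rep=1):
        self.n_obs, self.n_folds, self.n_rep = n_obs, n_folds, n_rep
        self._smpls = None


class ToyQuantileModel(SampleSplittingMixin):
    # A second, unrelated model class reuses the same two methods unchanged.
    def __init__(self, n_obs, n_folds=2, n_rep=3):
        self.n_obs, self.n_folds, self.n_rep = n_obs, n_folds, n_rep
        self._smpls = None


model = ToyModel(n_obs=10).draw_sample_splitting(seed=42)
qmodel = ToyQuantileModel(n_obs=7).draw_sample_splitting(seed=0)
```

Because the mixin only relies on a few attributes of the host object, each model class can keep its own initialization while sharing one implementation of the splitting methods.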

Codebase Modularity:

  • Updated __init__.py files to include new data classes (DoubleMLDIDData, DoubleMLPanelData, DoubleMLRDDData, DoubleMLSSMData) and removed unused imports.
  • Refactored disjoint set checks to accommodate the new cluster_cols logic.
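
A disjoint set check that also covers cluster columns might look roughly like the sketch below. The function name check_disjoint_roles, its signature, and the error message are hypothetical and are not taken from the doubleml code base.

```python
from itertools import combinations


def check_disjoint_roles(y_col, d_cols, x_cols, z_cols=None, cluster_cols=None):
    """Raise ValueError if any column name is assigned to more than one role."""
    roles = {
        "y_col": {y_col},
        "d_cols": set(d_cols),
        "x_cols": set(x_cols),
        "z_cols": set(z_cols or []),
        "cluster_cols": set(cluster_cols or []),
    }
    # Compare every pair of roles once and report the first overlap found.
    for (name_a, cols_a), (name_b, cols_b) in combinations(roles.items(), 2):
        overlap = cols_a & cols_b
        if overlap:
            raise ValueError(
                f"{sorted(overlap)} cannot be used in both {name_a} and {name_b}."
            )


# Valid assignment: no column appears in two roles, so nothing is raised.
check_disjoint_roles("y", ["d"], ["x1", "x2"], cluster_cols=["firm"])

# Invalid assignment: "firm" is both a covariate and a cluster variable.
try:
    check_disjoint_roles("y", ["d"], ["x1", "firm"], cluster_cols=["firm"])
except ValueError as err:
    message = str(err)
```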

Refactoring of Data Generators / Fetch Methods:

  • Moved model-specific data generators into the model-specific submodules.
  • Adjusted the imports and the documentation examples accordingly.

Documentation and Examples:

  • Updated examples in the documentation to import from the new submodules, e.g. DGPs from doubleml.plm.datasets.

PR Checklist

  • The title of the pull request summarizes the changes made.
  • The PR contains a detailed description of all changes and additions.
  • References to related issues or PRs are added.
  • The code passes all (unit) tests.
  • Enhancements or new features are equipped with unit tests.
  • The changes adhere to the PEP8 standards.

This comment was marked as outdated.

@JanTeichertKluge JanTeichertKluge requested a review from Copilot July 2, 2025 13:14
@JanTeichertKluge JanTeichertKluge self-assigned this Jul 2, 2025
@JanTeichertKluge JanTeichertKluge linked an issue Jul 2, 2025 that may be closed by this pull request
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

This comment was marked as outdated.

JanTeichertKluge and others added 4 commits July 2, 2025 15:17
… with implicit (fall through) returns

Member

Do we really need the input argument is_cluster_data?
I think this can be inferred from cluster_cols.

Remove is_cluster_data parameter from DoubleMLData.__init__() and from_arrays()
Automatically set _is_cluster_data in cluster_cols setter based on whether cluster_cols is not None
Remove is_cluster_data setter to make property read-only and inferred
Update dataset generators, tests, and backwards compatibility classes to remove explicit is_cluster_data=True arguments
Maintain compatibility by keeping is_cluster_data property for existing code
Fixes test collection errors by eliminating unused parameter
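
The change summarized in these lines, a flag that is set inside the cluster_cols setter and exposed as a read-only property, can be sketched as follows. InferredClusterData is a hypothetical toy class; only the names cluster_cols, is_cluster_data and _is_cluster_data are taken from the lines above, and the string-to-list handling is an added assumption.

```python
class InferredClusterData:
    """Hypothetical backend where is_cluster_data is derived, not passed in."""

    def __init__(self, y_col, d_cols, cluster_cols=None):
        self.y_col = y_col
        self.d_cols = list(d_cols)
        self.cluster_cols = cluster_cols  # goes through the setter below

    @property
    def cluster_cols(self):
        return self._cluster_cols

    @cluster_cols.setter
    def cluster_cols(self, value):
        if isinstance(value, str):
            value = [value]
        self._cluster_cols = None if value is None else list(value)
        # The flag is updated here and nowhere else.
        self._is_cluster_data = self._cluster_cols is not None

    @property
    def is_cluster_data(self):
        # No setter is defined, so assigning to this attribute raises AttributeError.
        return self._is_cluster_data


data = InferredClusterData("y", ["d"])
flag_before = data.is_cluster_data
data.cluster_cols = "firm"
flag_after = data.is_cluster_data

try:
    data.is_cluster_data = False
    assignment_rejected = False
except AttributeError:
    assignment_rejected = True
```

Deriving the flag in one place removes the possibility of passing cluster_cols together with a contradictory is_cluster_data value.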
@JanTeichertKluge JanTeichertKluge changed the title from "Integrate clusters into the DoubleMLData class, Refactor data generators" to "Integrate clusters into the DoubleMLData class, Refactor data generators, refactor sampling using Mixin Class" Oct 2, 2025
@SvenKlaassen SvenKlaassen merged commit 92081ad into main Oct 7, 2025
12 checks passed
@JanTeichertKluge JanTeichertKluge mentioned this pull request Oct 28, 2025

3 participants