Integrate clusters into the `DoubleMLData` class, Refactor data generators, refactor sampling using Mixin Class #338

JanTeichertKluge · 2025-06-17T14:38:01Z

Description

This pull request introduces updates to the doubleml library, focusing on refactoring the support for cluster data, improving modularity, and deprecating unused features. Key changes include the addition of cluster-related functionality, deprecation of time (t_col) and score/selection (s_col) variables, and updates to documentation and examples to reflect these changes.

Refactoring for Cluster Data:

Added support for cluster variables (cluster_cols) in the DoubleMLData class, including a new is_cluster_data flag to indicate cluster data usage.
Moved methods and properties to handle cluster variables, such as _set_cluster_vars and cluster_vars.
Deprecated the DoubleMLClusterData class, replacing it with DoubleMLData using is_cluster_data=True. Warnings are added to inform users about the planned removal in version 0.12.0.
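The deprecation path described above follows a common pattern: the old class stays importable, emits a warning on construction, and forwards everything to the class that replaces it. The sketch below illustrates only that pattern. `BaseData` and `ClusterData` are hypothetical stand-ins, not the doubleml classes, and the use of `FutureWarning` is an assumption rather than something stated in this PR.

```python
import warnings


class BaseData:
    """Hypothetical stand-in for a data backend with optional cluster columns."""

    def __init__(self, columns, y_col, d_cols, cluster_cols=None):
        self.columns = list(columns)
        self.y_col = y_col
        self.d_cols = list(d_cols)
        self.cluster_cols = None if cluster_cols is None else list(cluster_cols)


class ClusterData(BaseData):
    """Deprecated alias kept only so that existing code keeps running."""

    def __init__(self, columns, y_col, d_cols, cluster_cols):
        # Warning category is an assumption; stacklevel=2 points at the caller.
        warnings.warn(
            "ClusterData is deprecated and will be removed in a future release; "
            "pass cluster_cols to BaseData instead.",
            FutureWarning,
            stacklevel=2,
        )
        super().__init__(columns, y_col, d_cols, cluster_cols=cluster_cols)
```

Because the deprecated class is a thin subclass, `isinstance` checks against the base class keep working for code that has not migrated yet.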

Refactoring for Model Specific Data Backends:

Removed t_col (time variable) and s_col (score/selection variable) from DoubleMLData and related methods. They are now only relevant for the model-specific data backends, e.g. those used by DoubleMLDID or DoubleMLSSM.
Updated the _data_summary_str method and other internal logic to exclude references to these deprecated variables.

Refactoring Sampling Procedure using Mixin Class:

Removed draw_sample_splitting and set_sample_splitting methods from all model / child classes.
Created a SampleSplittingMixin class that provides both methods.
Created specific _initialize_dml_model methods for DoubleML, DoubleMLQTE and DoubleMLAPOS.
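As a rough illustration of the mixin idea listed above, the sketch below puts both sample-splitting methods in one place and leaves a single hook for each model to reset its own state. It is a simplified toy, not the code from this PR: the class names `SplitMixin` and `ToyModel`, the attribute names, and the fold logic are all assumptions. Only the method names `draw_sample_splitting`, `set_sample_splitting`, and `_initialize_dml_model` come from the description.

```python
import random
from abc import ABC, abstractmethod


class SplitMixin(ABC):
    """Hypothetical mixin owning the sample-splitting logic shared by models."""

    def draw_sample_splitting(self, seed=None):
        # Shuffle the observation indices and deal them into n_folds test sets.
        rng = random.Random(seed)
        idx = list(range(self.n_obs))
        rng.shuffle(idx)
        folds = [sorted(idx[k::self.n_folds]) for k in range(self.n_folds)]
        splits = []
        for test in folds:
            test_set = set(test)
            train = [i for i in range(self.n_obs) if i not in test_set]
            splits.append((train, test))
        return self.set_sample_splitting(splits)

    def set_sample_splitting(self, splits):
        # Accept externally provided (train, test) pairs, then reset the model.
        self.smpls = [(list(train), list(test)) for train, test in splits]
        self._initialize_dml_model()
        return self

    @abstractmethod
    def _initialize_dml_model(self):
        """Each model decides what has to be rebuilt after the splits change."""


class ToyModel(SplitMixin):
    def __init__(self, n_obs, n_folds=5):
        self.n_obs = n_obs
        self.n_folds = n_folds
        self.predictions = None
        self.draw_sample_splitting(seed=0)

    def _initialize_dml_model(self):
        # Discard anything that was computed on the previous splits.
        self.predictions = [None] * self.n_obs
```

The design choice is that the splitting code no longer needs to know which arrays a particular model keeps; each class only implements the reset hook.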

Codebase Modularity:

Updated __init__.py files to include new data classes (DoubleMLDIDData, DoubleMLPanelData, DoubleMLRDDData, DoubleMLSSMData) and removed unused imports.
Refactored disjoint set checks to accommodate the new cluster_cols logic.
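To make the disjoint set check mentioned above concrete, here is a minimal sketch of what such a validation could look like once cluster columns are an optional role. The function name `check_disjoint_roles`, the role labels, and the error message are hypothetical; the actual checks in doubleml may be structured differently.

```python
def check_disjoint_roles(y_col, d_cols, x_cols, cluster_cols=None):
    """Raise ValueError if a column name is assigned to more than one role."""
    roles = {
        "outcome": {y_col},
        "treatment": set(d_cols),
        "covariate": set(x_cols),
        # An absent cluster specification simply contributes an empty set.
        "cluster": set(cluster_cols or []),
    }
    names = list(roles)
    # Compare every pair of roles exactly once.
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = roles[first] & roles[second]
            if overlap:
                raise ValueError(
                    f"Columns {sorted(overlap)} are used as both "
                    f"{first} and {second} variables."
                )
```

Treating a missing `cluster_cols` as an empty set lets the same check serve clustered and non-clustered data without a separate code path.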

Refactoring of Data Generators / Fetch Methods

Moved model-specific data generators into the model-specific submodules.
Adjusted the imports and the documentation examples.

Documentation and Examples:

Updated examples in the documentation to use the new submodules, e.g. doubleml.plm.datasets dgp path.

PR Checklist

The title of the pull request summarizes the changes made.
The PR contains a detailed description of all changes and additions.
References to related issues or PRs are added.
The code passes all (unit) tests.
Enhancements or new features are equipped with unit tests.
The changes adhere to the PEP8 standards.

SvenKlaassen · 2025-06-19T05:44:27Z

doubleml/data/base_data.py

Do we really need the input argument is_cluster_data?
I think this can be inferred via cluster_cols.

Remove the is_cluster_data parameter from DoubleMLData.__init__() and from_arrays().
Automatically set _is_cluster_data in the cluster_cols setter based on whether cluster_cols is not None.
Remove the is_cluster_data setter to make the property read-only and inferred.
Update dataset generators, tests, and backwards compatibility classes to remove explicit is_cluster_data=True arguments.
Maintain compatibility by keeping the is_cluster_data property for existing code.
Fixes test collection errors by eliminating the unused parameter.
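The change described in this commit message, where the flag is derived inside the cluster_cols setter and exposed read-only, can be sketched as follows. `DataSketch` is a hypothetical, stripped-down container written for illustration; it is not the DoubleMLData implementation, and the string-to-list normalisation is an assumption.

```python
class DataSketch:
    """Hypothetical container showing a flag inferred from cluster_cols."""

    def __init__(self, cluster_cols=None):
        # Assignment is routed through the property setter below.
        self.cluster_cols = cluster_cols

    @property
    def cluster_cols(self):
        return self._cluster_cols

    @cluster_cols.setter
    def cluster_cols(self, value):
        if isinstance(value, str):
            value = [value]
        self._cluster_cols = None if value is None else list(value)
        # The flag is derived here, so it cannot disagree with cluster_cols.
        self._is_cluster_data = self._cluster_cols is not None

    @property
    def is_cluster_data(self):
        # No setter is defined, so assigning to it raises AttributeError.
        return self._is_cluster_data
```

Keeping the property while dropping the constructor argument means existing code that reads `is_cluster_data` is unaffected, and only code that passed it explicitly has to change.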

JanTeichertKluge and others added 6 commits June 17, 2025 16:01

Merge pull request #337 from DoubleML/s-update-cross-sectional-did

5056151

S update cross sectional did

fix RDDData (finally...)

70d67ad

adjsut RDD Class

a322e35

adjust DID classes

0a9b3c7

Adjust unit tests for DID

37f11dc

Adjust RDD unit tests

7be2d8f

JanTeichertKluge requested review from SvenKlaassen and Copilot June 17, 2025 14:38

JanTeichertKluge linked an issue Jun 17, 2025 that may be closed by this pull request

[Feature Request]: Integrate Clusters into the DoubleMLData Class #305

Closed

github-advanced-security bot found potential problems Jun 17, 2025

JanTeichertKluge added 9 commits June 17, 2025 16:46

minor changes in high lvl unit tests

cbb3818

minor changes in high lvl unit tests

fb4f440

fix rdd unit tests

cc5a110

fix exception unit test

80a890e

fix unit tests for cluster variables (kwd arg instead of positional arg)

0207b67

update checks for correct data backend type

987f8b3

adjust unit tests

45b1c35

adjust unit tests

7c27750

adjust unit tests

025b75e

JanTeichertKluge requested a review from Copilot July 2, 2025 13:14

JanTeichertKluge self-assigned this Jul 2, 2025

JanTeichertKluge linked an issue Jul 2, 2025 that may be closed by this pull request

Rework the datasets module #272

Closed

Potential fix for code scanning alert no. 419: Unused import

270ed20

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

JanTeichertKluge and others added 4 commits July 2, 2025 15:17

Potential fix for code scanning alert no. 414: Unused local variable

c129395

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Potential fix for code scanning alert no. 415: Unused local variable

a76d4a7

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Potential fix for code scanning alert no. 421: Explicit returns mixed with implicit (fall through) returns

4b9a81b

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Update doubleml/utils/_check_return_types.py

1ffcbc6

Co-authored-by: Copilot <[email protected]>

github-advanced-security bot found potential problems Jul 2, 2025

doubleml/data/did_data.py (alert fixed)

Merge branch 'main' into 305-feature-request-integrate-clusters-into-the-doublemldata-class

0a52e5f

github-advanced-security bot found potential problems Sep 1, 2025

doubleml/did/datasets/dgp_did_SZ2020.py (alert fixed)

SvenKlaassen added 3 commits September 1, 2025 13:44

fix import

3ff1810

correct aliases

10a500a

add cluster exception test for rdd

e735652

SvenKlaassen approved these changes Sep 1, 2025

JanTeichertKluge added 12 commits September 1, 2025 16:58

forgot pre-commit...

68017d7

Create SampleSplittingMixin for DoubleML, DoubleMLQTE and `DoubleMLAPOS` classes.

883f780

Implement SampleSplittingMixin in DoubleML, DoubleMLQTE and DoubleMLAPOS

1869e32

fix n_obs in APOs

b00b7ae

add _strata attribute to classes before sample splitting

5ef4714

add set_sample_splitting method to mixin

aa61bd9

fix stratification for APOs model

d98efe8

refactor DoubleML to use mixin

cc7963e

implement new methods of mixin into model classes

447b628

add new _initialize_dml_model method to __init__ methods

99bf711

run pre-commit

61ef95d

github-advanced-security bot found potential problems Oct 2, 2025

doubleml/double_ml.py (alert fixed)

doubleml/irm/apos.py (alert fixed)

doubleml/irm/qte.py (alert fixed)

adjust mixin class

4a4bd17

JanTeichertKluge changed the title ~~Integrate clusters into the DoubleMLData class, Refactor data generators~~ Integrate clusters into the DoubleMLData class, Refactor data generators, refactor sampling using Mixin Class Oct 2, 2025

remove pass statement in abstract method.

4ed1f51

SvenKlaassen approved these changes Oct 2, 2025

correct selection variable array type to remove warning

d80459b

SvenKlaassen merged commit 92081ad into main Oct 7, 2025
12 checks passed

This was referenced Oct 7, 2025

Update Docs for datasets and data classes DoubleML/doubleml-docs#251

Merged

Update Data Backend and dataset paths DoubleML/doubleml-coverage#28

Merged

This was linked to issues Oct 16, 2025

Refactor Data Generators #306

Closed

[Feature Request]: Create separate data class for RDD #307

Closed

JanTeichertKluge mentioned this pull request Oct 28, 2025

LPLR model #365

Draft

3 participants
