This repository was archived by the owner on May 17, 2024. It is now read-only.

Commit 39445bf

Merge branch 'master' into test-sqeleton-pr15
2 parents bc0f757 + 14193b9 commit 39445bf


101 files changed (+10778 -1109 lines changed)

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -74,6 +74,9 @@ docs/_build/
 # PyBuilder
 target/
 
+# Exception for dbt tests
+!tests/dbt_artifacts/target
+
 # Jupyter Notebook
 .ipynb_checkpoints
 

README.md

Lines changed: 40 additions & 121 deletions
@@ -2,160 +2,79 @@
 <img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="50%" />
 </p>
 
-# **data-diff**
+<h1 align="center">
+data-diff
+</h1>
+
+<h2 align="center">
+Develop dbt models faster by testing as you code.
+</h2>
+<h4 align="center">
+See how every change to dbt code affects the data produced in the modified model and downstream.
+</h4>
+<br>
 
 ## What is `data-diff`?
-data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables. It's fast, easy to use, and reliable. Even at massive scale.
 
-## Documentation
+data-diff is an open source package that you can use to see the impact of your dbt code changes on your dbt models as you code.
 
-[**🗎 Documentation website**](https://docs.datafold.com/os_diff/about) - our detailed documentation has everything you need to start diffing.
+<div align="center">
 
-### Databases we support
+![development_testing_gif](https://user-images.githubusercontent.com/1799931/236354286-d1d044cf-2168-4128-8a21-8c8ca7fd494c.gif)
 
-- PostgreSQL >=10
-- MySQL
-- Snowflake
-- BigQuery
-- Redshift
-- Oracle
-- Presto
-- Databricks
-- Trino
-- Clickhouse
-- Vertica
-- DuckDB >=0.6
-- SQLite (coming soon)
+</div>
 
-For their corresponding connection strings, check out our [detailed table](https://docs.datafold.com/os_diff/databases_we_support).
+<br>
 
-#### Looking for a database not on the list?
-If a database is not on the list, we'd still love to support it. [Please open an issue](https://github.com/datafold/data-diff/issues) to discuss it, or vote on existing requests to push them up our todo list.
-
-## Use cases
-
-### Diff Tables Between Databases
-#### Quickly identify issues when moving data between databases
-
-<p align="center">
-<img alt="diff2" src="https://user-images.githubusercontent.com/1799931/196754998-a88c0a52-8751-443d-b052-26c03d99d9e5.png" />
-</p>
-
-### Diff Tables Within a Database
-#### Improve code reviews by identifying data problems you don't have tests for
-<p align="center">
-<a href=https://www.loom.com/share/682e4b7d74e84eb4824b983311f0a3b2 target="_blank">
-<img alt="Intro to Diff" src="https://user-images.githubusercontent.com/1799931/196576582-d3535395-12ef-40fd-bbbb-e205ccae1159.png" width="50%" height="50%" />
-</a>
-</p>
-
-&nbsp;
-&nbsp;
-
-## Get started
-
-### Installation
-
-#### First, install `data-diff` using `pip`.
+## Getting Started
 
+**Install `data-diff`**
 ```
 pip install data-diff
 ```
 
-#### Then, install one or more driver(s) specific to the database(s) you want to connect to.
-
-- `pip install 'data-diff[mysql]'`
-
-- `pip install 'data-diff[postgresql]'`
-
-- `pip install 'data-diff[snowflake]'`
-
-- `pip install 'data-diff[presto]'`
-
-- `pip install 'data-diff[oracle]'`
-
-- `pip install 'data-diff[trino]'`
-
-- `pip install 'data-diff[clickhouse]'`
-
-- `pip install 'data-diff[vertica]'`
-
-- For BigQuery, see: https://pypi.org/project/google-cloud-bigquery/
-
-_Some drivers have dependencies that cannot be installed using `pip` and still need to be installed manually._
-
-### Run your first diff
-
-Once you've installed `data-diff`, you can run it from the command line.
-
+**Update a few lines in your `dbt_project.yml`**
 ```
-data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]
+#dbt_project.yml
+vars:
+  data_diff:
+    prod_database: my_database
+    prod_schema: my_default_schema
 ```
 
-Be sure to read [the docs](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_command_line) for detailed instructions how to build one of these commands depending on your database setup.
-
-#### Code Example: Diff Tables Between Databases
-Here's an example command for your copy/pasting, taken from the screenshot above when we diffed data between Snowflake and Postgres.
+**Run your first data diff!**
 
 ```
-data-diff \
-postgresql://<username>:'<password>'@localhost:5432/<database> \
-<table> \
-"snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
-<TABLE> \
--k activity_id \
--c activity \
--w "event_timestamp < '2022-10-10'"
+dbt run && data-diff --dbt
 ```
 
-#### Code Example: Diff Tables Within a Database
+We recommend you get started by walking through [our simple setup instructions](https://docs.datafold.com/development_testing/open_source) which contain examples and details.
 
-Here's a code example from [the video](https://www.loom.com/share/682e4b7d74e84eb4824b983311f0a3b2), where we compare data between two Snowflake tables within one database.
+Please reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) if you have any trouble whatsoever getting started!
 
-```
-data-diff \
-"snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA_1>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_1> \
-<SCHEMA_2>.<TABLE_2> \
--k org_id \
--c created_at -c is_internal \
--w "org_id != 1 and org_id < 2000" \
--m test_results_%t \
---materialize-all-rows \
---table-write-limit 10000
-```
-
-In both code examples, I've used `<>` carrots to represent values that **should be replaced with your values** in the database connection strings. For the flags (`-k`, `-c`, etc.), I opted for "real" values (`org_id`, `is_internal`) to give you a more realistic view of what your command will look like.
-
-### We're here to help!
-
-We know that in some cases, the data-diff command can become long and dense. And maybe you're new to the command line.
+<br><br>
 
-* We're here to help [on slack](https://locallyoptimistic.slack.com/archives/C03HUNGQV0S) if you have ANY questions as you use `data-diff` in your workflow.
-* You can also post a question in [GitHub Discussions](https://github.com/datafold/data-diff/discussions).
+### Diffing between databases
 
+Check out our [documentation](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md) if you're looking to compare data across databases (for example, between Postgres and Snowflake).
 
-To get a Slack invite - [click here](https://locallyoptimistic.com/community/)
+<br>
 
-## How to Use
+## Contributors
 
-* [How to use from the shell (or: command-line)](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_command_line)
-* [How to use from Python](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_python)
-* [How to use with TOML configuration file](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_toml)
-* [Usage Analytics & Data Privacy](https://docs.datafold.com/os_diff/usage_analytics_data_privacy)
-
-## How to Contribute
-* Feel free to open an issue or contribute to the project by working on an existing issue.
-* Please read the [contributing guidelines](https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md) to get started.
-
-Big thanks to everyone who contributed so far:
+We thank everyone who contributed so far!
 
 <a href="https://github.com/datafold/data-diff/graphs/contributors">
 <img src="https://contributors-img.web.app/image?repo=datafold/data-diff" />
 </a>
 
-## Technical Explanation
+<br>
+
+## Analytics
+
+* [Usage Analytics & Data Privacy](https://github.com/datafold/data-diff/blob/master/docs/usage_analytics.md)
 
-Check out this [technical explanation](https://docs.datafold.com/os_diff/technical_explanation) of how data-diff works.
+<br>
 
 ## License
 
data_diff/__init__.py

Lines changed: 6 additions & 6 deletions
@@ -1,14 +1,14 @@
 from typing import Sequence, Tuple, Iterator, Optional, Union
 
-from sqeleton.abcs import DbKey, DbTime, DbPath
+from data_diff.sqeleton.abcs import DbTime, DbPath
 
 from .tracking import disable_tracking
 from .databases import connect
 from .diff_tables import Algorithm
 from .hashdiff_tables import HashDiffer, DEFAULT_BISECTION_THRESHOLD, DEFAULT_BISECTION_FACTOR
 from .joindiff_tables import JoinDiffer, TABLE_WRITE_LIMIT
 from .table_segment import TableSegment
-from .utils import eval_name_template
+from .utils import eval_name_template, Vector
 
 
 def connect_to_table(
@@ -51,8 +51,8 @@ def diff_tables(
     # Extra columns to compare
     extra_columns: Tuple[str, ...] = None,
     # Start/end key_column values, used to restrict the segment
-    min_key: DbKey = None,
-    max_key: DbKey = None,
+    min_key: Vector = None,
+    max_key: Vector = None,
     # Start/end update_column values, used to restrict the segment
     min_update: DbTime = None,
     max_update: DbTime = None,
@@ -87,8 +87,8 @@ def diff_tables(
         update_column (str, optional): Name of updated column, which signals that rows changed.
             Usually updated_at or last_update. Used by `min_update` and `max_update`.
         extra_columns (Tuple[str, ...], optional): Extra columns to compare
-        min_key (:data:`DbKey`, optional): Lowest key value, used to restrict the segment
-        max_key (:data:`DbKey`, optional): Highest key value, used to restrict the segment
+        min_key (:data:`Vector`, optional): Lowest key value, used to restrict the segment
+        max_key (:data:`Vector`, optional): Highest key value, used to restrict the segment
         min_update (:data:`DbTime`, optional): Lowest update_column value, used to restrict the segment
         max_update (:data:`DbTime`, optional): Highest update_column value, used to restrict the segment
         threaded (bool): Enable/disable threaded diffing. Needed to take advantage of database threads.
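
The `min_key`/`max_key` annotations change from `DbKey` to `Vector`, which suggests the bounds may now carry one value per key column. Below is a minimal sketch of that idea, not the library's implementation. It assumes bounds are applied column by column, with an inclusive lower bound and an exclusive upper bound. `KeyBounds` and `in_segment` are hypothetical names used only for illustration, and a plain tuple stands in for data-diff's own `Vector` type.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical stand-in for illustration; data-diff's own Vector type is not used here.
KeyBounds = Tuple[int, ...]


def in_segment(
    key: Sequence[int],
    min_key: Optional[KeyBounds] = None,
    max_key: Optional[KeyBounds] = None,
) -> bool:
    """Return True if every key column lies within its own [min, max) bound.

    Assumption: bounds are applied column by column, lower bound inclusive,
    upper bound exclusive. A bound of None leaves that side unrestricted.
    """
    if min_key is not None:
        if len(min_key) != len(key):
            raise ValueError("min_key must have one value per key column")
        if not all(lo <= k for lo, k in zip(min_key, key)):
            return False
    if max_key is not None:
        if len(max_key) != len(key):
            raise ValueError("max_key must have one value per key column")
        if not all(k < hi for k, hi in zip(key, max_key)):
            return False
    return True


# Rows keyed by a two-column compound key.
rows = [(1, 10), (5, 20), (9, 30), (5, 99)]
selected = [r for r in rows if in_segment(r, min_key=(2, 0), max_key=(9, 50))]
print(selected)  # prints [(5, 20)]
```

Under these assumptions a row is kept only when every key column falls inside its own range, so `(5, 99)` is dropped even though its first column is in bounds.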

data_diff/__main__.py

Lines changed: 5 additions & 4 deletions
@@ -10,8 +10,8 @@
 import rich
 import click
 
-from sqeleton.schema import create_schema
-from sqeleton.queries.api import current_timestamp
+from data_diff.sqeleton.schema import create_schema
+from data_diff.sqeleton.queries.api import current_timestamp
 
 from .dbt import dbt_diff
 from .utils import eval_name_template, remove_password_from_url, safezip, match_like
@@ -217,15 +217,16 @@ def write_usage(self, prog: str, args: str = "", prefix: Optional[str] = None) -
 )
 @click.option(
     "--dbt-profiles-dir",
+    envvar="DBT_PROFILES_DIR",
     default=None,
     metavar="PATH",
-    help="Override the default dbt profile location (~/.dbt).",
+    help="Which directory to look in for the profiles.yml file. If not set, we follow the default profiles.yml location for the dbt version being used. Can also be set via the DBT_PROFILES_DIR environment variable.",
 )
 @click.option(
     "--dbt-project-dir",
     default=None,
     metavar="PATH",
-    help="Override the dbt project directory. Otherwise assumed to be the current directory.",
+    help="Which directory to look in for the dbt_project.yml file. Default is the current working directory and its parents.",
 )
 def main(conf, run, **kw):
     if kw["table2"] is None and kw["database2"]:
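
The new help texts describe two lookup rules: the profiles directory can come from the flag or from `DBT_PROFILES_DIR`, and the project directory defaults to the current working directory and its parents. Below is a minimal standard-library sketch of both rules, assuming the flag wins over the environment variable (the order click uses for `envvar` options). `resolve_profiles_dir` and `find_project_dir` are hypothetical helper names and are not taken from this commit.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_profiles_dir(
    cli_value: Optional[str], env: Mapping[str, str] = os.environ
) -> Optional[Path]:
    """Pick the profiles directory: explicit flag first, then DBT_PROFILES_DIR.

    Returns None when neither is set, leaving the dbt-version-specific
    default to the caller (that default is deliberately not guessed here).
    """
    if cli_value:
        return Path(cli_value)
    env_value = env.get("DBT_PROFILES_DIR")
    return Path(env_value) if env_value else None


def find_project_dir(start: Path) -> Optional[Path]:
    """Walk from `start` up through its parents looking for dbt_project.yml."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / "dbt_project.yml").is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Illustrative calls; the results depend on the local environment.
    print(resolve_profiles_dir(None))
    print(find_project_dir(Path.cwd()))
```

Passing the environment as a mapping keeps the helper easy to exercise without touching the real process environment.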

data_diff/cloud/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+from .datafold_api import DatafoldAPI, TCloudApiDataDiff
+from .data_source import get_or_create_data_source
