|
4 | 4 |
|
5 | 5 | # **data-diff**
|
6 | 6 |
|
7 |
| -## What is `data-diff`? |
8 |
| -data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables. |
9 |
| - |
10 |
| -## Documentation |
11 |
| - |
12 |
| -[**🗎 Documentation**](https://docs.datafold.com/guides/os_data_diff) - our detailed documentation has everything you need to start diffing. |
| 7 | +<h2 align="center"> |
| 8 | +Develop dbt models faster by testing as you code. |
| 9 | +</h2> |
| 10 | +<h4 align="center"> |
| 11 | +See how every change to dbt code affects the data produced in the modified model and downstream. |
| 12 | +</h4> |
| 13 | +<br> |
13 | 14 |
|
14 |
| -### Databases we support |
| 15 | +## What is `data-diff`? |
15 | 16 |
|
16 |
| -- PostgreSQL >=10 |
17 |
| -- MySQL |
18 |
| -- Snowflake |
19 |
| -- BigQuery |
20 |
| -- Redshift |
21 |
| -- Oracle |
22 |
| -- Presto |
23 |
| -- Databricks |
24 |
| -- Trino |
25 |
| -- Clickhouse |
26 |
| -- Vertica |
27 |
| -- DuckDB >=0.6 |
28 |
| -- SQLite (coming soon) |
| 17 | +`data-diff` is an open-source package that shows how your dbt code changes affect your dbt models as you code. |
29 | 18 |
|
30 |
| -For their corresponding connection strings, check out our [detailed table](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md). |
| 19 | +<div align="center"> |
31 | 20 |
|
32 |
| -#### Looking for a database not on the list? |
33 |
| -If a database is not on the list, we'd still love to support it. [Please open an issue](https://github.com/datafold/data-diff/issues) to discuss it, or vote on existing requests to push them up our todo list. |
| 21 | + |
34 | 22 |
|
35 |
| -## Get started |
| 23 | +</div> |
36 | 24 |
|
37 |
| -### Installation |
| 25 | +<br> |
38 | 26 |
|
39 |
| -#### First, install `data-diff` using `pip`. |
| 27 | +## Getting Started |
40 | 28 |
|
| 29 | +**Install `data-diff`** |
41 | 30 | ```
|
42 | 31 | pip install data-diff
|
43 | 32 | ```
|
44 | 33 |
|
45 |
| -#### Then, install one or more driver(s) specific to the database(s) you want to connect to. |
46 |
| - |
47 |
| -- `pip install 'data-diff[mysql]'` |
48 |
| - |
49 |
| -- `pip install 'data-diff[postgresql]'` |
50 |
| - |
51 |
| -- `pip install 'data-diff[snowflake]'` |
52 |
| - |
53 |
| -- `pip install 'data-diff[presto]'` |
54 |
| - |
55 |
| -- `pip install 'data-diff[oracle]'` |
56 |
| - |
57 |
| -- `pip install 'data-diff[trino]'` |
58 |
| - |
59 |
| -- `pip install 'data-diff[clickhouse]'` |
60 |
| - |
61 |
| -- `pip install 'data-diff[vertica]'` |
62 |
| - |
63 |
| -- For BigQuery, see: https://pypi.org/project/google-cloud-bigquery/ |
64 |
| - |
65 |
| -_Some drivers have dependencies that cannot be installed using `pip` and still need to be installed manually._ |
66 |
| - |
67 |
| -### Run your first diff |
68 |
| - |
69 |
| -Once you've installed `data-diff`, you can run it from the command line. |
70 |
| - |
| 34 | +**Update a few lines in your `dbt_project.yml`** |
71 | 35 | ```
|
72 |
| -data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS] |
| 36 | +#dbt_project.yml |
| 37 | +vars: |
| 38 | +  data_diff: |
| 39 | +    prod_database: my_database |
| 40 | +    prod_schema: my_default_schema |
73 | 41 | ```
|
74 | 42 |
|
75 |
| -Be sure to read [the docs](https://docs.datafold.com/reference/open_source/cli) for detailed instructions on how to build one of these commands for your database setup. |
76 |
| - |
77 |
| -#### Code Example: Diff Tables Between Databases |
78 |
| -Here's an example command for your copy/pasting, taken from the screenshot above when we diffed data between Snowflake and Postgres. |
| 43 | +**Run your first data diff!** |
79 | 44 |
|
80 | 45 | ```
|
81 |
| -data-diff \ |
82 |
| - postgresql://<username>:'<password>'@localhost:5432/<database> \ |
83 |
| - <table> \ |
84 |
| - "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \ |
85 |
| - <TABLE> \ |
86 |
| - -k activity_id \ |
87 |
| - -c activity \ |
88 |
| - -w "event_timestamp < '2022-10-10'" |
| 46 | +dbt run && data-diff --dbt |
89 | 47 | ```
|
90 | 48 |
|
91 |
| -#### Code Example: Diff Tables Within a Database |
| 49 | +We recommend starting with [our setup instructions](https://docs.datafold.com/development_testing/open_source), which include examples and further details. |
92 | 50 |
|
93 |
| -``` |
94 |
| -data-diff \ |
95 |
| - "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA_1>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_1> \ |
96 |
| - <SCHEMA_2>.<TABLE_2> \ |
97 |
| - -k org_id \ |
98 |
| - -c created_at -c is_internal \ |
99 |
| - -w "org_id != 1 and org_id < 2000" \ |
100 |
| - -m test_results_%t \ |
101 |
| - --materialize-all-rows \ |
102 |
| - --table-write-limit 10000 |
103 |
| -``` |
104 |
| - |
105 |
| -In both code examples, I've used angle brackets (`<>`) to mark values that **should be replaced with your values** in the database connection strings. For the flags (`-k`, `-c`, etc.), I opted for "real" values (`org_id`, `is_internal`) to give you a more realistic view of what your command will look like. |
| 51 | +Please reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) if you have any trouble whatsoever getting started! |
106 | 52 |
|
107 |
| -### We're here to help! |
| 53 | +<br><br> |
108 | 54 |
|
109 |
| -We're here to help! Please post any questions in [GitHub Discussions](https://github.com/datafold/data-diff/discussions). |
| 55 | +### Diffing between databases |
110 | 56 |
|
111 |
| -## How to Use |
| 57 | +Check out our [documentation](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md) if you're looking to compare data across databases (for example, between Postgres and Snowflake). |
112 | 58 |
|
113 |
| -* [Examples with dbt, joindiff, and hashdiff](https://docs.datafold.com/reference/open_source/cli#examples) |
114 |
| -* [Examples with Python](https://data-diff.readthedocs.io/en/latest/python-api.html) |
115 |
| -* [How to use with TOML configuration file](https://docs.datafold.com/reference/open_source/cli#toml-config-file) |
| 59 | +<br> |
116 | 60 |
|
117 |
| -## How to Contribute |
118 |
| -* Feel free to open an issue or contribute to the project by working on an existing issue. |
119 |
| -* Please read the [contributing guidelines](https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md) to get started. |
120 |
| -* To add a new database driver, check out [docs](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst). |
| 61 | +## Contributors |
121 | 62 |
|
122 |
| -Big thanks to everyone who contributed so far: |
| 63 | +Thanks to everyone who has contributed so far! |
123 | 64 |
|
124 | 65 | <a href="https://github.com/datafold/data-diff/graphs/contributors">
|
125 | 66 | <img src="https://contributors-img.web.app/image?repo=datafold/data-diff" />
|
126 | 67 | </a>
|
127 | 68 |
|
128 |
| -## Technical Explanation |
129 |
| - |
130 |
| -Check out this [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md) of how data-diff works. |
| 69 | +<br> |
131 | 70 |
|
132 | 71 | ## Analytics
|
| 72 | + |
133 | 73 | * [Usage Analytics & Data Privacy](https://github.com/datafold/data-diff/blob/master/docs/usage_analytics.md)
|
134 | 74 |
|
| 75 | +<br> |
| 76 | + |
135 | 77 | ## License
|
136 | 78 |
|
137 | 79 | This project is licensed under the terms of the [MIT License](https://github.com/datafold/data-diff/blob/master/LICENSE).
|
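
The new Getting Started steps in this diff add a `vars.data_diff` block to `dbt_project.yml` before running `dbt run && data-diff --dbt`. Below is a minimal sketch of a pre-flight check for that block. It assumes the project file has already been parsed into a dictionary (for example with a YAML loader), and it treats the two keys shown in the README, `prod_database` and `prod_schema`, as required. The helper name `missing_data_diff_vars` and the "required" rule are illustrative assumptions; they are not part of `data-diff` or dbt.

```python
# Hypothetical pre-flight check; not part of data-diff or dbt.
# Assumption: both keys shown in the README snippet are needed.
REQUIRED_KEYS = ("prod_database", "prod_schema")


def missing_data_diff_vars(project):
    """Return the data_diff keys that are absent or empty in a parsed dbt_project.yml."""
    vars_block = project.get("vars") or {}
    data_diff = vars_block.get("data_diff") or {}
    return [key for key in REQUIRED_KEYS if not data_diff.get(key)]


# A dictionary shaped like the YAML snippet added to the README.
project = {
    "vars": {
        "data_diff": {
            "prod_database": "my_database",
            "prod_schema": "my_default_schema",
        }
    }
}

print(missing_data_diff_vars(project))  # prints []
```

A non-empty result lists the keys still to be filled in, which can be reported before invoking `data-diff --dbt`.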