<img src="img/motto.png" alt="drawing" width="450" />
</div>

Data as Code (DaC) is a paradigm of distributing versioned data as code. Think of it as treating your data with the same
care and precision as your software.

!!! warning "Disclaimer"

    At the moment, we're focusing on tabular and batch data, with Python as the primary language.

    But who knows? With enough community interest, we might expand to other areas in the future!

## Consumer - Data Scientist

??? info "Follow along"

    Want to try the examples below on your own machine? It's easy! Just configure `pip` to point to the PyPI registry
    where the example DaC package is stored. Run this command:

    ```shell
    ❯ export PIP_EXTRA_INDEX_URL=https://gitlab.com/api/v4/projects/43746775/packages/pypi/simple
    ```

    And don't forget to create an isolated environment before installing the package:

    ```shell
    ❯ python -m venv venv && . venv/bin/activate
    ```

Imagine the Data Engineers have prepared a DaC package called `dac-example-energy` just for you. Install it like this:

```shell
❯ python -m pip install dac-example-energy
...
Successfully installed ... dac-example-energy-2.0.2 ...
```

Notice the version `2.0.2`? That's the version of your data! Curious to know more about why the version matters? Check
out [this section](#make-releases).

### Grab the data in a snap: `load`

Now, let's grab the data:

```python
>>> from dac_example_energy import load
...
[71160 rows x 5 columns]
```

### Meet the `Schema` class: your data's best friend

The `Schema` class is the backbone of the Data Contract: a promise between the data producer and the data consumer. It
defines the structure, constraints, and expectations for the data. And here's the best part: any data you load is
guaranteed to pass validation.

Let's explore what the `Schema` in the `dac-example-energy` package can do:

```python
>>> from dac_example_energy import Schema
...
```

This `Schema` is built using [`pandera`](https://pandera.readthedocs.io/en/stable/index.html). Here's why it's awesome:

- **Column names are accessible**: No more hardcoded strings! Reference column names directly in your code:
  ```python
  >>> df[Schema.value_in_gwh]
  0        6644.088
  ...
  71159       0.000
  Name: OBS_VALUE, Length: 71160, dtype: float64
  ```
- **Clear expectations**: Know exactly what each column should contain: its type, whether `None` values are allowed,
  which categorical values are admitted, and so on.
- **Self-documenting**: Each column comes with a useful description.
- **Synthetic data generation**: Install `pandera[strategies]` with `pip` and generate test data that is guaranteed to
  pass the schema validation:
  ```python
  >>> Schema.example(size=5)
              siec_name            nrg_bal_name geo  TIME_PERIOD  OBS_VALUE
  0         Natural gas  Gross available energy  AL         1990        0.0
  1  Solid fossil fuels  Gross available energy  AL         1990        0.0
  2  Solid fossil fuels  Gross available energy  AL         1990        0.0
  3  Solid fossil fuels  Gross available energy  AL         1990        0.0
  4  Solid fossil fuels  Gross available energy  AL         1990        0.0
  ```
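To illustrate how synthetic data helps with testing, here is a small self-contained sketch. The names are invented for
the illustration: `make_example` stands in for `Schema.example`, and `to_twh` is a hypothetical function under test.

```python
import pandas as pd

def make_example(size: int) -> pd.DataFrame:
    # Stand-in for `Schema.example`: emit `size` rows of dummy data that
    # satisfy a (hypothetical) contract where OBS_VALUE is a float column.
    return pd.DataFrame({"OBS_VALUE": [0.0] * size})

def to_twh(df: pd.DataFrame) -> pd.Series:
    # Function under test: convert GWh values to TWh.
    return df["OBS_VALUE"] / 1000.0

def test_to_twh_preserves_length() -> None:
    df = make_example(size=5)
    assert len(to_twh(df)) == 5

test_to_twh_preserves_length()
```

Because the synthetic frame is guaranteed to respect the contract, such tests exercise your transformation logic without
ever touching the real (and possibly large) data.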

!!! hint "Example data looks odd?"

    The synthetic data above might not look realistic. Does this mean the `example` method is broken? Not at all! Check
    out [this section](#nice-to-have-schemaexample-method) to learn more.

## Producer - Data Engineer

Data as Code is a paradigm, not a tool: you can implement it however you like, in any programming language. The tools
described below (the template and the `dac` CLI) are just **convenience** tools; they can accelerate your development,
but they are not strictly necessary.

!!! hint "Pro tip: use [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) to define schemas"

    If your dataframe engine (pandas, polars, dask, spark, etc.) is supported by `pandera`, consider using a
    [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define your schema.

### Write the library

#### 1. Start from scratch

!!! warning "This approach requires Python packaging knowledge"

Build your own library while following these guidelines:

##### Public function `load`

Your package must have a public function named `load` at its root. For example, if your package is
`dac-my-awesome-data`, users should be able to do this:

```python
>>> from dac_my_awesome_data import load
>>> df = load()
```

Note that it must be possible to call `load()` without any arguments, and the version of the returned data must
correspond to the version of the package. This means the data will be different at every build.

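As a rough, self-contained sketch of what such a `load` could look like inside a package: all names and the inlined rows
below are illustrative, not the actual `dac-example-energy` internals. A real package would ship the data as a file next
to the module (e.g. read via `importlib.resources`) rather than inline it.

```python
import io
import pandas as pd

# In a real DaC package this CSV would be a data file bundled at build time;
# a couple of inlined rows keep the sketch runnable on its own.
_CSV = """siec_name,geo,TIME_PERIOD,OBS_VALUE
Natural gas,AL,1990,6644.088
Solid fossil fuels,AL,1990,0.0
"""

def load() -> pd.DataFrame:
    """Return the packaged data; note it takes no arguments."""
    return pd.read_csv(io.StringIO(_CSV))

df = load()
print(df.shape)  # (2, 4)
```

Because the data is baked in at build time, the package version and the data version move together.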
##### Data must pass `Schema.validate()`

A public class named `Schema` is available at the root of the package and implements the Data Contract. `Schema` has a
`validate` method which takes data as input, raises an error if the Contract is not fulfilled, and returns the data
otherwise.

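As an illustration only, here is a hand-rolled, pandas-based sketch of such a `Schema`. The column names and constraints
are invented for the example; in practice a `pandera` model would give you `validate` for free.

```python
import pandas as pd

class Schema:
    # Column names exposed as constants, so consumers never hardcode strings.
    geo = "geo"
    value_in_gwh = "OBS_VALUE"

    @classmethod
    def validate(cls, df: pd.DataFrame) -> pd.DataFrame:
        # Raise if the Contract is not fulfilled, return the data otherwise.
        for column in (cls.geo, cls.value_in_gwh):
            if column not in df.columns:
                raise ValueError(f"missing column: {column}")
        if (df[cls.value_in_gwh] < 0).any():
            raise ValueError(f"{cls.value_in_gwh} must be non-negative")
        return df

df = pd.DataFrame({"geo": ["AL"], "OBS_VALUE": [6644.088]})
print(Schema.validate(df) is df)  # True
```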
##### [Nice to have] `Schema.example` method

Provide a method to generate synthetic data that fulfills the Data Contract:

```python
>>> from dac_my_awesome_data import Schema
...
```

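For illustration, a hand-written `example` method might look like the sketch below; the column names and values are
invented, and with `pandera[strategies]` installed a `pandera`-based schema can generate such data automatically.

```python
import pandas as pd

class Schema:
    siec_name = "siec_name"
    obs_value = "OBS_VALUE"

    @classmethod
    def example(cls, size: int = 5) -> pd.DataFrame:
        # Emit `size` rows of dummy data that satisfy the contract.
        return pd.DataFrame({
            cls.siec_name: ["Natural gas"] * size,
            cls.obs_value: [0.0] * size,
        })

print(Schema.example(size=3).shape)  # (3, 2)
```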
#### 2. Use the template

We've created a [Copier](https://copier.readthedocs.io/en/stable/) template to help you get started quickly.

[Check out the template :material-cursor-default-click:](https://gitlab.com/data-as-code/template/src){ .md-button }

#### 3. Use the [`dac`](https://github.com/data-as-code/dac) CLI tool

Our `dac` CLI tool simplifies building Python packages that follow the Data as Code paradigm.

[Explore the `dac` CLI tool :material-cursor-default-click:](https://github.com/data-as-code/dac){ .md-button }

#### Template vs. `dac pack`

Which one should you choose?

|     | Template | `dac pack` |
| :-: | :------: | :--------: |

### Make releases

Choosing the right release version plays a crucial role in the Data as Code paradigm. Semantic versioning is used to
communicate the significance of each change:

|           | When to use                                             |
| :-------: | :------------------------------------------------------ |
| __Patch__ | Fixes in the data without changing its intended content |
| __Minor__ | Non-breaking changes, typically a new batch of data     |
| __Major__ | Breaking changes, such as changes to the Data Contract  |

__Patch and Major releases usually involve a manual process, while Minor releases can be automated.__ Our
[`dac` CLI tool](https://github.com/data-as-code/dac) can help with automated releases: explore the
`dac next-version` command.
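The bump logic behind this scheme fits in a few lines. The sketch below illustrates the semantic-versioning arithmetic
from the table above; it is not the actual implementation of `dac next-version`.

```python
def next_version(current: str, change: str) -> str:
    """Return the next semantic version for a given kind of change."""
    major, minor, patch = (int(part) for part in current.split("."))
    if change == "major":   # breaking change, e.g. a new Data Contract
        return f"{major + 1}.0.0"
    if change == "minor":   # non-breaking change, e.g. a new batch of data
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # patch: data fix only

print(next_version("2.0.2", "minor"))  # 2.1.0
```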

## Why distribute Data as Code?

- **Seamless compatibility**: Data Scientists can ensure their code only runs on compatible data by declaring the Data
  as Code package as a dependency. For example, with `dac-example-energy~=1.0` among the dependencies, it will not be
  possible to install their code together with `dac-example-energy==2.0.0`.
- **Smooth updates**: Data pipelines can receive data updates without breaking, as long as they subscribe to a major
  version of the data.
- **Multiple release streams**: Maintain different versions (e.g., `1.X.Y` and `2.X.Y`) to keep supporting users who
  have not yet migrated to the newest data.
- **Abstracted complexity**: The loading code, data sources, and locations are hidden from consumers, so producers can
  move from local files to a SQL database, cloud storage, or a Kafka topic without consumers noticing or adapting their
  code.
- **No hardcoded column names**: *If column names are included in the `Schema`* (e.g. `Schema.column_1`), consumers can
  avoid hardcoding field names, so renamed source fields won't impact them.
- **Robust testing**: *If the `Schema.example` method is provided*, consumers can write solid unit tests for their
  functions, resulting in a more robust data pipeline.
- **Self-documenting data**: *If data and column descriptions are included in the `Schema`*, the data documents itself
  from a consumer's perspective.
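The compatibility pin in the first bullet relies on pip's compatible-release operator. You can see how such a pin
behaves with the `packaging` library, which pip itself uses for version arithmetic:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Compatible-release pin, as in `dac-example-energy~=1.0`: >=1.0, ==1.*
spec = SpecifierSet("~=1.0")

print(Version("1.4.2") in spec)  # True: new Minor releases are accepted
print(Version("2.0.0") in spec)  # False: breaking Major releases are rejected
```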