
Commit 9d0a83c

DOCSP-10333 Updates to partition attribute doc (#41)
* DOCSP-10333 Updates to partition attribute doc
1 parent 9dddfd0 commit 9d0a83c

3 files changed

+299
-112
lines changed

Lines changed: 5 additions & 5 deletions
@@ -1,5 +1,5 @@
1-
.. note::
2-
3-
When specifying the
4-
:datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path`, use the
5-
delimiter specified in :datalakeconf:`~stores.[n].delimiter`.
1+
When specifying the
2+
:datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path`:
3+
4+
- Ensure that the partition attribute type matches the data type to parse.
5+
- Use the delimiter specified in :datalakeconf:`~stores.[n].delimiter`.

source/reference/examples/path-syntax-examples.txt

Lines changed: 223 additions & 45 deletions
@@ -15,78 +15,216 @@ Path Syntax Examples
1515
Overview
1616
--------
1717

18+
When you query documents in your |s3| buckets, the {+adl+}
19+
:datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path`
20+
value allows {+dl+} to map the data inside your document to the
21+
filename of the document.
22+
23+
:datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path`
24+
supports parsing filenames in |s3| buckets into computed fields.
25+
{+data-lake-short+} can add the computed fields to each document
26+
generated from the parsed file. {+data-lake-short+} can target
27+
queries on those computed field values to only those file(s) with
28+
a matching file name. See :ref:`supported-parsing-funcs` and
29+
:ref:`datalake-path-syntax-egs` for more information.
30+
1831
:datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path`
19-
supports parsing filenames into computed fields. {+data-lake-short+}
20-
can add the computed fields to each document generated from the
21-
parsed file. {+data-lake-short+} can target queries on those
22-
computed field values to only those file(s) with a matching file name.
32+
also supports creating partitions using partition attributes in the
33+
path to the file. {+data-lake-short+} can target queries on the
34+
parameter defined in the partition attribute to only those files
35+
that contain the queried value in the filename or partition prefix.
36+
37+
.. example::
38+
39+
Consider the following files in your |s3| bucket:
40+
41+
.. code-block:: sh
42+
:copyable: false
43+
44+
/users/1234.json
45+
/users/5678.json
46+
47+
The |json| document ``1234.json`` contains the following:
48+
49+
.. code-block:: json
50+
:copyable: false
51+
52+
{
53+
"name": "jane doe",
54+
"age": 26,
55+
"userID": "1234"
56+
}
57+
58+
Your {+dl+} configuration for the files in your |s3| bucket defines
59+
the following ``path``:
60+
61+
.. code-block:: sh
62+
:copyable: false
63+
64+
"path": "/users/{age int}/{userID string}"
65+
66+
The following shows how {+dl+} maps a query to the
67+
partitions created from the ``path`` definition:
68+
69+
.. code-block:: none
70+
:copyable: false
71+
72+
db.users.findOne( /users
73+
{ /40
74+
"age": 26 -----------------> /26
75+
"userID": "1234" ----------> /1234.json
76+
} /5678.json
77+
)
78+
79+
If the computed field for the partition attribute already exists
80+
in your document, {+dl+} maps your query to the appropriate file.
81+
If the computed field does not exist, {+dl+} adds the computed
82+
field to the document. For example, if the ``age`` field does not
83+
exist in ``1234.json``, {+dl+} adds the ``age`` field and value to
84+
the document generated from ``1234.json``.
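
The mapping from a ``path`` definition to computed fields can be illustrated with a minimal Python sketch. It is not how {+dl+} implements parsing. The ``parse_path`` helper, the ``CONVERTERS`` table, and the rule that an attribute value never contains ``/`` are assumptions made for illustration, and only the ``int`` and ``string`` types are handled.

```python
import re

# Hypothetical converters for two of the attribute types used above.
CONVERTERS = {"int": int, "string": str}

# Matches one partition attribute such as "{age int}" or "{startDate}".
ATTRIBUTE = re.compile(r"\{(\w+)(?: (\w+))?\}")


def parse_path(template, file_path):
    """Return the computed fields for file_path, or None if it does not match."""
    pattern = ""
    types = {}
    position = 0
    for match in ATTRIBUTE.finditer(template):
        # An attribute without a type is treated as a string.
        name, type_name = match.group(1), match.group(2) or "string"
        types[name] = type_name
        # Copy the static text before the attribute, escaped for regex use.
        pattern += re.escape(template[position:match.start()])
        # Assumption: an attribute value never contains the "/" delimiter.
        pattern += "(?P<%s>[^/]+)" % name
        position = match.end()
    pattern += re.escape(template[position:])
    found = re.fullmatch(pattern, file_path)
    if found is None:
        return None
    return {name: CONVERTERS[types[name]](value)
            for name, value in found.groupdict().items()}
```

For instance, ``parse_path("/users/{age int}/{userID string}.json", "/users/26/1234.json")`` yields ``age`` as the integer 26 and ``userID`` as the string ``"1234"``. The ``.json`` suffix in that template is added for this sketch and does not appear in the ``path`` shown above.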
85+
86+
.. _supported-parsing-funcs:
87+
88+
Supported Parsing Functions
89+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
2390

24-
- You can specify a single parsing function on the filename:
91+
.. list-table::
92+
:widths: 30 70
2593

26-
.. code-block:: none
27-
:copyable: false
94+
* - You can specify a single parsing function on the filename.
95+
- .. code-block:: none
96+
:copyable: false
2897

29-
/path/to/files/{<fieldA> <data-type>}
98+
/path/to/files/{<fieldA> <data-type>}
3099

31-
- You can specify multiple parsing functions on the filename:
100+
* - You can specify multiple parsing functions on the filename.
101+
- .. code-block:: none
102+
:copyable: false
32103

33-
.. code-block:: none
34-
:copyable: false
104+
/path/to/files/{<fieldA> <data-type>}-{<fieldB> <data-type>}
35105

36-
/path/to/files/{<fieldA> <data-type>}-{<fieldB> <data-type>}
106+
* - You can specify parsing functions alongside static strings in the
107+
filename.
108+
- .. code-block:: none
109+
:copyable: false
37110

38-
- You can specify parsing functions alongside static strings
39-
in the filename:
111+
/path/to/files/prefix-{<fieldA> <data-type>}-suffix
40112

41-
.. code-block:: none
42-
:copyable: false
43-
44-
/path/to/files/prefix-{<fieldA> <data-type>}-suffix
113+
* - You can specify a dot (i.e. ``.``) along the path to the filename.
114+
- .. code-block:: none
115+
:copyable: false
45116

46-
- You can specify dot (i.e. ``.``) along the path to the filename:
117+
/path/to/files/{<fieldA>.<fieldB> <data-type>}
47118

48-
.. code-block:: none
49-
:copyable: false
119+
* - You can specify ``ObjectIds`` in the path to the files to create
120+
partitions.
121+
- .. code-block:: none
122+
:copyable: false
50123

51-
/path/to/files/{<fieldA>.<fieldB> <data-type>}
124+
/path/to/files/{objid objectid}
52125

53-
- You can specify ``ObjectIds`` in the path to the files to create
54-
partitions:
126+
* - You can specify a range of ``ObjectIds`` in the path to the files to
127+
create partitions.
128+
- .. code-block:: none
129+
:copyable: false
55130

56-
.. code-block:: none
57-
:copyable: false
131+
/path/to/files/{min(obj) objectid}-{max(obj) objectid}
58132

59-
/path/to/files/{objid objectid}
133+
* - You can specify parsing functions along the path to the filename.
134+
- .. code-block:: none
135+
:copyable: false
60136

61-
- You can specify a range of ``ObjectIds`` in the path to the files
62-
to create partitions:
137+
/path/{<fieldA> <data-type>}/{<fieldB> <data-type>}/{<fieldC> <data-type>}/*
63138

64-
.. code-block:: none
65-
:copyable: false
139+
.. note::
66140

67-
/path/to/files/{min(obj) objectid}-{max(obj) objectid}
141+
.. include:: /includes/fact-path-delimiter.rst
68142

69-
- You can specify parsing functions along the path to the filename:
143+
.. _parse-null-values:
70144

71-
.. code-block:: none
72-
:copyable: false
145+
Parsing Null Values from Filenames
146+
``````````````````````````````````
73147

74-
/path/{<fieldA> <data-type>}/{<fieldB> <data-type>}/{<fieldC> <data-type>}/*
148+
{+dl+} automatically parses an empty string (``""``) in the place of an
149+
attribute in the file path as the BSON null value for all the {+adl+} attribute
150+
types except ``string``. For a ``string``, an empty string could represent either
151+
a BSON null value or a BSON empty string value. {+adl+} does not parse any BSON
152+
value for the ``string`` attribute type. This avoids adding a BSON value with a
153+
conflicting type to documents read from |s3|.
75154

76-
The default data type for partition attributes in the
77-
:datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path` is string.
78-
If you omit the data type, defaults to string. For example, suppose a path similar
79-
to the following:
155+
.. example::
156+
157+
Consider the following |s3| {+data-lake-store+}:
80158

81159
.. code-block:: none
82-
:copyable: false
160+
:emphasize-lines: 3
161+
162+
/records/january/1.json
163+
/records/february/1.json
164+
/records//1.json
165+
166+
For the path ``/records/{month string}/*``, {+dl+} does not add any
167+
computed fields for the ``month`` attribute to documents generated
168+
from the third record in the above store.
169+
170+
.. note::
171+
172+
When writing files to |s3|, write |bson| null values as empty
173+
strings in filenames for all {+adl+} attribute types.
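
The null-handling rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the {+dl+} implementation: ``parse_attribute`` is a hypothetical helper, ``None`` stands in for the BSON null value, and only the ``int`` and ``string`` attribute types are modeled.

```python
def parse_attribute(raw, type_name):
    """Parse one path segment and return (add_field, value)."""
    if raw == "":
        if type_name == "string":
            # Ambiguous between null and an empty string, so no field is added.
            return (False, None)
        # Every other attribute type reads an empty segment as null.
        return (True, None)
    if type_name == "int":
        return (True, int(raw))
    # Treat any other type as a plain string in this sketch.
    return (True, raw)
```

In this sketch, the third record in the store above, ``/records//1.json``, produces ``(False, None)`` for the ``month`` attribute, so no computed field is added to its documents.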
174+
175+
.. _parse-padded-numeric-values:
176+
177+
Parsing Padded Numbers from Filenames
178+
`````````````````````````````````````
179+
180+
File paths can include numeric values that are padded with leading zeros. For
181+
{+dl+} to correctly parse padded numeric values for attribute types like
182+
``int``, ``epoch_millis``, and ``epoch_secs``, specify the number of digits in
183+
the value using regular expressions.
184+
185+
.. example::
186+
187+
Consider an |s3| store with the following files:
188+
189+
.. code-block:: text
190+
:copyable: false
191+
192+
|--users
193+
|--001.json
194+
|--002.json
195+
...
196+
197+
The following ``path`` syntax uses a regular expression to specify the
198+
number of digits in the filename. {+dl+} identifies the portion of the
199+
path that corresponds to the partition attribute and then maps that
200+
partition attribute to a type ``int``:
201+
202+
.. code-block:: sh
203+
:copyable: false
204+
205+
/users/{user_id int:\\d{3}}
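
To show why the digit count matters, the following Python sketch applies an equivalent regular expression to the example filenames. It assumes that the doubled backslash in the path above is string escaping for ``\d``, and it adds an optional ``.json`` suffix so the pattern matches the listed files. ``USER_FILE`` and ``parse_user_id`` are hypothetical names used only for illustration.

```python
import re

# Exactly three digits, mirroring the \d{3} constraint in the path above.
USER_FILE = re.compile(r"/users/(?P<user_id>\d{3})(?:\.json)?")


def parse_user_id(file_path):
    """Return the padded number as an int, or None if the path does not match."""
    found = USER_FILE.fullmatch(file_path)
    if found is None:
        return None
    # int() ignores the leading zeros, so "001" becomes 1.
    return int(found.group("user_id"))
```

Here ``parse_user_id("/users/001.json")`` gives the integer 1, while a name with a different number of digits, such as ``/users/0001.json``, does not match.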
206+
207+
Default Partition Attribute Type
208+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
83209

84-
/employees/{startDate}
210+
The partition attributes in the
211+
:datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path` default to
212+
string if you don't set a different data type.
85213

86-
In the above example, ``startDate`` is interpreted as a string. For more
87-
information on all supported data types, see :ref:`datalake-path-attribute-types`.
214+
.. example::
88215

89-
.. include:: /includes/fact-path-delimiter.rst
216+
Suppose a path similar to the following:
217+
218+
.. code-block:: none
219+
:copyable: false
220+
221+
/employees/{startDate}
222+
223+
``startDate`` is interpreted as a string.
224+
225+
.. seealso::
226+
227+
:ref:`datalake-path-attribute-types`
90228

91229
.. _datalake-path-syntax-egs:
92230

@@ -573,4 +711,44 @@ results in the following collections:
573711
- UltraSoftware
574712
- MegaSoftware
575713

714+
Or, consider a {+data-lake-store+} ``accountingArchive`` with the
715+
following files:
716+
717+
.. code-block:: text
718+
:copyable: false
719+
720+
/orders/MONGODB-invoices-jan.json
721+
/orders/MONGODB-purchaseOrders-jan.json
722+
/orders/MONGODB-invoices-feb.json
723+
...
724+
725+
The following :ref:`datalake-databases-reference` object
726+
generates a dynamic collection name from the file path:
727+
728+
.. code-block:: json
729+
:copyable: false
730+
731+
"databases" : [
732+
{
733+
"name" : "invoices",
734+
"collections" : [
735+
{
736+
"name" : "*",
737+
"dataSources" : [
738+
{
739+
"storeName" : "accountingArchive",
740+
"path" : "/orders/MONGODB-{collectionName()}/{invoiceMonth string}.json/"
741+
}
742+
]
743+
}
744+
]
745+
}
746+
]
747+
748+
When applied to the example filenames, the path
749+
results in the following collections:
750+
751+
- ``invoices``
752+
- ``purchaseOrders``
753+
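
As a rough illustration of how the collection names above follow from the filenames, the Python sketch below extracts the segment between ``MONGODB-`` and the month. It assumes the files follow the ``MONGODB-<collection>-<month>.json`` pattern shown in the listing. ``ORDER_FILE`` and ``collections_for`` are hypothetical names, not part of any {+dl+} API.

```python
import re

# One capture group per dynamic segment of the example filenames.
ORDER_FILE = re.compile(
    r"/orders/MONGODB-(?P<collection>[^-/]+)-(?P<invoiceMonth>[^-/]+)\.json"
)


def collections_for(file_paths):
    """Return the distinct collection names, in first-seen order."""
    names = []
    for file_path in file_paths:
        found = ORDER_FILE.fullmatch(file_path)
        if found and found.group("collection") not in names:
            names.append(found.group("collection"))
    return names
```

Passing the three example filenames to ``collections_for`` returns ``invoices`` and ``purchaseOrders``, matching the list above.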
576754
.. include:: /includes/fact-data-lake-dynamic-collections.rst
