@@ -15,78 +15,216 @@ Path Syntax Examples
15
15
Overview
16
16
--------
17
17
18
+ When you query documents in your |s3| buckets, the {+adl+}
19
+ :datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path`
20
+ value allows {+dl+} to map the data inside your document to the
21
+ filename of the document.
22
+
23
+ :datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path`
24
+ supports parsing filenames in |s3| buckets into computed fields.
25
+ {+data-lake-short+} can add the computed fields to each document
26
+ generated from the parsed file. {+data-lake-short+} can target
27
+ queries on those computed field values to only those file(s) with
28
+ a matching file name. See :ref:`supported-parsing-funcs` and
29
+ :ref:`datalake-path-syntax-egs` for more information.
30
+
18
31
:datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path`
19
- supports parsing filenames into computed fields. {+data-lake-short+}
20
- can add the computed fields to each document generated from the
21
- parsed file. {+data-lake-short+} can target queries on those
22
- computed field values to only those file(s) with a matching file name.
32
+ also supports creating partitions using partition attributes in the
33
+ path to the file. {+data-lake-short+} can target queries on the
34
+ parameter defined in the partition attribute to only those files
35
+ whose filename or partition prefix contains the queried value.
36
+
37
+ .. example::
38
+
39
+ Consider the following files in your |s3| bucket:
40
+
41
+ .. code-block:: sh
42
+ :copyable: false
43
+
44
+ /users/1234.json
45
+ /users/5678.json
46
+
47
+ The |json| document ``1234.json`` contains the following:
48
+
49
+ .. code-block:: json
50
+ :copyable: false
51
+
52
+ {
53
+ "name": "jane doe",
54
+ "age": 26,
55
+ "userID": "1234"
56
+ }
57
+
58
+ Your {+dl+} configuration for the files in your |s3| bucket defines
59
+ the following ``path``:
60
+
61
+ .. code-block:: sh
62
+ :copyable: false
63
+
64
+ "path": "/users/{age int}/{userID string}"
65
+
66
+ The following shows how {+dl+} maps a query to the
67
+ partitions created from the ``path`` definition:
68
+
69
+ .. code-block:: none
70
+ :copyable: false
71
+
72
+ db.users.findOne( /users
73
+ { /40
74
+ "age": 26 -----------------> /26
75
+ "userID": "1234" ----------> /1234.json
76
+ } /5678.json
77
+ )
78
+
79
+ If the computed field for the partition attribute already exists
80
+ in your document, {+dl+} maps your query to the appropriate file.
81
+ If the computed field does not exist, {+dl+} adds the computed
82
+ field to the document. For example, if the ``age`` field does not
83
+ exist in ``1234.json``, {+dl+} adds the ``age`` field and value to
84
+ ``1234.json``.
85
+
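The mapping from a ``path`` definition to computed fields can be sketched in Python. This is only an illustration, not how {+dl+} is implemented: the helper names ``template_to_regex`` and ``computed_fields`` are hypothetical, only the ``int`` and ``string`` types are modeled, and the object key ``/users/26/1234.json`` is assumed from the diagram above. The sketch also assumes that the ``.json`` extension is not part of the attribute value.

```python
import re

# Hypothetical converters for two attribute types; the real list is longer.
CONVERTERS = {"int": int, "string": str}

# Matches "{name type}" or "{name}" inside a path template.
ATTRIBUTE = re.compile(r"\{(\w+)(?: (\w+))?\}")


def template_to_regex(template):
    """Turn '/users/{age int}/{userID string}' into a regex with named groups."""
    types = {}
    pattern = ""
    pos = 0
    for match in ATTRIBUTE.finditer(template):
        name = match.group(1)
        # An attribute without an explicit type is treated as a string.
        types[name] = match.group(2) or "string"
        pattern += re.escape(template[pos:match.start()])
        pattern += f"(?P<{name}>[^/]+)"
        pos = match.end()
    pattern += re.escape(template[pos:])
    return re.compile(f"^{pattern}$"), types


def computed_fields(template, key):
    """Return the computed fields for an object key, or None if it does not match."""
    regex, types = template_to_regex(template)
    # Assumption: the file extension is dropped before matching.
    stem = key[:-5] if key.endswith(".json") else key
    match = regex.match(stem)
    if match is None:
        return None
    return {
        name: CONVERTERS[types[name]](value)
        for name, value in match.groupdict().items()
    }


fields = computed_fields("/users/{age int}/{userID string}", "/users/26/1234.json")
print(fields)
```

A query on ``age`` and ``userID`` could then be compared against these values to decide whether a file needs to be read at all, which is the targeting behavior described above.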
86
+ .. _supported-parsing-funcs:
87
+
88
+ Supported Parsing Functions
89
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
23
90
24
- - You can specify a single parsing function on the filename:
91
+ .. list-table::
92
+ :widths: 30 70
25
93
26
- .. code-block:: none
27
- :copyable: false
94
+ * - You can specify a single parsing function on the filename.
95
+ - .. code-block:: none
96
+ :copyable: false
28
97
29
- /path/to/files/{<fieldA> <data-type>}
98
+ /path/to/files/{<fieldA> <data-type>}
30
99
31
- - You can specify multiple parsing functions on the filename:
100
+ * - You can specify multiple parsing functions on the filename.
101
+ - .. code-block:: none
102
+ :copyable: false
32
103
33
- .. code-block:: none
34
- :copyable: false
104
+ /path/to/files/{<fieldA> <data-type>}-{<fieldB> <data-type>}
35
105
36
- /path/to/files/{<fieldA> <data-type>}-{<fieldB> <data-type>}
106
+ * - You can specify parsing functions alongside static strings in the
107
+ filename.
108
+ - .. code-block:: none
109
+ :copyable: false
37
110
38
- - You can specify parsing functions alongside static strings
39
- in the filename:
111
+ /path/to/files/prefix-{<fieldA> <data-type>}-suffix
40
112
41
- .. code-block:: none
42
- :copyable: false
43
-
44
- /path/to/files/prefix-{<fieldA> <data-type>}-suffix
113
+ * - You can specify dot (i.e. ``.``) along the path to the filename.
114
+ - .. code-block:: none
115
+ :copyable: false
45
116
46
- - You can specify dot (i.e. ``.``) along the path to the filename:
117
+ /path/to/files/{<fieldA>.<fieldB> <data-type>}
47
118
48
- .. code-block:: none
49
- :copyable: false
119
+ * - You can specify ``ObjectIds`` in the path to the files to create
120
+ partitions.
121
+ - .. code-block:: none
122
+ :copyable: false
50
123
51
- /path/to/files/{<fieldA>.<fieldB> <data-type> }
124
+ /path/to/files/{objid objectid}
52
125
53
- - You can specify ``ObjectIds`` in the path to the files to create
54
- partitions:
126
+ * - You can specify a range of ``ObjectIds`` in the path to the files to
127
+ create partitions.
128
+ - .. code-block:: none
129
+ :copyable: false
55
130
56
- .. code-block:: none
57
- :copyable: false
131
+ /path/to/files/{min(obj) objectid}-{max(obj) objectid}
58
132
59
- /path/to/files/{objid objectid}
133
+ * - You can specify parsing functions along the path to the filename.
134
+ - .. code-block:: none
135
+ :copyable: false
60
136
61
- - You can specify a range of ``ObjectIds`` in the path to the files
62
- to create partitions:
137
+ /path/{<fieldA> <data-type>}/{<fieldB> <data-type>}/{<fieldC> <data-type>}/*
63
138
64
- .. code-block:: none
65
- :copyable: false
139
+ .. note::
66
140
67
- /path/to/files/{min(obj) objectid}-{max(obj) objectid}
141
+ .. include:: /includes/fact-path-delimiter.rst
68
142
69
- - You can specify parsing functions along the path to the filename :
143
+ .. _parse-null-values:
70
144
71
- .. code-block:: none
72
- :copyable: false
145
+ Parsing Null Values from Filenames
146
+ ``````````````````````````````````
73
147
74
- /path/{<fieldA> <data-type>}/{<fieldB> <data-type>}/{<fieldC> <data-type>}/*
148
+ {+dl+} automatically parses an empty string (``""``) in the place of an
149
+ attribute in the file path as the BSON null value for all the {+adl+} attribute
150
+ types except ``string``. For a ``string`` attribute, an empty string could represent
151
+ either a BSON null value or a BSON empty string value, so {+adl+} does not parse any BSON
152
+ value for the ``string`` attribute type. This avoids adding a BSON value with a
153
+ conflicting type to documents read from |s3|.
75
154
76
- The default data type for partition attributes in the
77
- :datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path` is string.
78
- If you omit the data type, defaults to string. For example, suppose a path similar
79
- to the following:
155
+ .. example::
156
+
157
+ Consider the following |s3| {+data-lake-store+}:
80
158
81
159
.. code-block:: none
82
- :copyable: false
160
+ :emphasize-lines: 3
161
+
162
+ /records/january/1.json
163
+ /records/february/1.json
164
+ /records//1.json
165
+
166
+ For the path ``/records/{month string}/*``, {+dl+} does not add any
167
+ computed fields for the ``month`` attribute to documents generated
168
+ from the third record in the above store.
169
+
170
+ .. note::
171
+
172
+ When writing files to |s3|, write |bson| null values as empty
173
+ strings in filenames for all {+adl+} attribute types.
174
+
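The null-handling rule described above can be sketched as follows. This is a hedged illustration rather than {+dl+} code: ``parse_attribute``, ``add_computed_field``, and the ``SKIP`` sentinel are hypothetical names, the ``{"id": 1}`` document is made up, and only the ``int`` and ``string`` types are modeled.

```python
# Sentinel meaning "add no computed field at all" (hypothetical name).
SKIP = object()


def parse_attribute(raw, type_name):
    """Model how an attribute value taken from a file path might be interpreted."""
    if raw == "":
        # Empty string: null for every type except string, where it is ambiguous.
        return SKIP if type_name == "string" else None
    return int(raw) if type_name == "int" else raw


def add_computed_field(document, name, raw, type_name):
    """Return a copy of the document with the computed field added when one applies."""
    value = parse_attribute(raw, type_name)
    result = dict(document)
    if value is not SKIP:
        result[name] = value
    return result


# /records//1.json with the path /records/{month string}/*
print(add_computed_field({"id": 1}, "month", "", "string"))
# /records/january/1.json with the same path
print(add_computed_field({"id": 1}, "month", "january", "string"))
```

The first call leaves the document unchanged, mirroring the third record in the example store. The second call adds ``month`` as a string.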
175
+ .. _parse-padded-numeric-values:
176
+
177
+ Parsing Padded Numbers from Filenames
178
+ `````````````````````````````````````
179
+
180
+ File paths can include numeric values that are padded with leading zeros. For
181
+ {+dl+} to correctly parse padded numeric values for attribute types like
182
+ ``int``, ``epoch_millis``, and ``epoch_secs``, specify the number of digits in
183
+ the value using regular expressions.
184
+
185
+ .. example::
186
+
187
+ Consider an |s3| store with the following files:
188
+
189
+ .. code-block:: text
190
+ :copyable: false
191
+
192
+ |--users
193
+ |--001.json
194
+ |--002.json
195
+ ...
196
+
197
+ The following ``path`` syntax uses a regular expression to specify the
198
+ number of digits in the filename. {+dl+} identifies the portion of the
199
+ path that corresponds to the partition attribute and then maps that
200
+ partition attribute to a type ``int``:
201
+
202
+ .. code-block:: sh
203
+ :copyable: false
204
+
205
+ /users/{user_id int:\\d{3}}
206
+
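As a rough illustration of why the digit count matters, the following Python sketch applies a ``\d{3}`` expression to filenames like the ones above. It uses Python's ``re`` module as a stand-in and assumes the ``.json`` extension is matched literally, so it only approximates how {+dl+} evaluates the expression. The name ``parse_user_id`` is hypothetical.

```python
import re

# Stand-in for /users/{user_id int:\d{3}}: exactly three digits, then ".json".
PADDED_ID = re.compile(r"^/users/(?P<user_id>\d{3})\.json$")


def parse_user_id(key):
    """Return the integer user_id for a matching key, or None otherwise."""
    match = PADDED_ID.match(key)
    # int() discards the leading zeros once the digits are isolated.
    return int(match.group("user_id")) if match else None


print(parse_user_id("/users/001.json"))   # three digits: matches
print(parse_user_id("/users/0001.json"))  # four digits: no match
```

Fixing the width in the expression keeps the padded value unambiguous: ``001`` is read as the integer ``1``, while a key with a different number of digits is not treated as a match.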
207
+ Default Partition Attribute Type
208
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
83
209
84
- /employees/{startDate}
210
+ Partition attributes in the
211
+ :datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path` default to
212
+ string if you don't set a different data type.
85
213
86
- In the above example, ``startDate`` is interpreted as a string. For more
87
- information on all supported data types, see :ref:`datalake-path-attribute-types`.
214
+ .. example::
88
215
89
- .. include:: /includes/fact-path-delimiter.rst
216
+ Suppose a path similar to the following:
217
+
218
+ .. code-block:: none
219
+ :copyable: false
220
+
221
+ /employees/{startDate}
222
+
223
+ ``startDate`` is interpreted as a string.
224
+
225
+ .. seealso::
226
+
227
+ :ref:`datalake-path-attribute-types`
90
228
91
229
.. _datalake-path-syntax-egs:
92
230
@@ -573,4 +711,44 @@ results in the following collections:
573
711
- UltraSoftware
574
712
- MegaSoftware
575
713
714
+ Or, consider a {+data-lake-store+} ``accountingArchive`` with the
715
+ following files:
716
+
717
+ .. code-block:: text
718
+ :copyable: false
719
+
720
+ /orders/MONGODB-invoices-jan.json
721
+ /orders/MONGODB-purchaseOrders-jan.json
722
+ /orders/MONGODB-invoices-feb.json
723
+ ...
724
+
725
+ The following :ref:`datalake-databases-reference` object
726
+ generates a dynamic collection name from the file path:
727
+
728
+ .. code-block:: json
729
+ :copyable: false
730
+
731
+ "databases" : [
732
+ {
733
+ "name" : "invoices",
734
+ "collections" : [
735
+ {
736
+ "name" : "*",
737
+ "dataSources" : [
738
+ {
739
+ "storeName" : "accountingArchive",
740
+ "path" : "/orders/MONGODB-{collectionName()}-{invoiceMonth string}.json/"
741
+ }
742
+ ]
743
+ }
744
+ ]
745
+ }
746
+ ]
747
+
748
+ When applied to the example filenames, the path
749
+ results in the following collections:
750
+
751
+ - ``invoices``
752
+ - ``purchaseOrders``
753
+
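The way collection names fall out of the example filenames can be sketched in Python. This is an approximation for illustration only: the regular expression assumes the collection name and the month are separated by a hyphen, as in the filenames listed above, and ``group_by_collection`` is a hypothetical helper rather than anything {+dl+} exposes.

```python
import re

# Assumed shape of the example filenames: /orders/MONGODB-<collection>-<month>.json
FILENAME = re.compile(
    r"^/orders/MONGODB-(?P<collection>[^-/]+)-(?P<invoiceMonth>[^-/.]+)\.json$"
)

keys = [
    "/orders/MONGODB-invoices-jan.json",
    "/orders/MONGODB-purchaseOrders-jan.json",
    "/orders/MONGODB-invoices-feb.json",
]


def group_by_collection(object_keys):
    """Map each derived collection name to the invoiceMonth values found for it."""
    collections = {}
    for key in object_keys:
        match = FILENAME.match(key)
        if match:
            collections.setdefault(match["collection"], []).append(match["invoiceMonth"])
    return collections


print(group_by_collection(keys))
```

Each distinct value captured in the collection position becomes one collection, and the month captured from the same filename plays the role of the ``invoiceMonth`` computed field.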
576
754
.. include:: /includes/fact-data-lake-dynamic-collections.rst