
Commit 0d805e6

kanchana-mongodb, Chris Cho, and David Golub authored
DOCSP-13598 out for Formats with Schema on ADL (#102)

* DOCSP-13598 out for Formats with Schema on ADL
* DOCSP-13598 updates for feedback
* Apply suggestions from code review
* DOCSP-13598 updates for review feedback
* Apply suggestions from code review

Co-authored-by: Chris Cho <[email protected]>
Co-authored-by: David Golub <[email protected]>

1 parent 85d9842 commit 0d805e6

File tree: 1 file changed (+193 -7 lines)


source/reference/pipeline/out.txt

@@ -94,9 +94,12 @@ Syntax
       "region": "<aws-region>",
       "filename": "<file-name>",
       "format": {
-        "name": "json|json.gz|bson|bson.gz",
-        "maxFileSize": "<file-size>"
-      }
+        "name": "<file-format>",
+        "maxFileSize": "<file-size>",
+        "maxRowGroupSize": "<row-group-size>",
+        "columnCompression": "<compression-type>"
+      },
+      "errorMode": "stop"|"continue"
     }
   }
 }
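For orientation, here is a minimal sketch of what a complete stage using the revised syntax might look like. The ``bucket`` field and all literal values are illustrative assumptions, not part of this diff; only ``region``, ``filename``, ``format``, and ``errorMode`` appear in the hunk above:

.. code-block:: json
   :copyable: false

   {
     "$out": {
       "s3": {
         "bucket": "my-example-bucket",
         "region": "us-east-1",
         "filename": "orders-export/",
         "format": {
           "name": "parquet",
           "maxFileSize": "10GB"
         },
         "errorMode": "continue"
       }
     }
   }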
@@ -212,10 +215,19 @@ Fields
     - Format of the file in |s3|. Value can be one of the
       following:

-      - ``json``
-      - ``json.gz``
       - ``bson``
       - ``bson.gz``
+      - ``csv``
+      - ``csv.gz``
+      - ``json``
+      - ``json.gz``
+      - ``parquet``
+      - ``tsv``
+      - ``tsv.gz``
+
+      .. seealso::
+
+         :ref:`Limitations <adl-out-stage-limitations>`

     - Required

@@ -260,6 +272,63 @@ Fields

     - Optional

+   * - | ``s3``
+       | ``.format``
+       | ``.maxRowGroupSize``
+     - string
+     - Supported for the Parquet file format only.
+
+       Maximum row group size to use when writing to a Parquet
+       file. If omitted, defaults to ``1 GB`` or the value of
+       ``s3.format.maxFileSize``, whichever is smaller.
+     - Optional
+
+   * - | ``s3``
+       | ``.format``
+       | ``.columnCompression``
+     - string
+     - Supported for the Parquet file format only.
+
+       Compression type to apply to the data inside the Parquet
+       file when formatting it. Valid values are:
+
+       - ``gzip``
+       - ``snappy``
+       - ``uncompressed``
+
+       If omitted, defaults to ``snappy``.
+
+       .. seealso::
+
+          :ref:`data-lake-data-formats`
+
+     - Optional
+
+   * - ``errorMode``
+     - enum
+     - Specifies how {+adl+} should proceed if it encounters an
+       error while processing a document. For example, if {+adl+}
+       encounters an array in a document while writing to a CSV
+       file, it uses this value to determine whether to skip the
+       document and process the remaining documents. Valid values
+       are:
+
+       - ``continue`` to skip the document and continue
+         processing the remaining documents. {+adl+} also writes
+         the document that caused the error to an error file.
+
+         .. seealso::
+
+            :ref:`Errors <adl-out-stage-errors>`
+
+       - ``stop`` to stop at that point and not process the
+         remaining documents.
+
+       If omitted, defaults to ``continue``.
+
+     - Optional
+
 .. tab:: Atlas Cluster
    :tabid: atlas
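To see how the three new ``format`` fields and ``errorMode`` compose, a hypothetical Parquet configuration might look like the following; the specific size strings are assumptions, not values taken from this diff:

.. code-block:: json
   :copyable: false

   {
     "format": {
       "name": "parquet",
       "maxFileSize": "10GB",
       "maxRowGroupSize": "512MB",
       "columnCompression": "gzip"
     },
     "errorMode": "stop"
   }

With these settings, row groups would be capped at 512 MB (below the documented 1 GB default), column data inside each group would be gzip-compressed rather than the default snappy, and the first document that fails to convert would halt the query.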

@@ -603,6 +672,8 @@ Examples

 **Limitations**

+*String Data Type*
+
 {+dl+} interprets empty strings (``""``) as ``null`` values when
 parsing filenames. If you want {+dl+} to generate parseable
 filenames, wrap the field references that could have ``null``
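The hunk ends mid-sentence, but the guidance it introduces can be sketched. Assuming ``filename`` accepts an aggregation expression here, as the surrounding text implies, a field reference that could be ``null`` might be wrapped with the standard ``$ifNull`` operator; ``$saleDate`` and the fallback string are hypothetical:

.. code-block:: json
   :copyable: false

   {
     "filename": {
       "$concat": [
         "exports/",
         { "$ifNull": [ { "$toString": "$saleDate" }, "unknown-date" ] },
         ".json"
       ]
     }
   }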
@@ -642,6 +713,60 @@ Examples
    }
 }

+*CSV and TSV File Format*
+
+When writing to CSV or TSV format, {+adl+} does not support the
+following data types in the documents:
+
+- Arrays
+- DB pointer
+- JavaScript
+- JavaScript code with scope
+- Minimum or maximum key data type
+
+In a CSV file, {+adl+} represents nested documents using dot
+(``.``) notation. For example, {+adl+} writes
+``{ x: { a: 1, b: 2 } }`` as the following in the CSV file:
+
+.. code-block:: shell
+   :copyable: false
+
+   x.a,x.b
+   1,2
+
+{+adl+} represents all other data types as strings. Therefore,
+the data types read back from the CSV file into MongoDB may not
+be the same as the data types in the original |bson| documents
+from which the data was written.
+
+*Parquet File Format*
+
+For Parquet, {+adl+} reads back fields with null or undefined
+values as missing, because Parquet doesn't distinguish between
+null or undefined values and missing values. Although {+adl+}
+supports all data types, for |bson| data types that do not have
+a direct equivalent in Parquet, such as JavaScript and regular
+expressions, it:
+
+- Chooses a representation that allows the resulting Parquet
+  file to be read back using a non-MongoDB tool.
+- Stores a MongoDB schema in the Parquet file's key/value
+  metadata so that {+adl+} can reconstruct the original |bson|
+  document with the correct data types if the Parquet file is
+  read back by {+adl+}.
+
+.. tab:: Atlas Cluster
+   :tabid: atlas
+
+.. _adl-out-stage-errors:
+
+.. tabs::
+   :hidden:
+
+   .. tab:: S3
+      :tabid: s3
+
 **Errors**

 - If the filename is not of type string, {+dl+} writes documents
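To make the CSV limitation and ``errorMode`` behavior above concrete: a document like the following would trigger the array error shown later in the error-index example (the field name ``foo`` mirrors that example; the document itself is hypothetical):

.. code-block:: json
   :copyable: false

   { "_id": 1, "foo": [ 1, 2, 3 ], "bar": "ok" }

With ``errorMode`` set to ``continue``, {+adl+} would skip this document, write it to the error file, and keep processing; with ``stop``, processing would halt at this document.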
@@ -653,13 +778,74 @@ Examples

   .. example::

-     - ``s3://<bucket-name>/atlas-data-lake-{<CORRELATION_ID>}/$out-error-docs/1.json``
-     - ``s3://<bucket-name>/atlas-data-lake-{<CORRELATION_ID>}/$out-error-docs/2.json``
+     - ``s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-docs/1.json``
+     - ``s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-docs/2.json``

   {+dl+} returns an error message that specifies the number of
   documents that had invalid filenames and the directory where
   these documents were written.

+- If {+dl+} encounters an error while processing a document, it
+  writes the document to a special error file in your bucket:
+
+  .. code-block:: sh
+     :copyable: false
+
+     s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-docs/{<n>}.json
+
+  where ``n`` is the index of the document being written.
+
+  {+dl+} also writes the error message for each document to an
+  error index file:
+
+  .. code-block:: sh
+     :copyable: false
+
+     s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-index/{<i>}.json
+
+  where ``i`` begins at ``1``. {+dl+} writes error messages to
+  the file until the file reaches ``maxFileSize``; then {+dl+}
+  increments the value of ``i`` and continues writing any further
+  error messages to the new file.
+
+  The error messages in the error index file look similar to the
+  following:
+
+  .. example::
+
+     .. code-block:: json
+        :copyable: false
+
+        {
+          "n": 1234,
+          "error": "field \"foo\" is of type array, which is not supported for CSV"
+        }
+
+  where ``n`` is the index of the document that caused the error.
+
+  {+dl+} also creates an error summary file after running the
+  entire query:
+
+  .. code-block:: sh
+     :copyable: false
+
+     s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-summary.json
+
+  The summary file contains a single document for each type of
+  error and a count of the number of documents that caused that
+  type of error.
+
+  .. example::
+
+     .. code-block:: json
+        :copyable: false
+
+        {
+          "errorType": "field is of type array, which is not supported for CSV",
+          "count": 10
+        }
+
 .. tab:: Atlas Cluster
    :tabid: atlas

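Since the summary file contains one document per error type, a run that hit two distinct error types would presumably yield two documents. The following sketch assumes newline-delimited JSON documents in the summary file; the second error string is hypothetical:

.. code-block:: json
   :copyable: false

   { "errorType": "field is of type array, which is not supported for CSV", "count": 10 }
   { "errorType": "field is of type javascript, which is not supported for CSV", "count": 2 }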