@@ -94,9 +94,12 @@ Syntax
       "region": "<aws-region>",
       "filename": "<file-name>",
       "format": {
-        "name": "json|json.gz|bson|bson.gz",
-        "maxFileSize": "<file-size>"
-      }
+        "name": "<file-format>",
+        "maxFileSize": "<file-size>",
+        "maxRowGroupSize": "<row-group-size>",
+        "columnCompression": "<compression-type>"
+      },
+      "errorMode": "stop"|"continue"
     }
   }
 }
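For context, a ``$out`` stage that uses the new fields might look similar to the following sketch. The bucket name, region, filename prefix, and size strings are illustrative placeholders, not documented defaults:

.. code-block:: json

   {
     "$out": {
       "s3": {
         "bucket": "my-example-bucket",
         "region": "us-east-1",
         "filename": "exports/sales-",
         "format": {
           "name": "parquet",
           "maxFileSize": "5GB",
           "maxRowGroupSize": "1GB",
           "columnCompression": "snappy"
         },
         "errorMode": "continue"
       }
     }
   }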
@@ -212,10 +215,19 @@ Fields
     - Format of the file in |s3|. Value can be one of the
       following:
 
-      - ``json``
-      - ``json.gz``
       - ``bson``
       - ``bson.gz``
+      - ``csv``
+      - ``csv.gz``
+      - ``json``
+      - ``json.gz``
+      - ``parquet``
+      - ``tsv``
+      - ``tsv.gz``
+
+      .. seealso::
+
+         :ref:`Limitations <adl-out-stage-limitations>`
 
     - Required
 
@@ -260,6 +272,63 @@ Fields
 
     - Optional
 
+   * - | ``s3``
+       | ``.format``
+       | ``.maxRowGroupSize``
+     - string
+     - Supported for Parquet file format only.
+
+       Maximum row group size to use when writing to a Parquet
+       file. If omitted, defaults to ``1 GB`` or the value of
+       ``s3.format.maxFileSize``, whichever is smaller.
+     - Optional
+
+   * - | ``s3``
+       | ``.format``
+       | ``.columnCompression``
+     - string
+     - Supported for Parquet file format only.
+
+       Compression type to apply to the data inside a Parquet
+       file when formatting it. Valid values are:
+
+       - ``gzip``
+       - ``snappy``
+       - ``uncompressed``
+
+       If omitted, defaults to ``snappy``.
+
+       .. seealso::
+
+          :ref:`data-lake-data-formats`
+
+     - Optional
+
+   * - ``errorMode``
+     - enum
+     - Specifies how {+adl+} proceeds if it encounters an error
+       while processing a document. For example, if {+adl+}
+       encounters an array in a document while writing to a CSV
+       file, {+adl+} uses this value to determine whether to
+       skip the document and continue processing the remaining
+       documents. Valid values are:
+
+       - ``continue`` to skip the document and continue
+         processing the remaining documents. {+adl+} also writes
+         the document that caused the error to an error file.
+
+         .. seealso::
+
+            :ref:`Errors <adl-out-stage-errors>`
+
+       - ``stop`` to stop at that point and not process the
+         remaining documents.
+
+       If omitted, defaults to ``continue``.
+
+     - Optional
+
 .. tab:: Atlas Cluster
    :tabid: atlas
 
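To make ``errorMode`` concrete, the sketch below (all names and values are illustrative placeholders) writes CSV output and skips, rather than aborts on, documents that contain arrays or other types CSV cannot represent:

.. code-block:: json

   {
     "$out": {
       "s3": {
         "bucket": "my-example-bucket",
         "region": "us-east-1",
         "filename": "reports/daily-",
         "format": {
           "name": "csv",
           "maxFileSize": "1GB"
         },
         "errorMode": "continue"
       }
     }
   }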
@@ -603,6 +672,8 @@ Examples
 
 **Limitations**
 
+*String Data Type*
+
 {+dl+} interprets empty strings (``""``) as ``null`` values when
 parsing filenames. If you want {+dl+} to generate parseable
 filenames, wrap the field references that could have ``null``
@@ -642,6 +713,60 @@ Examples
    }
 }
 
+*CSV and TSV File Format*
+
+When writing to CSV or TSV format, {+adl+} does not support the
+following data types in the documents:
+
+- Arrays
+- DB pointer
+- JavaScript
+- JavaScript code with scope
+- Minimum or maximum key data type
+
+In a CSV file, {+adl+} represents nested documents using the dot
+(``.``) notation. For example, {+adl+} writes
+``{ x: { a: 1, b: 2 } }`` as the following in the CSV file:
+
+.. code-block:: shell
+   :copyable: false
+
+   x.a,x.b
+   1,2
+
+{+adl+} represents all other data types as strings. Therefore,
+the data types read back from the CSV file may not match the
+data types in the original |bson| documents from which the data
+was written.
+
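For example, a hypothetical document with a decimal and a date field would be flattened to plain text like the following (the exact textual rendering of each value is illustrative):

.. code-block:: shell
   :copyable: false

   # { total: NumberDecimal("9.99"), when: ISODate("2021-01-01T00:00:00Z") }
   total,when
   9.99,2021-01-01T00:00:00Z

Reading this file back yields strings, so any numeric or date semantics must be re-established with explicit conversions.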
+*Parquet File Format*
+
+For Parquet, {+adl+} reads back fields with null or undefined
+values as missing because Parquet doesn't distinguish between
+null or undefined values and missing values. Although {+adl+}
+supports all data types, for |bson| data types that do not have
+a direct equivalent in Parquet, such as JavaScript and regular
+expressions, it:
+
+- Chooses a representation that allows the resulting Parquet
+  file to be read back using a non-MongoDB tool.
+- Stores a MongoDB schema in the Parquet file's key/value
+  metadata so that {+adl+} can reconstruct the original |bson|
+  document with the correct data types if the Parquet file is
+  read back by {+adl+}.
+
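A minimal sketch of the null-versus-missing round trip described above, using hypothetical documents:

.. code-block:: shell
   :copyable: false

   # documents written to Parquet
   { "a": 1, "b": null }
   { "a": 2 }

   # both documents read back with "b" missing
   { "a": 1 }
   { "a": 2 }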
+.. tab:: Atlas Cluster
+   :tabid: atlas
+
+.. _adl-out-stage-errors:
+
+.. tabs::
+   :hidden:
+
+   .. tab:: S3
+      :tabid: s3
+
 **Errors**
 
 - If the filename is not of type string, {+dl+} writes documents
@@ -653,13 +778,74 @@ Examples
 
   .. example::
 
-     - ``s3://<bucket-name>/atlas-data-lake-{<CORRELATION_ID>}/$out-error-docs/1.json``
-     - ``s3://<bucket-name>/atlas-data-lake-{<CORRELATION_ID>}/$out-error-docs/2.json``
+     - ``s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-docs/1.json``
+     - ``s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-docs/2.json``
 
   {+dl+} returns an error message that specifies the number of
   documents that had invalid filenames and the directory where
   these documents were written.
 
+- If {+dl+} encounters an error while processing a document,
+  it writes the document to a special error file in your
+  bucket:
+
+  .. code-block:: sh
+     :copyable: false
+
+     s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-docs/{<n>}.json
+
+  where ``n`` is the index of the document being written.
+
+  {+dl+} also writes the error message for each document to an
+  error index file:
+
+  .. code-block:: sh
+     :copyable: false
+
+     s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-index/{<i>}.json
+
+  where ``i`` begins at ``1``. {+dl+} writes error messages to
+  the file until the file reaches the ``maxFileSize`` limit;
+  {+dl+} then increments ``i`` and continues writing any further
+  error messages to the new file.
+
+  The error messages in the error index file look similar to the
+  following:
+
+  .. example::
+
+     .. code-block:: json
+        :copyable: false
+
+        {
+          "n": 1234,
+          "error": "field \"foo\" is of type array, which is not supported for CSV"
+        }
+
+  where ``n`` is the index of the document that caused the error.
+
+  {+dl+} also creates an error summary file after running the
+  entire query:
+
+  .. code-block:: sh
+     :copyable: false
+
+     s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-summary.json
+
+  The summary file contains a single document for each type of
+  error and a count of the number of documents that caused that
+  type of error.
+
+  .. example::
+
+     .. code-block:: json
+        :copyable: false
+
+        {
+          "errorType": "field is of type array, which is not supported for CSV",
+          "count": 10
+        }
+
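Taken together, a single query that hit serialization errors might leave error artifacts laid out like the following; the paths follow the patterns above, and the file counts are illustrative:

.. code-block:: sh
   :copyable: false

   s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-docs/1.json
   s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-docs/2.json
   s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-index/1.json
   s3://{<bucket-name>}/atlas-data-lake-{<correlation-id>}/out-error-summary.json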
 .. tab:: Atlas Cluster
    :tabid: atlas
 