Commit 1c7f0ab
[SPARK-3339][SQL] Support for skipping json lines that fail to parse
This PR aims to provide a way to skip/query corrupt JSON records. To do so, we introduce an internal column to hold corrupt records (the default name is `_corrupt_record`. This name can be changed by setting the value of `spark.sql.columnNameOfCorruptRecord`). When there is a parsing error, we will put the corrupt record in its unparsed format to the internal column. Users can skip/query this column through SQL.
* To query those corrupt records
```
-- For Hive parser
SELECT `_corrupt_record`
FROM jsonTable
WHERE `_corrupt_record` IS NOT NULL
-- For our SQL parser
SELECT _corrupt_record
FROM jsonTable
WHERE _corrupt_record IS NOT NULL
```
* To skip corrupt records and query regular records
```
-- For Hive parser
SELECT field1, field2
FROM jsonTable
WHERE `_corrupt_record` IS NULL
-- For our SQL parser
SELECT field1, field2
FROM jsonTable
WHERE _corrupt_record IS NULL
```
Generally, it is not recommended to change the name of the internal column. If the name has to be changed to avoid possible name conflicts, you can use `sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)` or `sqlContext.sql(SET spark.sql.columnNameOfCorruptRecord=<new column name>)`.
Author: Yin Huai <[email protected]>
Closes #2680 from yhuai/corruptJsonRecord and squashes the following commits:
4c9828e [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
309616a [Yin Huai] Change the default name of corrupt record to "_corrupt_record".
b4a3632 [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
9375ae9 [Yin Huai] Set the column name of corrupt json record back to the default one after the unit test.
ee584c0 [Yin Huai] Provide a way to query corrupt json records as unparsed strings.1 parent 1faa113 commit 1c7f0ab
File tree
6 files changed
+116
-19
lines changed- sql/core/src
- main/scala/org/apache/spark/sql
- api/java
- json
- test/scala/org/apache/spark/sql/json
6 files changed
+116
-19
lines changedLines changed: 4 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
35 | 35 | | |
36 | 36 | | |
37 | 37 | | |
| 38 | + | |
38 | 39 | | |
39 | 40 | | |
40 | 41 | | |
| |||
131 | 132 | | |
132 | 133 | | |
133 | 134 | | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
134 | 138 | | |
135 | 139 | | |
136 | 140 | | |
| |||
Lines changed: 10 additions & 4 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
195 | 195 | | |
196 | 196 | | |
197 | 197 | | |
| 198 | + | |
198 | 199 | | |
199 | | - | |
200 | | - | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
201 | 204 | | |
202 | 205 | | |
203 | 206 | | |
| |||
206 | 209 | | |
207 | 210 | | |
208 | 211 | | |
209 | | - | |
210 | | - | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
211 | 217 | | |
212 | 218 | | |
213 | 219 | | |
| |||
Lines changed: 12 additions & 4 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
148 | 148 | | |
149 | 149 | | |
150 | 150 | | |
151 | | - | |
152 | | - | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
153 | 157 | | |
154 | 158 | | |
155 | 159 | | |
| |||
162 | 166 | | |
163 | 167 | | |
164 | 168 | | |
| 169 | + | |
165 | 170 | | |
166 | 171 | | |
167 | | - | |
168 | | - | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
169 | 177 | | |
170 | 178 | | |
171 | 179 | | |
| |||
Lines changed: 20 additions & 10 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
22 | 22 | | |
23 | 23 | | |
24 | 24 | | |
| 25 | + | |
25 | 26 | | |
26 | 27 | | |
27 | 28 | | |
| |||
35 | 36 | | |
36 | 37 | | |
37 | 38 | | |
38 | | - | |
39 | | - | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
40 | 42 | | |
41 | 43 | | |
42 | 44 | | |
43 | 45 | | |
44 | | - | |
| 46 | + | |
| 47 | + | |
45 | 48 | | |
46 | 49 | | |
47 | | - | |
| 50 | + | |
| 51 | + | |
48 | 52 | | |
49 | 53 | | |
50 | 54 | | |
| |||
274 | 278 | | |
275 | 279 | | |
276 | 280 | | |
277 | | - | |
| 281 | + | |
| 282 | + | |
| 283 | + | |
278 | 284 | | |
279 | 285 | | |
280 | 286 | | |
| |||
289 | 295 | | |
290 | 296 | | |
291 | 297 | | |
292 | | - | |
293 | | - | |
294 | | - | |
295 | | - | |
| 298 | + | |
| 299 | + | |
| 300 | + | |
| 301 | + | |
| 302 | + | |
296 | 303 | | |
297 | | - | |
| 304 | + | |
| 305 | + | |
| 306 | + | |
| 307 | + | |
298 | 308 | | |
299 | 309 | | |
300 | 310 | | |
| |||
Lines changed: 61 additions & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
| 24 | + | |
| 25 | + | |
24 | 26 | | |
25 | 27 | | |
26 | 28 | | |
| |||
644 | 646 | | |
645 | 647 | | |
646 | 648 | | |
647 | | - | |
| 649 | + | |
648 | 650 | | |
649 | 651 | | |
| 652 | + | |
| 653 | + | |
| 654 | + | |
| 655 | + | |
| 656 | + | |
| 657 | + | |
| 658 | + | |
| 659 | + | |
| 660 | + | |
| 661 | + | |
| 662 | + | |
| 663 | + | |
| 664 | + | |
| 665 | + | |
| 666 | + | |
| 667 | + | |
| 668 | + | |
| 669 | + | |
| 670 | + | |
| 671 | + | |
| 672 | + | |
| 673 | + | |
| 674 | + | |
| 675 | + | |
| 676 | + | |
| 677 | + | |
| 678 | + | |
| 679 | + | |
| 680 | + | |
| 681 | + | |
| 682 | + | |
| 683 | + | |
| 684 | + | |
| 685 | + | |
| 686 | + | |
| 687 | + | |
| 688 | + | |
| 689 | + | |
| 690 | + | |
| 691 | + | |
| 692 | + | |
| 693 | + | |
| 694 | + | |
| 695 | + | |
| 696 | + | |
| 697 | + | |
| 698 | + | |
| 699 | + | |
| 700 | + | |
| 701 | + | |
| 702 | + | |
| 703 | + | |
| 704 | + | |
| 705 | + | |
| 706 | + | |
| 707 | + | |
| 708 | + | |
| 709 | + | |
650 | 710 | | |
Lines changed: 9 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
143 | 143 | | |
144 | 144 | | |
145 | 145 | | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
146 | 155 | | |
0 commit comments