ENH: Add on_bad_lines for pyarrow #54643
8b760eb to 62c873b
Does this fix any tests? There are some xfailed tests for |
@lithomas1 The tests have been refactored so that they do not fail for either the "python" engine or the "pyarrow" engine (see pandas/tests/io/parser/test_unsupported.py). I didn't find any other tests that were invalidated, as confirmed by the CI.
@lithomas1 By the way, is there any specific action I should take to request a review?
4ee6141 to bee7351
Rebased on master to pull in the compat workaround so that CI passes.
Can you add a whatsnew entry to 2.2.0.rst?
Try looking in pandas/tests/io/parser/common/test_read_errors.py. Those tests are currently skipped, but on_bad_lines is being used there. You should be able to make at least some of them pass. (You can try xfailing the tests there by changing the "pyarrow_skip" to a "pyarrow_xfail".)
@lithomas1 I've gotten all the tests within |
@lithomas1 Are there any other changes that you'd like me to look at from the test perspective?
Any updates on this PR?
LGTM. cc @lithomas1 if you have any comments.
@lithomas1 Anything else that needs to be covered in this pull request?
Thanks for the ping. I left some comments.
# Conflicts: # pandas/tests/io/parser/common/test_read_errors.py
@lithomas1 I've chained the ArrowInvalid exception with ParserError so that the tests are uniform. Does this match what you had in mind?
@lithomas1 Awaiting your review.
This looks good to me now. Thanks!
Thanks @amithkk
This adds the on_bad_lines argument to the pyarrow engine for the read_csv parser, closely following the behaviour of the python engine. Internally, it uses pyarrow's invalid_row_handler. The callable form differs slightly for pyarrow, so the difference is documented by pointing to pyarrow's documentation.
Usage example:
example.csv:
example.py
Console output