Migrating to pl-fuzzy-frame-match #108
Merged: Edwardvaneechoud merged 14 commits into main from improvement/replace_fuzzy_join_code_with_central_repo on Aug 22, 2025
Bennylave pushed a commit to Bennylave/Flowfile that referenced this pull request on Aug 26, 2025:
* Migrating to pl-fuzzy-frame-match
* adding fuzzy match
* Adding fuzzy match method to flowgraph
* Schema callback changes in fuzzy match
* Fixing tests and increasing overlap between generator and flowfile
* Adapted pl-fuzzy-frame-match changes in branch
* fix issue with test
* adding prints to the test to debug
* Make the schema_callback.py threadsafe and the object in fuzzy matching as well.
* remove warning in _handle_fuzzy_match
* increasing version fuzzy frame match
* reverting change in the execution
* Improve threading and order fuzzy match results based on incoming data
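One of the commits listed above makes `schema_callback.py` thread-safe. The actual implementation is not shown on this page, so the following is only a minimal sketch of one common way to do it: a callback wrapper that computes its result at most once, guarded by a lock. The class name `OnceCallback` and the `compute_schema` function are hypothetical and are not taken from the Flowfile code.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class OnceCallback(Generic[T]):
    """Hypothetical wrapper: run `func` at most once, even across threads."""

    def __init__(self, func: Callable[[], T]) -> None:
        self._func = func
        self._lock = threading.Lock()
        self._done = False
        self._result: Optional[T] = None

    def __call__(self) -> T:
        # Fast path: once a result is cached, return it without locking.
        if self._done:
            return self._result
        with self._lock:
            # Re-check inside the lock: another thread may have finished first.
            if not self._done:
                self._result = self._func()
                self._done = True
        return self._result


# Usage: several threads ask for the schema, but it is computed only once.
calls = []


def compute_schema():
    # Stand-in for an expensive schema calculation (assumption).
    calls.append(1)
    return ["id", "name"]


schema_callback = OnceCallback(compute_schema)
threads = [threading.Thread(target=schema_callback) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The re-check inside the lock is what prevents two threads that both missed the fast path from running the computation twice.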
This pull request refactors the fuzzy join functionality in the FlowFile codebase, with a focus on simplifying the interface, improving test coverage, and centralizing fuzzy matching logic. The most important changes are replacing the custom fuzzy matching implementations with the external `pl_fuzzy_frame_match` library, updating the API for fuzzy joins, and extending the tests to cover both internal and external fuzzy join mechanisms.

Fuzzy Join Refactoring and API Improvements:
* Replaced the custom fuzzy matching code in `flowfile_core` and `flowfile_worker` with imports from the external `pl_fuzzy_frame_match` library, centralizing fuzzy matching logic and reducing code duplication. (`flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py`, `flowfile_worker/flowfile_worker/funcs.py`, `flowfile_worker/flowfile_worker/models.py`, `flowfile_worker/flowfile_worker/polars_fuzzy_match/models.py`) [1] [2] [3] [4] [5]
* Changed the fuzzy join API of `FlowDataEngine` by removing the old `do_fuzzy_join` and `fuzzy_match` methods and introducing `fuzzy_join` (for internal, in-process joins) and `fuzzy_join_external` (for external, possibly distributed joins). (`flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py`)
* Updated the `fuzzy_match_dfs` signature with explicit keyword arguments and logger support. (`flowfile_worker/flowfile_worker/funcs.py`)

Testing Enhancements:
* Test changes in `flowfile_core/tests/flowfile/flowfile_table/test_flow_data_engine.py` [1] [2] [3]
* Test changes in `flowfile_core/tests/flowfile/test_flowfile.py`

Flow Graph Integration:
* Updated the `add_fuzzy_match` method in the flow graph to use the new internal fuzzy join method when running locally, improving flexibility and reducing unnecessary externalization. (`flowfile_core/flowfile_core/flowfile/flow_graph.py`)

These changes modernize the fuzzy join workflow, improve maintainability, and ensure that both internal and external fuzzy joins are robustly tested and easier to use.
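The split between an in-process join and an external join, with the flow graph choosing the in-process path when running locally, can be illustrated roughly as follows. This is a simplified sketch and not the Flowfile or `pl_fuzzy_frame_match` implementation: it uses plain Python dictionaries and `difflib` from the standard library instead of Polars, and the names `run_locally`, `external_runner`, and the function signatures are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_join(left: List[Row], right: List[Row], *, left_on: str,
               right_on: str, threshold: float = 0.8) -> List[Row]:
    """In-process fuzzy join (illustrative stand-in, not the real method)."""
    matches: List[Row] = []
    # Iterate the left rows in input order so results follow the incoming data.
    for l_row in left:
        for r_row in right:
            score = similarity(str(l_row[left_on]), str(r_row[right_on]))
            if score >= threshold:
                joined: Row = dict(l_row)
                joined.update({f"{k}_right": v for k, v in r_row.items()})
                joined["score"] = score
                matches.append(joined)
    return matches


def add_fuzzy_match(left: List[Row], right: List[Row], *, left_on: str,
                    right_on: str, threshold: float = 0.8,
                    run_locally: bool = True,
                    external_runner: Optional[Callable[..., List[Row]]] = None
                    ) -> List[Row]:
    """Hypothetical dispatcher: join in-process locally, else delegate."""
    if run_locally:
        return fuzzy_join(left, right, left_on=left_on,
                          right_on=right_on, threshold=threshold)
    if external_runner is None:
        raise ValueError("external_runner is required when run_locally=False")
    # Explicit keyword arguments keep the call site readable.
    return external_runner(left, right, left_on=left_on,
                           right_on=right_on, threshold=threshold)


left_rows = [{"name": "Jon Smith"}, {"name": "Alice"}]
right_rows = [{"name": "John Smith"}, {"name": "Bob"}]
local_result = add_fuzzy_match(left_rows, right_rows,
                               left_on="name", right_on="name")
```

With these sample rows, only "Jon Smith" and "John Smith" clear the default threshold, so `local_result` holds a single joined row. Passing `run_locally=False` together with an `external_runner` callable hands the same keyword arguments to that callable instead.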