Edwardvaneechoud
Owner

This pull request refactors the fuzzy join functionality in the FlowFile codebase to simplify the interface, improve test coverage, and centralize fuzzy matching logic. The main changes replace the custom fuzzy matching implementations with the external pl_fuzzy_frame_match library, update the fuzzy join API, and add tests for both the internal and external fuzzy join mechanisms.

Fuzzy Join Refactoring and API Improvements:

  • Replaced custom fuzzy matching logic and models in both flowfile_core and flowfile_worker with imports from the external pl_fuzzy_frame_match library, centralizing fuzzy matching logic and reducing code duplication. (flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py, flowfile_worker/flowfile_worker/funcs.py, flowfile_worker/flowfile_worker/models.py, flowfile_worker/flowfile_worker/polars_fuzzy_match/models.py) [1] [2] [3] [4] [5]
  • Updated the fuzzy join API in FlowDataEngine by removing the old do_fuzzy_join and fuzzy_match methods, and introducing fuzzy_join (for internal, in-process joins) and fuzzy_join_external (for external, possibly distributed joins). (flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py)
  • Modified the fuzzy join task in the worker to use the new fuzzy_match_dfs signature with explicit keyword arguments and logger support. (flowfile_worker/flowfile_worker/funcs.py)
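To make the idea of a fuzzy join called with explicit keyword arguments and an optional logger concrete, here is a minimal, self-contained sketch. It is not the pl_fuzzy_frame_match API and not the FlowDataEngine implementation: the function name `fuzzy_join_sketch`, its parameters, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
import logging
from difflib import SequenceMatcher


def fuzzy_join_sketch(*, left, right, left_on, right_on, threshold=0.8, logger=None):
    """Hypothetical stand-in: pair rows whose key strings are similar enough.

    All arguments are keyword-only, mirroring the "explicit keyword
    arguments" style described in the pull request.
    """
    log = logger or logging.getLogger(__name__)
    matches = []
    for l_row in left:
        for r_row in right:
            # Similarity in [0, 1]; compared case-insensitively.
            score = SequenceMatcher(
                None, l_row[left_on].lower(), r_row[right_on].lower()
            ).ratio()
            if score >= threshold:
                # Suffix right-hand columns to avoid name clashes.
                joined = {**l_row, **{f"{k}_right": v for k, v in r_row.items()}}
                joined["score"] = score
                matches.append(joined)
    log.info("fuzzy join produced %d matches", len(matches))
    return matches


left = [{"name": "Jon Smith"}, {"name": "Alice Jones"}]
right = [{"name": "John Smith"}, {"name": "Bob Brown"}]
result = fuzzy_join_sketch(
    left=left, right=right, left_on="name", right_on="name", threshold=0.8
)
```

With this sample data only "Jon Smith" and "John Smith" clear the threshold. The real library operates on Polars frames and is far more efficient than this nested loop; the sketch only shows the shape of the call.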

Testing Enhancements:

  • Added new tests to cover both internal and external fuzzy join methods, ensuring correctness and robustness of the refactored API. (flowfile_core/tests/flowfile/flowfile_table/test_flow_data_engine.py) [1] [2] [3]
  • Added a test for running a fuzzy match locally within the flow graph, verifying that the local execution path works as intended. (flowfile_core/tests/flowfile/test_flowfile.py)

Flow Graph Integration:

  • Updated the add_fuzzy_match method in the flow graph to use the new internal fuzzy join method when running locally, improving flexibility and reducing unnecessary externalization. (flowfile_core/flowfile_core/flowfile/flow_graph.py)
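The local versus external choice described above can be sketched as a simple dispatch. This is an illustration under stated assumptions, not the code in flow_graph.py: `run_fuzzy_match`, `match_pairs`, and the `run_locally` flag are hypothetical names, and a `ThreadPoolExecutor` merely stands in for handing the work to a separate worker.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def match_pairs(left, right, threshold):
    """Return all (left, right) string pairs whose similarity meets the threshold."""
    return [
        (l, r)
        for l in left
        for r in right
        if SequenceMatcher(None, l, r).ratio() >= threshold
    ]


def run_fuzzy_match(left, right, threshold=0.7, run_locally=True):
    """Hypothetical dispatcher: in-process when local, otherwise offloaded."""
    if run_locally:
        # Internal path: compute directly in the current process.
        return match_pairs(left, right, threshold)
    # External path (stand-in): submit the same work to another executor.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(match_pairs, left, right, threshold).result()


local = run_fuzzy_match(["apple", "banana"], ["appel", "cherry"], run_locally=True)
external = run_fuzzy_match(["apple", "banana"], ["appel", "cherry"], run_locally=False)
```

Both paths run the same matching function, so they should return identical results; checking that equivalence is the kind of assertion the new tests for the internal and external joins would plausibly make.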

These changes simplify the fuzzy join workflow, reduce duplicated code, and put both the internal and external fuzzy join paths under test.

Edwardvaneechoud marked this pull request as ready for review August 22, 2025 07:33
Edwardvaneechoud merged commit c1d1d1b into main Aug 22, 2025
12 checks passed
Bennylave pushed a commit to Bennylave/Flowfile that referenced this pull request Aug 26, 2025
* Migrating to pl-fuzzy-frame-match

* adding fuzzy match

* Adding fuzzy match method to flowgraph

* Schema callback changes in fuzzy match

* Fixing tests and increasing overlap between generator and flowfile

* Adapted pl-fuzzy-frame-match changes in branch

* fix issue with test

* adding prints to the test to debug

* Make the schema_callback.py threadsafe and the object in fuzzy matching as well.

* remove warning in _handle_fuzzy_match

* increasing version fuzzy frame match

* reverting change in the execution

* Improve threading and order fuzzy match results based on incoming data