Description
Currently, all whitespace characters in a textbox are merged into a single space character (' ').
This makes it very difficult to extract tabular data.
In #106, I propose to introduce an extraction mode parameter that allows the user to choose between three extraction modes.
- :spaces (default): all white spaces are handled as a single space character
- :tabs: non-space white spaces are handled as tab characters
- :boxes: text between non-space white spaces is split into several textboxes with respective coordinates
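To make the three modes concrete, here is a minimal string-level sketch. It is only an analogue: `collapse_run` and `extract_demo` are hypothetical names that do not exist in the package, and real extraction works on PDF text objects and glyph positions rather than on plain strings. The sketch assumes that a whitespace run made only of ' ' counts as a "space" and that any run containing a tab or newline counts as a "non-space white space".

```julia
# Hypothetical string-level analogue of the three proposed modes.
# Nothing here is part of the package API.

# Map one whitespace run to its replacement: a run made only of ' '
# becomes a single space, any other run (tab, newline, ...) becomes a tab.
collapse_run(ws) = all(==(' '), ws) ? " " : "\t"

function extract_demo(text::AbstractString, mode::Symbol)
    if mode === :spaces
        # Every whitespace run collapses to one space (current behavior).
        return [replace(text, r"\s+" => " ")]
    elseif mode === :tabs
        # Non-space runs survive as tab characters.
        return [replace(text, r"\s+" => collapse_run)]
    elseif mode === :boxes
        # Same normalization, then one output string per tab-separated box.
        return String.(split(replace(text, r"\s+" => collapse_run), '\t'))
    else
        throw(ArgumentError("unknown extraction mode: $mode"))
    end
end
```

With the input `"a  b\t c\nd"`, `:spaces` and `:tabs` each give a one-element vector, while `:boxes` gives one entry per box. The sketch leaves out coordinates, which only exist in the real PDF layout.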
For this purpose, get_TextBox() no longer returns a tuple (text, w, h) but a vector of tuples (text, w, h, offset).
During evalContent!(), the vector is iterated to return a TextLayout for each set of box parameters.
For the modes :spaces and :tabs, get_TextBox() always returns a single-element vector, whereas in :boxes mode more than one TextLayout might be added to the output.
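The changed return shape can be sketched as follows. This is an illustration under loud assumptions: `demo_textbox` and `demo_layouts` are hypothetical stand-ins for get_TextBox() and the loop in evalContent!(), widths are faked as one unit per character, a tab character stands in for a non-space gap, and a named tuple stands in for TextLayout.

```julia
# Hypothetical stand-ins; the real get_TextBox and TextLayout differ.

# Returns a vector of (text, w, h, offset) tuples.
function demo_textbox(text::AbstractString, mode::Symbol)
    h = 1.0
    if mode === :boxes
        boxes = Tuple{String,Float64,Float64,Float64}[]
        offset = 0.0
        for part in split(text, '\t')
            push!(boxes, (String(part), Float64(length(part)), h, offset))
            offset += length(part) + 1   # advance past the box and the gap
        end
        return boxes
    end
    # :spaces and :tabs always yield exactly one box at offset 0.
    shown = mode === :spaces ? replace(text, r"\s+" => " ") : String(text)
    return [(shown, Float64(length(shown)), h, 0.0)]
end

# Analogue of the iteration in evalContent!: one layout per returned tuple.
function demo_layouts(text::AbstractString, mode::Symbol; x0 = 0.0)
    return [(text = t, x = x0 + offset, w = w, h = h)
            for (t, w, h, offset) in demo_textbox(text, mode)]
end
```

For example, `demo_layouts("ab\tc\t\td", :boxes)` yields four layouts, including one with empty text for the empty cell, whereas `:spaces` and `:tabs` yield a single layout each.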
The :spaces mode reproduces the current extraction behavior.
The :tabs mode is suited for extraction of "well-behaved" tabular data, i.e. tables with no empty cells, or whose empty cells contain at least a space character.
The :boxes mode is essential to extract tables that contain empty cells. In that case further textbox treatment is necessary, which I would provide in a separate PR.
@sambitdash Please comment if this sounds like a desired feature to you.
If so, we can still discuss whether control via a global variable is the best choice or whether we'd rather implement a keyword arg which is passed through the text extraction function chain.
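The two control options could look roughly like this. All names are hypothetical and chosen only for illustration, and `demo_collapse` is a crude stand-in that replaces every whitespace run with a space or a tab depending on the mode.

```julia
# Hypothetical sketch of the two control options; no name here exists in the package.

# Crude stand-in for the extraction step.
demo_collapse(text, mode::Symbol) =
    replace(text, r"\s+" => (mode === :spaces ? " " : "\t"))

# Option A: a global setting that every call reads implicitly.
const EXTRACTION_MODE = Ref(:spaces)
demo_extract_global(text) = demo_collapse(text, EXTRACTION_MODE[])

# Option B: a keyword argument handed down through the call chain.
demo_extract(text; mode::Symbol = :spaces) = demo_collapse(text, mode)
demo_page_text(lines; mode::Symbol = :spaces) =
    [demo_extract(l; mode = mode) for l in lines]
```

Option A keeps existing signatures unchanged but makes the result depend on hidden state (`EXTRACTION_MODE[] = :tabs` changes every later call). Option B requires touching each function in the chain, but every call states its mode explicitly.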