Ocropy segmentation, squared #47

bertsky · 2020-05-09T21:16:47Z

This is a major rework of Ocropy's rule-based segmentation.

It greatly improves the situation with some long-standing problems, among them…

DPI relativity (again)
robust h/v-line separator detection (again)
text/image segmentation (sort-of)
conflation of very close text lines (vertically due to ascenders/descenders, horizontally due to noise)
reading order
performance (OpenCV and PIL instead of SciPy)

…but also offers solutions to uncharted terrain (for ocropus/ocrolib, that is)…

page segmentation into regions (via new variant of recursive X-Y cut)
table recognition (but not detection!)
incremental annotation (ignoring but sorting existing regions, deleting existing text regions from reading order)

The OCR-D processor most affected by this is ocrd-cis-ocropy-segment (now with a usable level-of-operation=page and a new level-of-operation=table), and to a lesser extent ocrd-cis-ocropy-resegment.

For the details, see the changelog in the individual commits.

Here is an excerpt from the segment processor's docstring:

Segment pages into regions+lines, tables into cells+lines, or regions into lines.

Open and deserialise PAGE input files and their respective images,
then iterate over the element hierarchy down to the requested level.

Depending on level-of-operation, consider existing segments:

- if overwrite_separators=True on page level, then
  delete any SeparatorRegions,
- if overwrite_regions=True on page level, then
  delete any top-level TextRegions (along with ReadingOrder),
- if overwrite_regions=True on table level, then
  delete any TextRegions in TableRegions (along with their OrderGroup),
- if overwrite_lines=True on region level, then
  delete any TextLines in TextRegions.

Next, get each element image according to the layout annotation (from
the alternative image of the page/region, or by cropping via coordinates
into the higher-level image) in binarized form, and represent it as an array
with non-text regions and (remaining) text neighbours suppressed.

Then compute a text line segmentation for that array (as a label mask).
When level-of-operation is page or table, this also entails
detecting
- up to maximages large foreground images,
- up to maxseps foreground h/v-line separators and
- up to maxcolseps background column separators
before text line segmentation itself, as well as aggregating text lines
to text regions afterwards.

Text regions are detected via a hybrid variant of the recursive X-Y cut algorithm
(RXYC): RXYC partitions the binarized image in top-down manner by detecting
horizontal or vertical gaps. This implementation uses the bottom-up text line
segmentation to guide the search, and also uses both pre-existing and newly
detected separators to alternatively partition the respective boxes into
non-rectangular parts.
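To make the top-down part concrete, here is a minimal sketch of the classic recursive X-Y cut on a tiny binary image, in pure Python. It is not the hybrid variant described above (no line labels, no separators, no non-rectangular partitions). The names `find_gaps` and `xy_cut` and the `min_gap` parameter are made up for illustration and do not appear in the code base.

```python
def find_gaps(profile, min_gap):
    """Return (start, end) ranges of zero runs with at least min_gap entries
    that lie strictly inside the profile (not touching either border)."""
    gaps = []
    start = None
    for i, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = i
        else:
            if start is not None and start > 0 and i - start >= min_gap:
                gaps.append((start, i))
            start = None
    return gaps


def xy_cut(image, top, bottom, left, right, min_gap=2):
    """Recursively split the box [top:bottom, left:right] of a binary image
    (list of rows of 0/1) at its widest gap; return the leaf boxes."""
    # projection profiles: foreground count per row and per column
    rows = [sum(image[y][left:right]) for y in range(top, bottom)]
    cols = [sum(image[y][x] for y in range(top, bottom))
            for x in range(left, right)]
    if sum(rows) == 0:
        return []  # nothing but background
    # trim empty margins so that only internal gaps remain
    ys = [i for i, v in enumerate(rows) if v]
    xs = [i for i, v in enumerate(cols) if v]
    top, bottom = top + ys[0], top + ys[-1] + 1
    left, right = left + xs[0], left + xs[-1] + 1
    rows = rows[ys[0]:ys[-1] + 1]
    cols = cols[xs[0]:xs[-1] + 1]
    # candidate cuts: horizontal gaps (cut along y), vertical gaps (cut along x)
    candidates = [(e - s, 'y', s, e) for s, e in find_gaps(rows, min_gap)]
    candidates += [(e - s, 'x', s, e) for s, e in find_gaps(cols, min_gap)]
    if not candidates:
        return [(top, bottom, left, right)]  # no more cuts possible
    _, axis, s, e = max(candidates)  # widest gap wins
    if axis == 'y':
        return (xy_cut(image, top, top + s, left, right, min_gap) +
                xy_cut(image, top + e, bottom, left, right, min_gap))
    return (xy_cut(image, top, bottom, left, left + s, min_gap) +
            xy_cut(image, top, bottom, left + e, right, min_gap))


# two stacked blocks in a left column, one tall block in a right column
image = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
]
boxes = xy_cut(image, 0, 6, 0, 6)  # boxes are (top, bottom, left, right)
```

In this toy case the vertical gap is cut first, so the left column is fully partitioned before the right one, which corresponds to a column-first succession.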

During line segmentation, suppress the foreground of all previously annotated
regions (of any kind) and lines, except if just removed due to overwrite.
During region aggregation however, combine the existing separators with the
new-found separators to guide the column search.

All detected segments (both text line and text region) are sorted according
to their reading order (assuming a top-to-bottom, left-to-right ordering).
When level-of-operation is page, prefer vertical (column-first)
succession of regions. When it is table, prefer horizontal (row-first)
succession of cells.

Then for each resulting segment label, convert its background mask into
polygon outlines by finding the outer contours consistent with the element's
polygon outline. Annotate the result by adding it as a new TextLine/TextRegion:

If level-of-operation is region, then append the new lines to the
parent region.

If it is table, then append the new lines to their respective regions,
and append the new regions to the parent table.
(Also, create an OrderedGroup for it as the parent's RegionRef.)

If it is page, then append the new lines to their respective regions,
and append the new regions to the page.
(Also, create an OrderedGroup for it in the ReadingOrder.)

Produce a new output file by serialising the resulting hierarchy.

(Example images following shortly.)

If level-of-operation=region, then recurse into text regions within table regions, finding text lines. If a table does not have any text regions yet, then add one pseudo-block to it which covers the whole table (and recurse into that, but in fullpage mode to also detect h/v-lines and white space columns).

- make fg h/v-line detection more robust to non-contiguous or slightly skewed/bent shapes, as well as overlapping/touching glyphs:
  - compute_separators_morph: better algorithm based on a sequence of vertical/horizontal open and close operations, combined with binary reconstruction/seedfill
  - remove_hlines: deprecate
  - compute_hlines: new implementation analogous to compute_separators_morph
  - return slightly dilated (enlarged) masks to facilitate annotating polygon contours (besides immediately suppressing foreground)
- improve bg column separator detection:
  - vertically dilate gradient edges before (not after) combining with thresholded whitespace
  - run after (not before) removing v-lines
- call scale estimation with zoom parameter (making this central measure itself become DPI-relative)
- revise earlier doubling of estimated scale: now only necessary for compute_line_seeds (but not for separator and gradmap estimation)
- compute_line_seeds:
  - use vscale=2
  - respect colseps early on
- hmerge_line_seeds: more robust based on morphology and consistency of center point (y center inside other's bbox, but x center not inside other's bbox), not only global overlap counts
- rename compute_line_labels back to compute_segmentation
- also improve spreading line seeds to background:
  - watch fg components: split seed conflicts, but keep others on their majority side
  - simplify (much faster)
- lines2regions: new implementation: instead of bottom-up bbox matching/merging rules, combine adjacent lines by morphologically closing while splitting at fg h/v-line and bg colseps
- improve documentation
- use type-check decorators (as in ocrolib proper)
- uncomment all DSAVE statements, but disable function via decorator (as single place to re-enable all file-based or interactive plots)
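As an illustration of the opening step behind h-line detection, the following pure-Python sketch keeps only horizontal foreground runs that are at least a given multiple of the estimated scale long. This shows only a single opening, whereas the commit describes a sequence of open/close operations plus reconstruction. The function name `horizontal_opening`, the toy `scale` value and the factor of 2 are assumptions made for the example.

```python
def horizontal_opening(mask, min_length):
    """Keep only horizontal foreground runs of at least min_length pixels.

    For a binary mask (list of rows of 0/1), this is equivalent to a
    morphological opening with a flat 1 x min_length structuring element.
    """
    result = [[0] * len(row) for row in mask]
    for y, row in enumerate(mask):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1  # advance to the end of this run
                if x - start >= min_length:
                    for i in range(start, x):
                        result[y][i] = 1
            else:
                x += 1
    return result


scale = 3               # assumed glyph size estimate in pixels (DPI-relative)
min_length = 2 * scale  # assumed threshold: twice the scale
mask = [
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],  # only glyph-sized runs
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],  # one long run (8 px) plus noise
]
hlines = horizontal_opening(mask, min_length)
```

Because the threshold is expressed in multiples of the scale, the same setting should carry over to scans of a different resolution, as long as the scale estimate itself follows the pixel density.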

- common.compute_segmentation:
  - read binary image instead of grayscale-normalized
  - pass in external separator mask to be combined with detected separators
  - move sanity checks to common.check_*
- segment:
  - aggregate all previously existing regions, including text regions/lines (except when removing them anyway via `overwrite`), and suppress them in the foreground while doing line segmentation; moreover, pass any separators among them extra (to guide region segmentation)
  - when `level-of-operation=page`, after suppressing existing tables, iterate through them and if they do not contain text regions (cells) yet, segment them into regions and lines likewise
- README: update

- common.compute_segmentation with fullpage (for segment on page level): prevent sepmask (from h/v-lines and colseps) from being filled when spreading line seeds into background by provisionally attaching a label for it and spreading it against the other line labels
- common.compute_line_seeds: skip aggregating height statistics for warnings of large lines (too slow)

- when adding text regions to pages or tables, also add to (recursive) ReadingOrder in the order of region labels
- when indexed, start from existing elements
- when adding cells to a table, convert RegionRef(Indexed) to an equally located OrderedGroup(Indexed)

- instead of silently segmenting existing tables when running on page level, now simply ignore tables (like any other non-text regions), but offer a dedicated table level, which ignores text regions (like any other non-table regions), and only segments tables without existing cells
- add overwrite_separators=True on page level (for finding separators via ocrolib instead of ignoring existing separator regions)

- uniform_filter based dilate/erode/open/close: replace exact zero with approx infinitesimal to avoid artifacts from rounding
- all: correct pixel origin depending on even/odd filter size to avoid asymmetric results as best as possible (even kernel sizes of course will still cause asymmetry, but to a lesser extent)
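The asymmetry caused by even kernel sizes can be seen in a one-dimensional toy dilation. This is a pure-Python sketch, not the uniform_filter based code from ocrolib. `dilate1d` and `ensure_odd` are hypothetical helpers, and rounding the size up to the next odd number is just one possible remedy.

```python
def dilate1d(signal, size):
    """Binary dilation of a 0/1 list with a flat window of `size` pixels.

    The window covers offsets -(size // 2) .. (size - 1) // 2 around each
    pixel, so it is centered on the pixel only if `size` is odd.
    """
    before = size // 2
    after = (size - 1) // 2
    n = len(signal)
    return [int(any(signal[max(0, i - before):min(n, i + after + 1)]))
            for i in range(n)]


def ensure_odd(size):
    """Round an even window size up to the next odd number."""
    return size if size % 2 else size + 1


impulse = [0, 0, 0, 1, 0, 0, 0]
odd = dilate1d(impulse, 3)   # grows by one pixel on both sides
even = dilate1d(impulse, 2)  # grows on one side only
```

With size 3 the single foreground pixel grows symmetrically, while with size 2 it grows to the right only, which is the kind of bias that a corrected pixel origin can reduce but not remove.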

- morph.reading_order: instead of providing only plain y.start top-down ordering, add this combined strategy doing both
  - y.center top-down but also
  - y.overlaps x.non-overlaps left-right
  and also offer the reverse (bottom-up, right-left)
- sl: add missing functions:
  - compose: slice the slice!
  - xcenter_in / ycenter_in
  - top / bottom / left / right
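A rough sketch of such a combined strategy on plain bounding boxes follows. It is not the morph.reading_order implementation: `compare` and `reading_order` are hypothetical names, and the pairwise rule below is not guaranteed to be transitive for arbitrary layouts, so a more careful version would probably build a precedence graph and sort it topologically.

```python
from functools import cmp_to_key


def compare(a, b):
    """Order two boxes given as (top, bottom, left, right), end-exclusive.

    Boxes that overlap vertically but not horizontally are side by side
    and go left to right; all other pairs go top-down by their y center.
    """
    y_overlap = a[0] < b[1] and b[0] < a[1]
    x_overlap = a[2] < b[3] and b[2] < a[3]
    if y_overlap and not x_overlap:
        return -1 if a[2] < b[2] else 1
    ya = (a[0] + a[1]) / 2
    yb = (b[0] + b[1]) / 2
    return (ya > yb) - (ya < yb)


def reading_order(boxes):
    """Sort boxes with the pairwise rule above."""
    return sorted(boxes, key=cmp_to_key(compare))


left_line = (4, 10, 0, 10)    # starts lower than its right neighbour
right_line = (0, 12, 20, 30)  # same text row, further right
below = (20, 30, 0, 30)
ordered = reading_order([below, right_line, left_line])
```

A plain y.start (or y.center) sort would put `right_line` before `left_line` here, because it begins slightly higher. The overlap rule keeps them in left-to-right order.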

- compute_images: new function to detect and suppress large foreground objects early on that are not h/v-lines (and _not_ search for h/v-lines or assign text lines within them); this could be true graphics/photos/figures, but also drop-capitals
- compute_hlines / compute_separators_morph: reconstruct after opening by length not by keeping any overlapping component, but only up to a certain distance (to avoid overlapping glyphs but still get most of the line's parts); ignore parts that already belong to image components identified by compute_images (so they don't compete in n-best race)
- compute_colseps_conv:
  - don't blur away small/protruding glyphs below fg/bg threshold
  - don't use cleaned but raw boxmap (again) to avoid marking small fg as bg
- compute_segmentation: filter out line labels that only have noise fg (i.e. components that have been filtered as too small/large)
- ensure odd kernel sizes everywhere
- DSAVE (visualization for debugging): use uniformly bright and maximally differentiating colormap, and set off 0 (background) to black; allow passing a second array with foreground as white

- lines2regions: instead of bbox consistency heuristics, implement a hybrid recursive X-Y cut segmentation, which not only considers horizontal/vertical gaps in foreground (discounting noise pixels), but also avoids splitting line labels, and uses the detected/passed-in separators to alternatively cut at non-rectangular partitions instead of horizontal/vertical gaps
- sort all line labels, gap-based slices and separator-based partitions via proper (top-down left-to-right, or reversed) reading order

- add an ImageRegion for each image found by compute_segmentation
- follow-up on 4bb1ddb (incremental annotation): during line segmentation, merely suppress neighbouring/other existing segments, but during region segmentation, pass separators as sepmask but other regions as pseudo-line labels to be identified within reading order; afterwards, re-identify them (to avoid adding new elements, but still reference them accordingly in the reading order group)
- follow-up on 7748ca4 (add reading order):
  - reference each region in the ReadingOrder, increasing @index when in an OrderedGroup(Indexed) (on Page or in TableRegion)
  - for TableRegions, replace existing RegionRef(Indexed) by an equally indexed OrderedGroup(Indexed) to hold all the cell regions
  - do the right thing in ReadingOrder even when `overwrite_regions=True`

bertsky · 2020-05-09T21:23:57Z

Also fixes #41 (and a still unreported regression from 48a89e9), both in recognize.

lgtm-com · 2020-05-09T21:38:37Z

This pull request introduces 10 alerts and fixes 2 when merging f242984 into 48a89e9 - view on LGTM.com

new alerts:

5 for Unused local variable
2 for Unused import
1 for 'import *' may pollute namespace
1 for Variable defined multiple times
1 for Nested loops with same variable

fixed alerts:

2 for Except block handles 'BaseException'

bertsky · 2020-05-09T22:11:49Z

Example A

original (provided by @wrznr)
binarization with ocrd-olena-binarize (sauvola-ms-split / k=0.1)
large component, non-separator ("image") detection
h-line separator detection
v-line separator detection
column separator detection, background threshold
column separator detection, horizontal gradient map
column separator detection, combined (final colwsseps)
all separators/images combined
textline detection, vertical gradient map
textline detection, bottom/top marks and line seed
textline detection, filtered line labels
textline detection, spreading from fg into bg, but also against separators
textline detection, final line labels
recursive X-Y cut, full-size box with x/y profiles at the margins
recursive X-Y cut, full-size box with all gap candidates (un/prominent, dis/allowed)
recursive X-Y cut, first vertical slice with all gaps
recursive X-Y cut, second vertical slice with all gaps
recursive X-Y cut, second vertical slice, second horizontal slice
recursive X-Y cut, second vertical slice, second horizontal slice
recursive X-Y cut, second vertical slice, second horizontal slice
recursive X-Y cut, second vertical slice, second horizontal slice
recursive X-Y cut, second vertical slice, second horizontal slice
recursive X-Y cut, second vertical slice, second horizontal slice (no more cuts)
recursive X-Y cut, third vertical slice with all gaps
recursive X-Y cut, third vertical slice, first vertical slice with all gaps
recursive X-Y cut, third vertical slice, first vertical slice with 3 vertical partitions
recursive X-Y cut, third vertical slice, first vertical slice, middle partition with all gaps
recursive X-Y cut, third vertical slice, first vertical slice, middle partition, second vertical slice with all gaps
recursive X-Y cut, third vertical slice, first vertical slice, middle partition, second vertical slice, second vertical slice with vertical gaps

note: we don't choose the most prominent gaps here, because they would produce partitions that sum up to less total height than the partitions created by the less prominent blue gap (we value height more than width because we are segmenting a page and not a table)
recursive X-Y cut, third vertical slice, first vertical slice, middle partition, second vertical slice, second vertical slice, first vertical slice with all gaps
recursive X-Y cut, third vertical slice, first vertical slice, middle partition, second vertical slice, second vertical slice, first vertical slice with 3 vertical partitions
recursive X-Y cut, third vertical slice, first vertical slice, middle partition, second vertical slice, second vertical slice, first vertical slice, left partition with all gaps
recursive X-Y cut, third vertical slice, first vertical slice, middle partition, second vertical slice, second vertical slice, first vertical slice, left partition, all 5 slices
recursive X-Y cut, third vertical slice, first vertical slice, middle partition, second vertical slice, second vertical slice, second vertical slice with all gaps
recursive X-Y cut, third vertical slice, first vertical slice, middle partition, second vertical slice, second vertical slice, second vertical slice, third horizontal slice with all gaps
recursive X-Y cut, final regions with contours around lines
final result in PageViewer with reading order and quite a few pseudo-regions due to binarization noise

bertsky · 2020-05-09T22:31:53Z

Example B

original (provided by @jbarth-ubhd, downsampled from 1200 to 300 DPI for Github)
large component, non-separator ("image") detection
h-line separator detection
v-line separator detection
column separator detection, combined (final colwsseps)
textline detection, filtered line labels
textline detection, final line labels
recursive X-Y cut, full-size box with x/y profiles at the margins
recursive X-Y cut, full-size box with 2 vertical partitions
recursive X-Y cut, left partition with all gaps
recursive X-Y cut, left partition, first vertical slice (no more cuts possible)
recursive X-Y cut, left partition, second vertical slice (no more cuts possible)
recursive X-Y cut, right partition with all gaps
recursive X-Y cut, final regions with contours around lines
final result in PageViewer with reading order and benign under-segmentation

bertsky · 2020-05-09T23:40:30Z

Example C

original (artificial, digital-born)
Tesseract segmentation in PageViewer (malign under-segmentation in paragraphs and cells, and textlines crossing separators, table detection with sub-optimal boundaries)

note: We'll throw away separators (overwrite_separators) and text regions (overwrite_regions), keeping only table regions, because we have no table detection of our own yet!
h-line separator detection (cutting off overlapping glyph)
v-line separator detection (neglecting segments which are only 1 line in height)
column separator detection, combined (final colwsseps)

Note: we have suppressed the table regions that we want to keep during page segmentation (just not for h/v-line detection)
all separators combined (including ignored regions)
textline detection, filtered line labels
textline detection, spreading from fg into bg, but also against separators
textline detection, final line labels
recursive X-Y cut, full-size box with x/y profiles at the margins
recursive X-Y cut, full-size box with 5 vertical partitions

Note: the upper table's separators bleed into the overall background, so they don't yield separate partitions at this iteration; the lower 2 tables are isolated enough to create true partitions, but since they also contain large existing table regions, a number of would-be partitions get fused together
recursive X-Y cut, upper/largest partition with all gaps
recursive X-Y cut, upper/largest partition, first vertical slice with x/y profiles at the margins

Note: we are not allowed to cut through the existing table region segment now, but the heading is too close for a cut between text and table
recursive X-Y cut, upper/largest partition, first vertical slice with 2 partitions

Note: within this slice, we at least get to partition the heading against the table (but they would be split afterwards even if they had been kept in one region label)
recursive X-Y cut, upper/largest partition, second vertical slice with all gaps
recursive X-Y cut, upper/largest partition, third vertical slice, second vertical slice, first horizontal slice with all gaps
recursive X-Y cut, upper/largest partition, third vertical slice, second vertical slice, second horizontal slice with all gaps
recursive X-Y cut, third partition with x/y profiles at the margins

Note: again, we are not allowed to cut within the existing table region, and the adjacent text region is too close for a cut
recursive X-Y cut, final regions with line contours
recursive X-Y cut, final regions with contours around lines
page segmentation result in PageViewer with reading order including tables (but still without recursive structure)

Note: now we can enter level-of-operation=table for the 3 table instances
all separators combined (fg lines existing from page segmentation and bg colseps detected here)
textline detection, filtered line labels
textline detection, spreading from fg into bg, but also against separators
recursive X-Y cut, full-size box with x/y profiles at the margins
recursive X-Y cut, full-size box with 8 partitions
recursive X-Y cut, final regions with contours around lines
final result in PageViewer with reading order and recursive table structure

bertsky · 2020-05-09T23:52:49Z

One word about performance: resegment and region-level segment are much faster than before, but page/table-level segment is slower than before because recursive X-Y cut (despite not using back-tracking) takes its toll. However, a 300 DPI page still should not take more than 30s. (Images with high pixel density do not get downsampled yet, so runtime will probably increase quadratically. There is a lot of head-room for further optimizations, e.g. not repeating component analysis unnecessarily every other line.)
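As a back-of-the-envelope illustration of the quadratic growth: the pixel count scales with the square of the DPI, so if runtime were simply proportional to the number of pixels, the 30s bound at 300 DPI would extrapolate as sketched below. These figures are an extrapolation under that assumption, not measurements, and `estimated_runtime` is a hypothetical helper.

```python
def estimated_runtime(dpi, base_dpi=300, base_seconds=30.0):
    """Extrapolate runtime, assuming it is proportional to the pixel count."""
    return base_seconds * (dpi / base_dpi) ** 2


for dpi in (300, 600, 1200):
    print(dpi, "DPI:", estimated_runtime(dpi), "s")
```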

I should also mention that there are quite a few parameters to control page segmentation. I have no idea how general my defaults are though. If you get strange results, look at the number and length of lines to be detected, number of images to be detected, and especially gap_width and gap_height. (Plus I recommend Sylwester&Seth 1996: A Trainable, Single-Pass Algorithm for Column Segmentation for ideas how to optimise these from GT data.)

Or activate the DSAVE function to visualise intermediate results as shown above (either interactively via pyplot.show() or as files via pyplot.imsave()):

ocrd_cis/ocrd_cis/ocropy/common.py

Lines 389 to 390 in f242984

    
    @disabled()
    def DSAVE(title, array, interactive=False):
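For readers unfamiliar with the pattern, this is roughly how a decorator can switch such a debugging function on and off in one place. It is a generic sketch and not the code from common.py: the `active` keyword, the function bodies and `DSAVE_enabled` are assumptions made for the example.

```python
import functools


def disabled(active=False):
    """Decorator factory: replace the function by a no-op unless active."""
    def decorator(func):
        if active:
            return func  # leave the function untouched

        @functools.wraps(func)
        def noop(*args, **kwargs):
            return None  # silently ignore the call
        return noop
    return decorator


calls = []


@disabled()
def DSAVE(title, array, interactive=False):
    calls.append(title)  # stand-in for plotting or saving the array


@disabled(active=True)
def DSAVE_enabled(title, array, interactive=False):
    calls.append(title)


DSAVE("hlines", [[0, 1]])           # ignored
DSAVE_enabled("colseps", [[1, 0]])  # recorded
```

The call sites stay in the code unchanged, so flipping the decorator argument (or removing the decorator) is the only change needed to get the plots back.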

lgtm-com · 2020-05-10T16:49:10Z

This pull request introduces 6 alerts and fixes 3 when merging 907c00f into 48a89e9 - view on LGTM.com

new alerts:

3 for Unused local variable
1 for 'import *' may pollute namespace
1 for Variable defined multiple times
1 for Nested loops with same variable

fixed alerts:

2 for Except block handles 'BaseException'
1 for 'import *' may pollute namespace

lgtm-com · 2020-05-10T19:24:32Z

This pull request introduces 1 alert and fixes 3 when merging 32786a6 into 48a89e9 - view on LGTM.com

new alerts:

1 for 'import *' may pollute namespace

fixed alerts:

2 for Except block handles 'BaseException'
1 for 'import *' may pollute namespace

lgtm-com · 2020-05-10T21:50:01Z

This pull request introduces 1 alert and fixes 3 when merging b505d65 into 48a89e9 - view on LGTM.com

new alerts:

1 for 'import *' may pollute namespace

fixed alerts:

2 for Except block handles 'BaseException'
1 for 'import *' may pollute namespace

bertsky · 2020-05-11T10:07:54Z

Thanks @finkf for the invitation!

Cannot invite non-collaborators for a review directly, but if @kba or @wrznr would care to give it a try, that would be awesome.

finkf

I am OK with it.

Robert Sachunsky added 26 commits May 9, 2020 22:00

deskew: boost performance by rotating via PIL instead of scipy

1c6096b

segment/resegment: require images to be binarized already...

ca4d831

(instead of doing ad-hoc binarization, which is either redundant and thus wastes time, or may be a suboptimal workflow choice)

segment/resegment: avoid (rare) invalid coordinates...

74710db

(by using Shapely instead of CV2 to simplify polygons)

ocrolib: make basic scale estimation zoomable (DPI-relative)!

8f38480

ocrolib: minor improvements...

9a17bc8

- morph.all_neighbors: - fix return value (keep pairs) - add kwarg bg to skip - use true shift instead of roll, and fill with bg - add kwarg dist for arbitrary distance - psegutils.compute_boxmap: break loop earlier (faster)

segment: adapt to changes and...

954973c

- introduce/pass new ocrd-tool parameters - hlminwidth (minimum length of h-lines, in multiples of scale) - csminheight (minimum length of v-lines, in multiples of scale) - after page segmentation, also add detected text lines below text regions

resegment: adapt to changes and...

24e3ead

- do not give up when a contour retains too small a share of the background (only foreground is relevant for threshold)

binarize: expose 'threshold' parameter

faad379

ocrolib: fix loading uncompressed model (Py2/3)

3bb1113

ocrolib: fallback heuristic for basic scale estimation

bbf1c29

recognize: fix regression from 48a89e9

f756167

ocrolib.morph: replace SciPy with faster OpenCV morphology/component analysis

90257da

ocrolib.morph: add utility for performance comparisons

625a47a

segment: add an AlternativeImage clipping non-text to bg...

dddf5cc

- for all pre-existing and new-found separators and images, create a derived image where they are suppressed (white) - write that image (with `clipped` in @comments) to the second output file group (or `OCR-D-IMG-CLIP`)

re/segment: fix polygons (keep detected polygon paths _open_)

f242984

bertsky mentioned this pull request May 9, 2020

segment-line: Self-intersection at or near point ... OCR-D/ocrd_tesserocr#123

Closed

make LGTM checker happy

32786a6

bertsky force-pushed the segment-table-lines branch from 907c00f to 32786a6 May 10, 2020 19:02

segment: don't try to add if no reading order group exists

b505d65

bertsky requested a review from finkf May 11, 2020 10:00

bertsky linked an issue May 11, 2020 that may be closed by this pull request

ocrd-cis-ocropy-recognize: 'ascii' codec can't decode byte 0xa9 #41

Closed

finkf approved these changes May 11, 2020

finkf merged commit fe129fe into cisocrgroup:dev May 11, 2020

bertsky mentioned this pull request Nov 20, 2020

ocrd-cis-ocropy-segment default level-of-operation #81

Closed

bertsky mentioned this pull request Jan 12, 2021

Rewrite OCR-D/ocrd_kraken#33

Merged
