Search algorithms #307

@Bodigrim

I've been thinking about splitOn and replace for ByteString. They can be expressed in terms of breakSubstring, but the more I look at the latter the more doubts I get about its implementation.
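To make the first point concrete, here is a minimal sketch of how splitOn and replace could be layered on top of breakSubstring. The names splitOnSketch and replaceSketch are hypothetical and not part of bytestring, and returning the input unsplit for an empty pattern is an assumption of this sketch rather than a settled design.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Hypothetical helper, not part of bytestring: split src on every
-- non-overlapping occurrence of pat, scanning left to right.
splitOnSketch :: B.ByteString -> B.ByteString -> [B.ByteString]
splitOnSketch pat src
  | B.null pat = [src]  -- assumption: an empty pattern splits nothing
  | otherwise  = go src
  where
    go s = case B.breakSubstring pat s of
      (before, rest)
        -- breakSubstring returns an empty second component when
        -- a non-empty pattern is not found.
        | B.null rest -> [before]
        -- Otherwise rest starts with pat: drop it and keep searching.
        | otherwise   -> before : go (B.drop (B.length pat) rest)

-- Hypothetical helper: replace every occurrence of pat with sub.
replaceSketch :: B.ByteString -> B.ByteString -> B.ByteString -> B.ByteString
replaceSketch pat sub = B.intercalate sub . splitOnSketch pat

-- Small usage example: the three pieces "a", "b" and "c".
example :: [B.ByteString]
example = splitOnSketch (BC.pack ", ") (BC.pack "a, b, c")
```

Every piece costs one breakSubstring call, so the performance of both helpers is determined entirely by the search routine underneath.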

bytestring/Data/ByteString.hs

Lines 1596 to 1613 in bd5412c

    karpRabin src
        | length src < lp = (src,empty)
        | otherwise = search (rollingHash $ unsafeTake lp src) lp
      where
        k           = 2891336453 :: Word32
        rollingHash = foldl' (\h b -> h * k + fromIntegral b) 0
        hp          = rollingHash pat
        m           = k ^ lp
        get = fromIntegral . unsafeIndex src
        search !hs !i
            | hp == hs && pat == unsafeTake lp b = u
            | length src <= i                    = (src,empty) -- not found
            | otherwise                          = search hs' (i + 1)
          where
            u@(_, b) = unsafeSplitAt (i - lp) src
            hs' = hs * k +
                  get i -
                  m * get (i - lp)

What's the reason for using Karp-Rabin here? It is great for searching for multiple patterns at once, but that is not our case. I suspect that for non-pathological inputs even a naive loop with memcmp could very well be faster. And for pathological inputs Karp-Rabin is O(mn) anyway. If we want to fix the worst-case scenario, we should employ Knuth-Morris-Pratt or Boyer-Moore.
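For comparison, a naive search could look roughly like the sketch below. The name naiveBreakSubstring is hypothetical, and the sketch uses isPrefixOf as a stand-in for the memcmp call, on the assumption that it compares the bytes in bulk. It has not been benchmarked and is still O(mn) in the worst case, so it only illustrates the "simple loop" alternative, not the KMP or Boyer-Moore fix.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Hypothetical replacement for the search loop: try every offset in turn
-- and compare the pattern against the bytes starting there.
naiveBreakSubstring :: B.ByteString -> B.ByteString -> (B.ByteString, B.ByteString)
naiveBreakSubstring pat src = go 0
  where
    lp = B.length pat
    ls = B.length src
    go i
      -- Fewer than lp bytes remain, so the pattern cannot occur any more.
      | i + lp > ls                     = (src, B.empty)
      -- B.drop is O(1) on strict ByteStrings; isPrefixOf does the comparison.
      | pat `B.isPrefixOf` B.drop i src = B.splitAt i src
      | otherwise                       = go (i + 1)

-- Small usage example: splits just before the first occurrence of "lo".
example :: (B.ByteString, B.ByteString)
example = naiveBreakSubstring (BC.pack "lo") (BC.pack "hello world")
```

In this sketch an empty pattern matches at offset 0, and a missing pattern yields (src, empty), which is the same convention the karpRabin code above uses for the not-found case.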
