I've been thinking about `splitOn` and `replace` for `ByteString`. They can be expressed in terms of `breakSubstring`, but the more I look at the latter the more doubts I get about its implementation.
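To make the first point concrete, here is a minimal sketch of how the two functions could be layered on top of `breakSubstring`. The names `splitOn` and `replace` and their argument order are assumptions for illustration, not an existing `bytestring` API, and rejecting an empty separator is just one possible design choice.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- | Split the input on every occurrence of a non-empty separator.
-- Hypothetical helper, not part of the bytestring API.
splitOn :: ByteString -> ByteString -> [ByteString]
splitOn sep src
  | B.null sep = error "splitOn: empty separator"
  | otherwise  = go src
  where
    go s = case B.breakSubstring sep s of
      (before, rest)
        -- breakSubstring returns an empty second component when there is no match
        | B.null rest -> [before]
        -- otherwise rest starts with sep, so skip it and continue
        | otherwise   -> before : go (B.drop (B.length sep) rest)

-- | Replace every occurrence of old with new (hypothetical helper).
replace :: ByteString -> ByteString -> ByteString -> ByteString
replace old new = B.intercalate new . splitOn old

-- Example value; Char8.pack is used only to get readable literals.
exampleFields :: [ByteString]
exampleFields = splitOn (BC.pack ",") (BC.pack "a,b,,c")
```

Every piece produced by `splitOn` costs one `breakSubstring` call, so the performance of both helpers depends directly on how `breakSubstring` searches.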
Lines 1596 to 1613 in bd5412c
    karpRabin src
        | length src < lp = (src,empty)
        | otherwise = search (rollingHash $ unsafeTake lp src) lp
      where
        k           = 2891336453 :: Word32
        rollingHash = foldl' (\h b -> h * k + fromIntegral b) 0
        hp          = rollingHash pat
        m           = k ^ lp
        get = fromIntegral . unsafeIndex src
        search !hs !i
            | hp == hs && pat == unsafeTake lp b = u
            | length src <= i                    = (src,empty) -- not found
            | otherwise                          = search hs' (i + 1)
          where
            u@(_, b) = unsafeSplitAt (i - lp) src
            hs' = hs * k +
                  get i -
                  m * get (i - lp)
What's the reason for Karp-Rabin here? It is great for searching for multiple patterns at once, but this is not our case. I suspect that for non-pathological cases even a naive loop with `memcmp` could very well be faster. And for pathological inputs Karp-Rabin is O(mn) anyway. If we want to fix the worst-case scenario, we should employ Knuth-Morris-Pratt or Boyer-Moore.
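For comparison, a naive loop could look roughly like the sketch below. The name `naiveBreakSubstring` is hypothetical. It relies on `isPrefixOf`, which as far as I can tell compares the bytes with `memcmp`, and it mirrors `breakSubstring` by returning `(empty, src)` for an empty pattern. It has not been benchmarked, so it only shows the shape of the alternative.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- | Naive substring search: try every offset and compare the pattern
-- against the input at that offset. Worst case O(mn), no hashing.
-- Hypothetical name, not part of the bytestring API.
naiveBreakSubstring :: ByteString -> ByteString -> (ByteString, ByteString)
naiveBreakSubstring pat src
  | B.null pat = (B.empty, src)
  | otherwise  = go 0
  where
    -- last offset at which the pattern still fits into src
    end = B.length src - B.length pat
    go i
      | i > end                          = (src, B.empty)   -- not found
      | pat `B.isPrefixOf` B.drop i src  = B.splitAt i src  -- match at offset i
      | otherwise                        = go (i + 1)

-- Example value; Char8.pack is used only to get readable literals.
exampleBreak :: (ByteString, ByteString)
exampleBreak = naiveBreakSubstring (BC.pack "lo") (BC.pack "hello world")
```

A further refinement would be to jump to the next occurrence of the pattern's first byte with `elemIndex` before comparing, but that is left out here to keep the sketch short.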