Search algorithms

I've been thinking about `splitOn` and `replace` for `ByteString`. They can be expressed in terms of [`breakSubstring`](http://hackage.haskell.org/package/bytestring-0.11.0.0/docs/Data-ByteString.html#v:breakSubstring), but the more I look at the latter the more doubts I get about its implementation.

https://github.com/haskell/bytestring/blob/bd5412c1b7fac3f63cc1d2ea4e75bdf45c04b541/Data/ByteString.hs#L1596-L1613

What's the reason for Karp-Rabin here? It is great for searching for multiple patterns at once, but that is not our case. I suspect that for non-pathological inputs even a naive loop with `memcmp` could very well be faster. And for pathological inputs Karp-Rabin is _O(mn)_ anyway. If we want to fix the worst-case scenario, we should employ Knuth-Morris-Pratt or Boyer-Moore.
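
To make the naive alternative concrete, here is a rough sketch of what I have in mind. `naiveBreakSubstring` is a hypothetical name, not something in the library, and I'm assuming `isPrefixOf` boils down to a single `memcmp` (that is how I read the current source, but I haven't measured anything):

```haskell
import qualified Data.ByteString as B
import           Data.ByteString (ByteString)

-- | Hypothetical naive counterpart of 'breakSubstring': try every offset
-- in turn and compare the pattern against the bytes starting there.
naiveBreakSubstring :: ByteString -> ByteString -> (ByteString, ByteString)
naiveBreakSubstring pat src = go 0
  where
    lp = B.length pat
    ls = B.length src
    go i
      | i + lp > ls                     = (src, B.empty)   -- not found
      | pat `B.isPrefixOf` B.drop i src = B.splitAt i src  -- match at offset i
      | otherwise                       = go (i + 1)
```

This is still _O(mn)_ in the worst case, same as the current code, but it has no hashing overhead per byte. Whether it is actually faster on typical inputs would need benchmarks.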

	karpRabin src
	    | length src < lp = (src,empty)
	    | otherwise = search (rollingHash $ unsafeTake lp src) lp
	  where
	    k           = 2891336453 :: Word32
	    rollingHash = foldl' (\h b -> h * k + fromIntegral b) 0
	    hp          = rollingHash pat
	    m           = k ^ lp
	    get = fromIntegral . unsafeIndex src
	    search !hs !i
	        | hp == hs && pat == unsafeTake lp b = u
	        | length src <= i                    = (src,empty) -- not found
	        | otherwise                          = search hs' (i + 1)
	      where
	        u@(_, b) = unsafeSplitAt (i - lp) src
	        hs' = hs * k +
	              get i -
	              m * get (i - lp)
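
For anyone reading the `hs'` update above for the first time: it shifts the hashed window one byte to the right without rehashing it. A small sketch that checks this against a full rehash follows. `windowHash`, `roll` and `rollingAgrees` are names I made up for illustration; only the constant `k` and the formula itself come from the quoted code.

```haskell
import qualified Data.ByteString as B
import           Data.ByteString (ByteString)
import           Data.Word (Word32)

-- Multiplier taken from the quoted code.
k :: Word32
k = 2891336453

-- | Hash of a whole window, same fold as the quoted 'rollingHash'.
windowHash :: ByteString -> Word32
windowHash = B.foldl' (\h b -> h * k + fromIntegral b) 0

-- | One rolling step for a window of length 'lp': drop byte 'old' from the
-- front and append byte 'new' at the back (arithmetic wraps modulo 2^32).
roll :: Int -> Word32 -> Word32 -> Word32 -> Word32
roll lp hs old new = hs * k + new - (k ^ lp) * old

-- | Check that rolling agrees with rehashing for every window of 'src'.
rollingAgrees :: Int -> ByteString -> Bool
rollingAgrees lp src = and
  [ roll lp (windowHash (window i)) (byte i) (byte (i + lp))
      == windowHash (window (i + 1))
  | i <- [0 .. B.length src - lp - 1] ]
  where
    window i = B.take lp (B.drop i src)
    byte     = fromIntegral . B.index src
```

The point is that each step costs a couple of multiplications regardless of the pattern length, but every hash hit still triggers a full comparison with `pat`, which is where the _O(mn)_ worst case comes from.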
