Merged
26 commits
976e416
new Ahrefs success story
sabine Dec 12, 2024
26f6ae0
fmt
sabine Dec 12, 2024
937fa17
Update data/success_stories/ahrefs.md
sabine Dec 14, 2024
96bebec
Update data/success_stories/ahrefs.md
sabine Dec 14, 2024
b4a9510
Update data/success_stories/ahrefs.md
sabine Dec 14, 2024
04e29bc
clarification
sabine Dec 14, 2024
64241b8
Update data/success_stories/ahrefs.md
sabine Dec 14, 2024
94f38c4
Update data/success_stories/ahrefs.md
sabine Dec 14, 2024
df9f84b
be more vague on number of requests frontend/backend
sabine Dec 14, 2024
698b5b8
devkit / bindings
sabine Dec 14, 2024
4f826ee
Update data/success_stories/ahrefs.md
sabine Dec 14, 2024
691850f
Update src/ocamlorg_web/lib/redirection.ml
sabine Dec 14, 2024
28c9934
rewrite taking into account feedback, reframe around always being an …
sabine Jun 18, 2025
735d12a
add relevant BuckleScript -> ReScript context
sabine Jun 18, 2025
bb8e8ac
edits
sabine Jul 1, 2025
696f5fd
two success stories
sabine Jul 1, 2025
c25de3a
new image for full stack story
sabine Jul 1, 2025
d9a60d7
Update data/success_stories/ahrefs-full-stack-web.md
sabine Jul 4, 2025
74a9551
addressing @davesnx review, thanks Dave
sabine Jul 4, 2025
f273424
shorten list of why reasons
sabine Jul 4, 2025
faef1d8
remove redirect bc it's two stories
sabine Jul 11, 2025
b3e447e
redirect for title change of old ahrefs story
sabine Jul 11, 2025
c804417
Apply suggestions from code review @Khady
sabine Jul 25, 2025
5ed211a
Apply suggestions from code review
sabine Jul 25, 2025
01e8710
editing
sabine Aug 12, 2025
2cb8ab8
remove full stack web success story (moved to another PR)
sabine Oct 2, 2025
56 changes: 56 additions & 0 deletions data/success_stories/ahrefs-petabyte-crawler.md
@@ -0,0 +1,56 @@
---
title: Petabyte-Scale Web Crawling and Data Processing
logo: success-stories/ahrefs.svg
card_logo: success-stories/white/ahrefs.svg
background: /success-stories/ahrefs-bg.jpg
theme: blue
synopsis: "Ahrefs built the world's third-largest web crawler using OCaml, indexing petabytes of web data with a lean, efficient team."
url: https://ahrefs.com/
priority: 2
why_ocaml_reasons:
- Performance
- Reliability
- Expressiveness
- Scalability
- Maintainability
---

## Challenge

[Ahrefs](https://ahrefs.com/) is a Singapore-based SaaS company that provides comprehensive SEO tools and marketing intelligence powered by big data. Since 2011, they've been crawling the entire web daily to maintain extensive databases of backlinks, keywords, and website analytics that help businesses with SEO strategy, competitor analysis, and content optimization. Today, they're trusted by 44% of Fortune 500 companies.

Building and operating a web crawler at internet scale presents extraordinary challenges. Ahrefs needs to index billions of web pages continuously, process petabytes of data in real time, and turn this massive dataset into actionable insights for thousands of customers worldwide. The technical demands are staggering: their systems must handle **500 billion backend requests per day** while maintaining **over 100PB of storage**.

As a self-funded company, Ahrefs couldn't solve these challenges by throwing unlimited resources at the problem. They needed maximum efficiency from a small team — systems that could run reliably for months without intervention, code that could be understood and maintained by a lean engineering organization, and performance that could compete with tech giants despite having a fraction of their headcount.

The question wasn't just whether they could build a web-scale crawler, but whether they could do it sustainably with the constraints of a bootstrapped company.

## Result

Over a decade later, Ahrefs operates one of the world's most sophisticated web crawling operations. Their OCaml-powered systems maintain an index of **492.7 billion pages** across **500.4 million domains**.

This technical achievement translates directly to business success. Ahrefs has grown into a **$100M+ ARR company** with **150 employees** managing **4000+ servers**—all while maintaining their original philosophy of operational efficiency. They've become the sector leader in SEO tools, proving that the right technology choices can create sustainable competitive advantages.

The reliability of their OCaml systems is perhaps most impressive: programs written years ago continue running without surprises, requiring minimal maintenance from their engineering team. This "boring" reliability has allowed Ahrefs to focus engineering effort on building new features and capabilities rather than fighting infrastructure fires.

Their success demonstrates that OCaml can power not just technical excellence at massive scale, but sustainable business growth in highly competitive markets.

## Solution

Ahrefs built their crawling infrastructure around OCaml's strengths, creating a distributed system that balances performance, reliability, and maintainability. **[OCaml](https://ocaml.org/)** serves as the primary language for all crawling and data processing systems, compiled natively for maximum performance across their **4000+ servers**.

Their architecture treats data consistency as paramount. By defining shared data structures (using **[ATD (Adjustable Type Definitions)](https://github.com/ahrefs/atd)**, and now moving to [melange-json](https://github.com/melange-community/melange-json)), they ensure type safety throughout their processing pipeline, from initial web crawling to final data storage. This approach catches schema mismatches at compile time rather than at runtime, which is crucial when processing billions of pages daily.
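
As a rough illustration of the idea in the paragraph above, the sketch below shares one record type between a producer stage and a consumer stage. It is a minimal, hand-written example: the `page` type, its fields, and the `crawl` and `to_row` functions are hypothetical names chosen for illustration, not Ahrefs code, and a real setup would typically generate such types from ATD definitions rather than write them by hand.

```ocaml
(* Hypothetical shared schema: one record type used by every stage. *)
type page = {
  url : string;
  status : int;
  outlinks : string list;
}

(* Producer stage (stand-in for a crawler): builds a [page] value.
   The values here are placeholders, not real crawl results. *)
let crawl (url : string) : page =
  { url; status = 200; outlinks = [ "https://example.com/about" ] }

(* Consumer stage (stand-in for storage): turns a [page] into a
   tab-separated row. If a field of [page] is renamed or its type
   changes, this function no longer compiles until it is updated. *)
let to_row (p : page) : string =
  Printf.sprintf "%s\t%d\t%d" p.url p.status (List.length p.outlinks)

let () = print_endline (to_row (crawl "https://example.com/"))
```

Because both stages refer to the same `page` definition, a schema change surfaces as a compiler error in every stage that touches the affected field, instead of as malformed data discovered later.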

Their storage layer combines **[ClickHouse](https://clickhouse.com/)**, **[MySQL](https://www.mysql.com/)**, and **[Elasticsearch](https://www.elastic.co/)**. The key insight was designing these systems to work together seamlessly through shared OCaml types rather than complex API layers.

Ahrefs maintains their own libraries and frameworks rather than relying on generic solutions. This "build it ourselves" philosophy requires more initial investment but delivers systems perfectly tailored to web crawling demands. Their **1.5 million lines of OCaml code** represent years of accumulated domain expertise encoded in reliable, maintainable software.

The result is a unified system where improvements to crawling algorithms, data processing pipelines, or storage efficiency can be implemented quickly and deployed confidently across their entire infrastructure.

## Why OCaml

* **Low maintenance burden**: OCaml systems built years ago continue running without intervention, allowing engineers to focus on new development rather than troubleshooting production issues.
* **Static typing catches errors**: At petabyte scale, compile-time type checking prevents data format inconsistencies and runtime failures that would be expensive to debug in production environments processing large volumes of web data.
* **Language expressiveness reduces development time**: OCaml's abstractions enabled building domain-specific systems efficiently rather than adapting existing frameworks. Small teams could develop complex crawling and data processing systems with relatively few lines of code.
* **Performance**: Native compilation provides the throughput needed for processing billions of daily requests while maintaining code readability for long-term maintenance.
* **Cost-effective specialized tooling**: OCaml made it practical to build custom systems tailored to specific requirements rather than using general-purpose solutions, which aligned with their business constraints of limited engineering resources.
30 changes: 0 additions & 30 deletions data/success_stories/ahrefs.md

This file was deleted.

2 changes: 2 additions & 0 deletions src/ocamlorg_web/lib/redirection.ml
@@ -252,6 +252,8 @@ let from_v2 =
    ("/docs/platform-users", Url.tool_page "platform-users");
    ("/docs/platform-roadmap", Url.tool_page "platform-roadmap");
    ("/docs/configuring-your-editor", Url.tutorial "set-up-editor");
    ( "/success-stories/peta-byte-scale-web-crawler",
      Url.success_story "peta-byte-scale-web-crawling-and-data-processing" );
  ]

let make ?(permanent = false) t =
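
The hunk above adds one pair to a list of (old path, new URL) redirects. A minimal sketch of how such a table could be consulted is shown below. It assumes a hypothetical `success_story` helper and a hypothetical `resolve` function; neither is the site's actual `Url` module or `make` function, whose bodies are not shown in this diff.

```ocaml
(* Hypothetical stand-in for a URL helper; not the site's real module. *)
let success_story slug = "/success-stories/" ^ slug

(* Old path -> new URL, stored as an association list. *)
let redirects =
  [
    ( "/success-stories/peta-byte-scale-web-crawler",
      success_story "peta-byte-scale-web-crawling-and-data-processing" );
  ]

(* Returns [Some target] when [path] has a redirect, [None] otherwise. *)
let resolve path = List.assoc_opt path redirects
```

An association list keeps each redirect as a single readable line in the source, at the cost of a linear scan per lookup, which is usually acceptable for a table of this size.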