DingDocs Crawler

A high-performance batch downloader for DingTalk documents built with TypeScript, Puppeteer, and Crawlee.

English | 中文 README

Features

  • ⚡ High Performance: Built on the Crawlee framework for efficient web scraping and file downloading
  • 📄 Multi-format Support: Currently handles various DingTalk document types:
    • Documents
    • Spreadsheets
    • Mind Maps
    • AI Tables
    • Uploaded files (PDF, images, etc.)
    • Nested folders
  • 🛡️ Stable & Reliable: Stealth mode, retry mechanism, and comprehensive error handling

Prerequisites

  • Bun >= 1.2.20

Installing Bun

Using asdf (Recommended)

If you have asdf installed:

# Install bun plugin
asdf plugin add bun

# Install bun (version specified in .tool-versions)
asdf install bun

Manual Installation

Visit bun.sh for installation instructions.

Installation

  1. Clone the repository:
git clone https://github.com/imyelo/dingdocs-crawler.git
cd dingdocs-crawler
  2. Install dependencies:
bun install

Configuration

The crawler uses environment variables for configuration. Create a .env.local file in the project root:

APP_ENTRY_URL=https://your-dingtalk-docs-url-with-folder-page
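Since APP_ENTRY_URL is the only required setting, it can help to check the value before starting a long crawl. The following is a minimal sketch, not code from this project: `isHttpUrl` is a hypothetical helper, and it only confirms that the value parses as an http(s) URL. It cannot tell whether the URL actually points to a folder page.

```typescript
// Hypothetical helper (not part of dingdocs-crawler): checks that a string
// parses as an absolute http or https URL.
function isHttpUrl(value: string): boolean {
  try {
    const url = new URL(value);
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    // new URL() throws on input it cannot parse as an absolute URL.
    return false;
  }
}
```

A value such as `ftp://example.com` or plain text without a scheme is rejected, which catches the most common copy-and-paste mistakes early.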

What's a Folder Page?

Example: [screenshot: image.png]

Configurable Environment Variables

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| APP_ENTRY_URL | Starting URL for crawling; should be a folder page | - | ✅ |
| APP_CRAWLER_TIMEOUT_SECONDS | Total crawler timeout, in seconds | 4500 | ❌ |
| APP_REQUEST_TIMEOUT_SECONDS | Individual request timeout, in seconds | 1800 | ❌ |
| APP_VISIBLE | Show browser window | true | ❌ |
| APP_MAX_CONCURRENCY | Maximum concurrent requests | 1 | ❌ |
| APP_MAX_REQUEST_RETRIES | Retry attempts for failed requests | 10 | ❌ |
| APP_PROXY_URLS | Comma-separated proxy URLs | - | ❌ |
| APP_LOG_PATH | Log file path | ./output.log | ❌ |
| APP_DOWNLOAD_PATH | Download directory | ./downloads | ❌ |
| APP_LOGTAIL_SOURCE_TOKEN | Logtail integration token (leave empty if you don't know what it is) | - | ❌ |
| APP_HEALTHY_UUID | Health check UUID (leave empty if you don't know what it is) | - | ❌ |
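To illustrate how the defaults above could be applied, here is a rough sketch of a config loader. It is an assumption about how such variables might be read, not the project's actual implementation: `CrawlerConfig`, `readNumber`, and `loadConfig` are hypothetical names, and the boolean and comma-splitting rules are illustrative choices. The two optional integration tokens are left out for brevity.

```typescript
// Hypothetical config shape; field names are illustrative, not the project's.
interface CrawlerConfig {
  entryUrl: string;
  crawlerTimeoutSeconds: number;
  requestTimeoutSeconds: number;
  visible: boolean;
  maxConcurrency: number;
  maxRequestRetries: number;
  proxyUrls: string[];
  logPath: string;
  downloadPath: string;
}

type Env = Record<string, string | undefined>;

// Reads a numeric variable, falling back to the documented default when unset or empty.
function readNumber(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${key} must be a number, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): CrawlerConfig {
  const entryUrl = (env["APP_ENTRY_URL"] ?? "").trim();
  if (entryUrl === "") throw new Error("APP_ENTRY_URL is required");

  return {
    entryUrl,
    crawlerTimeoutSeconds: readNumber(env, "APP_CRAWLER_TIMEOUT_SECONDS", 4500),
    requestTimeoutSeconds: readNumber(env, "APP_REQUEST_TIMEOUT_SECONDS", 1800),
    // Assumption: anything other than "false" keeps the browser window visible.
    visible: (env["APP_VISIBLE"] ?? "true").trim().toLowerCase() !== "false",
    maxConcurrency: readNumber(env, "APP_MAX_CONCURRENCY", 1),
    maxRequestRetries: readNumber(env, "APP_MAX_REQUEST_RETRIES", 10),
    // Split the comma-separated list and drop empty entries.
    proxyUrls: (env["APP_PROXY_URLS"] ?? "")
      .split(",")
      .map((url) => url.trim())
      .filter((url) => url.length > 0),
    logPath: (env["APP_LOG_PATH"] ?? "").trim() || "./output.log",
    downloadPath: (env["APP_DOWNLOAD_PATH"] ?? "").trim() || "./downloads",
  };
}
```

Under Bun, `.env.local` is loaded into `process.env` automatically, so a call such as `loadConfig(process.env)` would pick up the values from the file created above.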

Usage

Basic Usage

Start the crawler:

bun start

View Logs

Monitor logs in real-time:

bun run log

License

Apache-2.0 © yelo, 2025 - present
