pdf-mcp

A Model Context Protocol (MCP) server that enables AI agents to read, search, and extract content from PDF files. Built with Python and PyMuPDF, with SQLite-based caching for persistence across server restarts.

mcp-name: io.github.jztan/pdf-mcp

Features

8 specialized tools for different PDF operations
SQLite caching — persistent cache survives server restarts (essential for STDIO transport)
Paginated reading — read large PDFs in manageable chunks
Full-text search — find content without loading the entire document
Image extraction — extract images as base64 PNG
URL support — read PDFs from HTTP/HTTPS URLs

Installation

pip install pdf-mcp

Quick Start

Claude Code

claude mcp add pdf-mcp -- pdf-mcp

Or add to ~/.claude.json:

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

Config file location:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

Restart Claude Desktop after updating the config.

Visual Studio Code

Requires VS Code 1.102+ with GitHub Copilot.

CLI:

code --add-mcp '{"name":"pdf-mcp","command":"pdf-mcp"}'

Command Palette:

Open Command Palette (Cmd/Ctrl+Shift+P)
Run MCP: Open User Configuration (global) or MCP: Open Workspace Folder Configuration (project-specific)

Add the configuration:

{
  "servers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

Save. VS Code will automatically load the server.

Manual: Create .vscode/mcp.json in your workspace:

{
  "servers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

Codex CLI

codex mcp add pdf-mcp -- pdf-mcp

Or configure manually in ~/.codex/config.toml:

[mcp_servers.pdf-mcp]
command = "pdf-mcp"

Kiro

Create or edit .kiro/settings/mcp.json in your workspace:

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp",
      "args": [],
      "disabled": false
    }
  }
}

Save and restart Kiro.

Other MCP Clients

Most MCP clients use a standard configuration format:

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

With uvx (for isolated environments):

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "uvx",
      "args": ["pdf-mcp"]
    }
  }
}

Verify Installation

pdf-mcp --help

Tools

`pdf_info` — Get Document Information

Returns page count, metadata, table of contents, file size, and estimated token count. Call this first to understand a document before reading it.

"Read the PDF at /path/to/document.pdf"

`pdf_read_pages` — Read Specific Pages

Read selected pages to manage context size.
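
Page selections such as `1-10` or `15, 20, 25-30` (the forms used in this section's example prompts) could be expanded into individual page numbers roughly as follows. This is a minimal sketch, not pdf-mcp's actual parser: the function name `parse_page_spec` and the exact syntax it accepts are assumptions made for illustration.

```python
from __future__ import annotations


def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Expand a spec like "15, 20, 25-30" into sorted 1-based page numbers.

    Hypothetical helper for illustration; not part of pdf-mcp's API.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that are reversed or fall outside the document.
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Validating against the page count up front lets an agent receive a clear error instead of a partial read.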

"Read pages 1-10 of the PDF"
"Read pages 15, 20, and 25-30"

`pdf_read_all` — Read Entire Document

Read a complete document in one call. Subject to a safety limit on page count, so it suits short documents best.

"Read the entire PDF (it's only 10 pages)"

`pdf_search` — Search Within PDF

Find relevant pages before loading content.
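
The search-first idea can be sketched as a scan over per-page text that returns only the matching page numbers and a short snippet, so the agent can decide which pages to read. This is an illustrative sketch under the assumption that page text has already been extracted into a dictionary; `search_pages` and its return shape are hypothetical, not pdf-mcp's actual interface.

```python
from __future__ import annotations


def search_pages(page_text: dict[int, str], query: str, context: int = 40) -> list[dict]:
    """Return one hit per matching page: the page number and a short snippet.

    Hypothetical helper; assumes page text was extracted beforehand.
    """
    needle = query.lower()
    if not needle:
        return []  # an empty query would otherwise match every page
    hits = []
    for page_number in sorted(page_text):
        text = page_text[page_number]
        index = text.lower().find(needle)  # case-insensitive match
        if index == -1:
            continue
        start = max(0, index - context)
        end = min(len(text), index + len(query) + context)
        hits.append({"page": page_number, "snippet": text[start:end].strip()})
    return hits
```

Returning snippets rather than full pages keeps the response small, which is the point of searching before loading content.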

"Search for 'quarterly revenue' in the PDF"

`pdf_get_toc` — Get Table of Contents

"Show me the table of contents"

`pdf_extract_images` — Extract Images

"Extract images from pages 1-5"

`pdf_cache_stats` — View Cache Statistics

"Show PDF cache statistics"

`pdf_cache_clear` — Clear Cache

"Clear expired PDF cache entries"

Example Workflow

For a large document (e.g., a 200-page annual report):

User: "Summarize the risk factors in this annual report"

Agent workflow:
1. pdf_info("report.pdf")
   → 200 pages, TOC shows "Risk Factors" on page 89

2. pdf_search("report.pdf", "risk factors")
   → Relevant pages: 89-110

3. pdf_read_pages("report.pdf", "89-100")
   → First batch

4. pdf_read_pages("report.pdf", "101-110")
   → Second batch

5. Synthesize answer from chunks
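
Steps 3 and 4 above split the relevant pages into two batches. One way an agent or client could compute such batches is sketched below; `page_batches` and the batch size of 12 are assumptions chosen to reproduce the ranges in the workflow, not values taken from pdf-mcp.

```python
from __future__ import annotations


def page_batches(first: int, last: int, batch_size: int) -> list[str]:
    """Split the inclusive range first..last into "start-end" strings.

    Hypothetical helper for illustration only.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    start = first
    while start <= last:
        end = min(start + batch_size - 1, last)  # clamp the final batch
        batches.append(f"{start}-{end}")
        start = end + 1
    return batches


print(page_batches(89, 110, 12))  # ['89-100', '101-110']
```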

Caching

The server uses SQLite for persistent caching. This is necessary because MCP servers using STDIO transport are spawned as a new process for each conversation, so an in-memory cache would be lost whenever that process exits.

Cache location: ~/.cache/pdf-mcp/cache.db (default; the directory can be changed with PDF_MCP_CACHE_DIR)

What's cached:

Data	Benefit
Metadata	Avoid re-parsing document info
Page text	Skip re-extraction
Images	Skip re-encoding
TOC	Skip re-parsing

Cache invalidation:

Automatic when file modification time changes
Manual via the pdf_cache_clear tool
TTL: 24 hours by default (configurable via PDF_MCP_CACHE_TTL)
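
The two automatic invalidation rules (changed modification time, expired TTL) can be illustrated with a small SQLite-backed lookup. This is a sketch under stated assumptions: the table name, columns, and the functions `open_cache`, `cache_put`, and `cache_get` are hypothetical and do not describe pdf-mcp's actual schema.

```python
from __future__ import annotations

import sqlite3
import time

TTL_SECONDS = 24 * 3600  # mirrors the 24-hour default described above


def open_cache(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache database and create the (assumed) table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        "path TEXT PRIMARY KEY, mtime REAL, cached_at REAL, payload TEXT)"
    )
    return conn


def cache_put(conn, path: str, mtime: float, payload: str, now: float | None = None) -> None:
    """Store a payload together with the file's mtime and the time of caching."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, ?)",
        (path, mtime, now, payload),
    )
    conn.commit()


def cache_get(conn, path: str, mtime: float, now: float | None = None) -> str | None:
    """Return the cached payload, or None if it is missing, stale, or expired."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT mtime, cached_at, payload FROM metadata WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        return None
    cached_mtime, cached_at, payload = row
    if cached_mtime != mtime:
        return None  # file was modified since it was cached
    if now - cached_at > TTL_SECONDS:
        return None  # entry is older than the TTL
    return payload
```

In real use the `mtime` argument would come from something like `os.path.getmtime(path)`; it is passed in explicitly here so the logic is easy to test.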

Configuration

Environment variables:

# Cache directory (default: ~/.cache/pdf-mcp)
PDF_MCP_CACHE_DIR=/path/to/cache

# Cache TTL in hours (default: 24)
PDF_MCP_CACHE_TTL=48
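
For readers curious how such variables are typically consumed, the sketch below resolves both settings with the documented defaults. The function `load_cache_settings` is hypothetical and only illustrates the pattern; it is not pdf-mcp's actual configuration code.

```python
from __future__ import annotations

import os
from pathlib import Path


def load_cache_settings(env=None) -> tuple[Path, float]:
    """Return (cache directory, TTL in hours), falling back to the defaults.

    Hypothetical helper; pass a dict as `env` to override os.environ in tests.
    """
    env = os.environ if env is None else env
    cache_dir = Path(env.get("PDF_MCP_CACHE_DIR") or Path.home() / ".cache" / "pdf-mcp")
    ttl_hours = float(env.get("PDF_MCP_CACHE_TTL") or 24)
    return cache_dir, ttl_hours
```

Treating an empty value the same as an unset one (via `or`) avoids surprises when a variable is exported but left blank.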

Development

git clone https://github.com/jztan/pdf-mcp.git
cd pdf-mcp

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Type checking
mypy src/

# Linting
flake8 src/

# Formatting
black src/

Why pdf-mcp?

	Without pdf-mcp	With pdf-mcp
Large PDFs	Context overflow	Chunked reading
Repeated access	Re-parse every time	SQLite cache
Finding content	Load everything	Search first
Tool design	Single monolithic tool	8 specialized tools

Contributing

Contributions are welcome. Please submit a pull request.

License

MIT — see LICENSE.
