anonymizer
Offline · Open Source · AGPL-3.0

Offline PII redactor for legal documents.

Replaces names, companies, legal and financial IDs, addresses, emails and phones with structured tokens in .docx, .pdf and .xlsx, including scanned PDFs when local OCR is available. Runs locally. Russian and English, plus optional Spanish. No telemetry.

$ curl -fsSL anonymizer.site/install | sh
Zero network · macOS · Windows · Linux · AGPL-3.0

Why this exists

Lawyers want AI feedback on contracts but can't paste raw client data into third-party tools. Manual redaction is slow and error-prone, especially for scanned documents. anonymizer automates the redaction step locally so the rest of the AI workflow stays unchanged.

What it strips

Each detected item is replaced with a stable token that keeps its grammatical position in the sentence. The same item gets the same number throughout a session.

Names: John Smith → [Person_1]
Companies: Acme Corp. → [Company_1]
Emails: name@example.com → [EMAIL_1]
Phones: +1 (415) 555-1234 → [PHONE_1]
Tax / legal IDs: EIN 12-3456789 → [NUMBER_1]
IBANs: GB29 NWBK 6016... → [NUMBER_1]
Cards: 4276 1300 ... → [NUMBER_1]
Addresses: 1 Main St, New York → [ADDRESS_1]
Dates: 03/12/2024 → [DATE_1]
Contract refs: Contract No. SVC-2025-0847 → [NUMBER_1]
Passport details: passport 45 11 123456 → [NUMBER_1]
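
For readers curious how session-consistent numbering can work, here is a minimal sketch. It is not anonymizer's actual Token Manager: the `TokenRegistry` class, its `token_for` method and the whitespace/case normalisation are all assumptions made for illustration.

```python
from collections import defaultdict


class TokenRegistry:
    """Hypothetical session-scoped registry: same value, same token."""

    def __init__(self):
        self._tokens = {}                  # (category, normalised value) -> token
        self._counters = defaultdict(int)  # category -> last number handed out

    def token_for(self, category, value):
        # Assumption: collapse whitespace and ignore case so that trivially
        # different spellings of one value share a token.
        key = (category, " ".join(value.split()).casefold())
        if key not in self._tokens:
            self._counters[category] += 1
            self._tokens[key] = f"[{category}_{self._counters[category]}]"
        return self._tokens[key]


registry = TokenRegistry()
print(registry.token_for("Person", "John Smith"))    # [Person_1]
print(registry.token_for("Person", "Jane Doe"))      # [Person_2]
print(registry.token_for("Person", "john  smith"))   # [Person_1]
print(registry.token_for("EMAIL", "a@example.com"))  # [EMAIL_1]
```

A fresh registry starts counting from 1 again, which matches the idea that numbering holds within a session and not across sessions.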

How it works

Drop a file

Drag a .docx, .pdf or .xlsx into the local web UI. Scanned PDFs use local Tesseract OCR when installed.

Detect locally

Natasha and spaCy run on your CPU, and regexes catch structured PII such as emails, phones and ID numbers. The detection step never opens a socket.

Get tokenized doc

Structure preserved, metadata cleared. Original file untouched.

No data leaves your laptop. Ever.
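
The regex part of the detection step can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the shipped detector: the two patterns and the `redact` function are illustrative, and real documents need much broader coverage than this.

```python
import re

# Illustrative patterns only (assumptions, not anonymizer's real rules).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+\d{1,3}[\s-]?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}"),
}


def redact(text):
    """Replace each match with a numbered token, reusing numbers for repeats."""
    for category, pattern in PATTERNS.items():
        seen = {}  # matched value -> token, per category

        def substitute(match, category=category, seen=seen):
            value = match.group(0)
            if value not in seen:
                seen[value] = f"[{category}_{len(seen) + 1}]"
            return seen[value]

        text = pattern.sub(substitute, text)
    return text


print(redact("Call +1 (415) 555-1234 or write to jane@example.com."))
# Call [PHONE_1] or write to [EMAIL_1].
```

Everything here is standard-library string processing, so nothing in this step needs network access.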

Install

macOS / Linux
$ curl -fsSL anonymizer.site/install | sh
Windows PowerShell
$ iwr -useb anonymizer.site/install.ps1 | iex
Manual / corporate
$ uv tool install docs-anonymizer

See /docs/installation/manual for SHA256 and offline mirror options.

Why it's safe to install

Zero network in core

A build-time test asserts the redaction engine opens no socket. Any regression in the network policy fails CI before a release ships.
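
One way such a test can be written is sketched below. It is an illustration of the idea, not the project's real test: `forbid_sockets`, `NetworkAccessError` and `redact_stub` are hypothetical names, and the stub stands in for the actual redaction entry point.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when guarded code tries to create a socket."""


@contextmanager
def forbid_sockets():
    """Temporarily replace socket.socket so any attempt to open one fails."""
    original = socket.socket

    def guarded(*args, **kwargs):
        raise NetworkAccessError("socket creation attempted")

    socket.socket = guarded
    try:
        yield
    finally:
        socket.socket = original  # always restore the real class


def redact_stub(text):
    # Hypothetical stand-in for the real redaction engine.
    return text.replace("John Smith", "[Person_1]")


def test_redaction_opens_no_socket():
    with forbid_sockets():
        assert redact_stub("Signed by John Smith") == "Signed by [Person_1]"
```

If the code under test ever tried to create a socket inside the guarded block, the test would raise `NetworkAccessError` and the build would fail.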

Open source AGPL-3.0

Full source ships as sdist alongside the wheel on PyPI.

No telemetry

Feedback is opt-in via an in-UI button. No passive analytics, ever.

Published SHA256

Every release publishes the wheel's SHA256 in /version.json. Before trusting the install on a sensitive machine, check the wheel that 'uv tool install' fetched against that manifest.
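
A minimal sketch of that check, assuming you have already downloaded the wheel and copied the expected digest out of /version.json by hand. The helper names `sha256_of` and `matches_manifest` are hypothetical, and the layout of /version.json is deliberately not assumed here.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels never sit fully in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, expected_hex):
    """Compare the file's digest with the published one, ignoring case."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Call `matches_manifest(wheel_path, published_digest)` and proceed only if it returns `True`.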

Using ChatGPT on your contracts?

Strip the personal data locally first. Our step-by-step guide shows the safe workflow — and how anonymizer compares to other tools.

FAQ

Does it work on scanned PDFs?
Yes, for printed scans and hybrid PDFs, provided local Tesseract OCR is installed with the English and Russian language packs. Handwriting and embedded images in DOCX/XLSX still require manual review.
Why AGPL?
Because anonymizer depends on PyMuPDF, which is AGPL-licensed. The sdist ships alongside the wheel.
Can I disable detectors I don't need?
Yes, per-category toggles are available in the UI for the current session.
Does it phone home?
During document processing, no. On startup the app makes one outbound request — a GET to anonymizer.site/version.json with a 2-second timeout — to surface the in-app update banner. No user data is sent. To disable it, launch as anonymize --no-update-check. Optional OCR setup during install may use your package manager to install Tesseract.
Will tokens stay consistent across runs?
Within a session yes; across sessions no — by design. See Token Manager docs.
Does it run on Linux?
Yes. Use curl -fsSL anonymizer.site/install | sh or install manually with uv tool install docs-anonymizer. On non-apt distros, install Tesseract OCR language packs manually if you need scanned PDFs.
Is it stable for production legal work?
It is pilot-grade; the golden corpus enforces ≥99% recall on P0 categories. Decide for yourself.
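
For context on what a recall threshold means, here is a small worked sketch. The spans, the exact-match comparison and the `recall` helper are illustrative assumptions, not the project's golden-corpus harness.

```python
def recall(expected, detected):
    """Share of expected PII spans that the detector actually found."""
    if not expected:
        return 1.0  # nothing to find, so nothing was missed
    return len(expected & detected) / len(expected)


# Hypothetical hand-labelled spans: (start, end, category).
expected = {(0, 10, "Person"), (25, 41, "EMAIL"), (50, 66, "PHONE")}
detected = {(0, 10, "Person"), (25, 41, "EMAIL"), (70, 75, "DATE")}

score = recall(expected, detected)
print(f"recall = {score:.2f}")        # recall = 0.67
print("gate passed:", score >= 0.99)  # gate passed: False
```

Recall only counts missed items, so the spurious DATE detection does not lower the score. A missed item is the costly error for redaction, which is why a recall gate is a natural release check.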