# PDF Metadata Is Leaking Your Personal Information: What's Hidden and How to Remove It

> Most PDFs carry hidden metadata: author names, edit history, software versions, sometimes GPS data. Here is what is hiding in your files and why stripping it matters.

- **Author:** Neetusha
- **Published:** 2026-05-23
- **URL:** https://www.redactifyai.com/blog/pdf-metadata-privacy-risks/

---

In 2005, the Italian intelligence agency SISMI submitted a document to Reuters about the abduction of Abu Omar in Milan. The PDF looked clean. But its metadata contained the names of two SISMI officers and the document's creation and modification timestamps. That information helped establish when the agency knew about the abduction and who had been involved. The metadata contradicted the official timeline and contributed to the prosecution of 26 CIA operatives and 7 SISMI agents.

In 2003, the UK government published a dossier on Iraq's weapons capabilities. The file, released as a Microsoft Word document, carried a revision log naming the officials who had edited it, including a junior press officer, not intelligence analysts. Researchers also showed that sections had been copied from a 12-year-old graduate thesis. The discoveries undermined the dossier's credibility and became a central element of the political controversy that followed. PDFs exported from office documents can inherit the same kind of author details.

These are not niche technical problems. Every PDF carries metadata. Most of it is invisible during normal viewing. Some of it can expose personal information, organizational details, and document history that the publisher never intended to share.

**PDF metadata** is structured information embedded in a PDF file that describes the document itself: who created it, when, with what software, and how it has been revised. Unlike the visible content on the page, metadata lives in separate data structures within the file and doesn't appear during normal viewing. Most PDF creation tools populate metadata fields automatically, without the author realizing it.

## What metadata lives inside a PDF

PDF files store metadata in multiple locations, each with different levels of visibility and different implications for privacy.

### Document Information Dictionary

The most accessible metadata lives in the PDF's Document Information Dictionary: the properties you see when you open File > Properties in a PDF viewer. Standard fields include:

- **Title**: may auto-populate from the document's first heading or a filename
- **Author**: typically set to the username of the account that created the file (e.g., "jsmith" or "Jane Smith")
- **Subject**: sometimes used for internal document classifications
- **Keywords**: occasionally contains internal tags or project codes
- **Creator**: the application that created the original file (e.g., "Microsoft Word 2021", "Adobe InDesign 2024")
- **Producer**: the application or library that generated the PDF (e.g., "Adobe PDF Library 17.0", "wkhtmltopdf", "Prince 15.1")
- **Creation Date**: when the PDF was first created
- **Modification Date**: when the PDF was last modified

The Author field is the most common privacy leak. When someone creates a PDF from a Word document, the PDF inherits the Author property from the Word file, which defaults to the Microsoft 365 account name or Windows username. Organizations frequently publish "anonymous" reports and legal filings where the Author field reveals exactly who wrote them.
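To make the fields above concrete, here is a minimal Python sketch that pulls Document Information Dictionary values out of raw PDF bytes with a regular expression. It assumes the dictionary is stored uncompressed and that values are plain literal strings; hex strings, UTF-16 text, encrypted files, and dictionaries inside compressed object streams are not handled. The function name `read_info_fields` and the sample bytes are hypothetical.

```python
import re

# Keys defined for the Document Information Dictionary in the PDF specification.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords",
             "Creator", "Producer", "CreationDate", "ModDate")


def read_info_fields(pdf_bytes: bytes) -> dict:
    """Return literal-string Info values found in uncompressed PDF bytes.

    Assumption: values look like "(...)" without nested parentheses.
    """
    fields = {}
    for key in INFO_KEYS:
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(((?:\\.|[^\\()])*)\)"
        matches = re.findall(pattern, pdf_bytes, flags=re.DOTALL)
        if matches:
            # Keep the last occurrence; appended updates usually come later in the file.
            fields[key] = matches[-1].decode("latin-1")
    return fields


# Hypothetical fragment standing in for a real file's Info dictionary.
sample = (b"1 0 obj\n<< /Title (Q3 Report) /Author (jsmith) "
          b"/Creator (Microsoft Word 2021) /Producer (Adobe PDF Library 17.0) "
          b"/CreationDate (D:20240115093000-05'00') >>\nendobj\n")

info = read_info_fields(sample)
print(info["Author"])        # jsmith
print(info["CreationDate"])  # D:20240115093000-05'00'
```

For real files, a PDF library or ExifTool is the more reliable route; the point of the sketch is that these values often sit in the file as readable text.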

### XMP metadata

The Extensible Metadata Platform (XMP), developed by Adobe, stores metadata in an XML format embedded within the PDF. XMP metadata can duplicate the Document Information Dictionary fields but also stores additional data:

- **Dublin Core properties**: title, description, creator, date, rights, format
- **PDF-specific properties**: PDF version, producer, trapped status
- **Photoshop/image properties**: if the PDF was created from or contains images processed in Photoshop, the XMP block may include color profiles, camera settings, and editing history
- **Rights management**: copyright notices, usage terms, license identifiers
- **Custom schemas**: applications can embed arbitrary metadata using custom XMP namespaces

XMP metadata is worth watching closely because it can survive PDF editing operations that strip the Document Information Dictionary. A PDF that has been "cleaned" in one viewer may still carry XMP metadata that reveals authorship and creation details.
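A quick way to see whether XMP survived a cleanup is to search the raw bytes for XMP packets. The sketch below assumes the metadata stream is stored uncompressed (common, but not guaranteed) and only reads `dc:creator`; the function name `find_xmp_creators` and the sample packet are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def find_xmp_creators(pdf_bytes: bytes) -> list:
    """Collect dc:creator values from every XMP packet in the raw bytes."""
    creators = []
    # An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
    packets = re.findall(rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=",
                         pdf_bytes, flags=re.DOTALL)
    for packet in packets:
        try:
            root = ET.fromstring(packet.strip())
        except ET.ParseError:
            continue  # Skip damaged packets instead of failing the whole scan.
        for creator in root.iter(f"{{{DC_NS}}}creator"):
            for item in creator.iter(f"{{{RDF_NS}}}li"):
                if item.text:
                    creators.append(item.text.strip())
    return creators


# Hypothetical metadata stream with a single author entry.
sample = (
    b"stream\n"
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>\n"
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
    b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
    b"<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>"
    b"<dc:creator><rdf:Seq><rdf:li>Jane Smith</rdf:li></rdf:Seq></dc:creator>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>\n"
    b"<?xpacket end='w'?>\nendstream\n"
)

print(find_xmp_creators(sample))  # ['Jane Smith']
```

An empty result does not prove the file is clean, since a compressed metadata stream would be missed by this byte-level search.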

### Incremental saves and revision history

PDF files support incremental saves: when a PDF is modified and saved, the new content is appended to the end of the file while the original content remains. This means a PDF that has been edited may contain every previous version of its content within the same file.

For redaction, the implications are severe. If someone opens a PDF, adds a black rectangle over a Social Security number, and saves the file with an incremental save (which is the default behavior in most PDF editors), the original unredacted content remains in the file. A drawn rectangle only covers the text in any case; it does not delete the text from the page content. Anyone who opens the file in a hex editor, a forensic PDF tool, or a PDF library that can read earlier revisions of the file can recover the "redacted" content.

Adobe Acrobat Pro's redaction tool addresses this by performing a "save as" operation that rebuilds the file structure after redaction, stripping incremental save data. But most other PDF editors, including many online redaction tools, perform incremental saves that preserve the original content. For more on why [common PDF redaction methods fail](/blog/adobe-redaction-risks-why-not-safe), see our guide on Adobe's approach.
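The append-only behavior can be illustrated in a few lines of Python. Each save ends with its own `%%EOF` marker, so counting markers hints at how many revisions a file holds, and cutting the file at an earlier marker yields the earlier revision. This is a sketch under simplifying assumptions: the sample bytes are a heavily abbreviated, hypothetical stand-in for a PDF, linearized files legitimately contain more than one marker, and the marker text could also occur inside stream data.

```python
def revision_end_offsets(pdf_bytes: bytes) -> list:
    """Return the byte offset just past each %%EOF marker."""
    marker = b"%%EOF"
    offsets = []
    start = 0
    while True:
        pos = pdf_bytes.find(marker, start)
        if pos == -1:
            return offsets
        offsets.append(pos + len(marker))
        start = pos + len(marker)


def earlier_revision(pdf_bytes: bytes, index: int) -> bytes:
    """Truncate the file at the end of revision `index` (0 = original save)."""
    return pdf_bytes[: revision_end_offsets(pdf_bytes)[index]]


# Hypothetical file: an original save plus one appended update.
original = b"%PDF-1.7\n4 0 obj (SSN 123-45-6789) endobj\ntrailer\n%%EOF\n"
update = b"4 0 obj (black rectangle added) endobj\ntrailer\n%%EOF\n"
edited = original + update

print(len(revision_end_offsets(edited)))               # 2
print(b"123-45-6789" in earlier_revision(edited, 0))   # True
```

More than one marker is a prompt to look closer, not proof of a leak; a single marker after a full rewrite is what a properly rebuilt file typically shows.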

### Embedded file metadata

PDFs can contain embedded files: images, fonts, attachments, and other documents. Each of these embedded files carries its own metadata.

**Image EXIF data** is the most privacy-sensitive. When a photograph is embedded in a PDF, the image's EXIF metadata may include:

- **GPS coordinates**: the exact location where the photo was taken, accurate to within a few meters
- **Camera information**: device make and model (e.g., "iPhone 15 Pro", "Canon EOS R5")
- **Date and time**: when the photograph was taken, often including timezone
- **Serial numbers**: some cameras embed their serial number in EXIF data
- **Thumbnail images**: EXIF data can contain a thumbnail of the original image, which may show content that was subsequently cropped

A property listing PDF that deliberately omits the street address could still reveal the building's exact location through GPS coordinates in a photo's EXIF data, and a photograph taken at home could reveal the photographer's home address the same way. A legal filing containing photographs as exhibits could expose the device and location of the person who took them.
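As a rough illustration, the following Python sketch scans raw PDF bytes for JPEG EXIF segments and reports whether each one references a GPS directory. It assumes the image is stored as an unmodified JPEG stream with no extra compression or encryption layered on top, and it only inspects the first image directory (IFD0). The function names and the hand-built sample are hypothetical.

```python
import struct

GPS_IFD_TAG = 0x8825  # EXIF tag in IFD0 that points to the GPS sub-directory


def exif_has_gps(tiff: bytes) -> bool:
    """Check whether IFD0 of a TIFF-formatted EXIF block references GPS data."""
    if tiff[:2] == b"II":
        endian = "<"   # little-endian
    elif tiff[:2] == b"MM":
        endian = ">"   # big-endian
    else:
        return False
    (ifd_offset,) = struct.unpack_from(endian + "I", tiff, 4)
    (entry_count,) = struct.unpack_from(endian + "H", tiff, ifd_offset)
    for i in range(entry_count):
        (tag,) = struct.unpack_from(endian + "H", tiff, ifd_offset + 2 + 12 * i)
        if tag == GPS_IFD_TAG:
            return True
    return False


def scan_for_exif(pdf_bytes: bytes) -> list:
    """Return (offset, has_gps) for each EXIF APP1 segment in the raw bytes."""
    findings = []
    header = b"\xff\xe1"         # JPEG APP1 marker
    signature = b"Exif\x00\x00"  # EXIF identifier after the 2-byte length
    pos = pdf_bytes.find(header)
    while pos != -1:
        if pdf_bytes[pos + 4: pos + 10] == signature:
            (length,) = struct.unpack_from(">H", pdf_bytes, pos + 2)
            tiff = pdf_bytes[pos + 10: pos + 2 + length]
            try:
                findings.append((pos, exif_has_gps(tiff)))
            except struct.error:
                findings.append((pos, False))  # Truncated block: flag it, no GPS claim.
        pos = pdf_bytes.find(header, pos + 2)
    return findings


# Hypothetical minimal EXIF block: TIFF header plus one IFD0 entry (GPS pointer).
tiff = (b"II" + struct.pack("<HI", 42, 8) + struct.pack("<H", 1)
        + struct.pack("<HHII", GPS_IFD_TAG, 4, 1, 26) + struct.pack("<I", 0))
segment = b"\xff\xe1" + struct.pack(">H", 2 + 6 + len(tiff)) + b"Exif\x00\x00" + tiff
sample = b"stream\n\xff\xd8" + segment + b"\xff\xd9\nendstream"

print(scan_for_exif(sample))  # [(9, True)]
```

A hit means the image deserves inspection with a full EXIF reader; the sketch does not decode the coordinates themselves.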

**Font metadata** is a less obvious vector. Fonts embedded in a PDF may contain licensing information, designer names, and version identifiers that reveal what software and systems were used to create the document.

### JavaScript and form data

PDFs can contain JavaScript code and interactive form fields. Form fields may retain submitted data in their values (`/V`) or default values (`/DV`) even after the form appears to be "blank." JavaScript within a PDF can contain embedded data, URLs, and logic that reveals information about the document's purpose and origin.
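A rough way to check for leftover form data is to look for field objects whose `/V` (value) or `/DV` (default value) entry is non-empty. The Python sketch below assumes uncompressed objects and simple literal strings, so it will miss fields stored in compressed object streams or encoded as hex or UTF-16. The function name `form_field_values` and the sample objects are hypothetical.

```python
import re


def form_field_values(pdf_bytes: bytes) -> list:
    """List (name, key, value) for fields whose /V or /DV holds a literal string."""
    results = []
    # Split the file into indirect objects: "N G obj ... endobj".
    bodies = re.findall(rb"\d+\s+\d+\s+obj(.*?)endobj", pdf_bytes, flags=re.DOTALL)
    for body in bodies:
        name = re.search(rb"/T\s*\(([^()]*)\)", body)  # /T holds the field name
        if not name:
            continue
        for key, value in re.findall(rb"/(V|DV)\s*\(([^()]*)\)", body):
            if value:  # Ignore empty strings such as "/DV ()"
                results.append((name.group(1).decode("latin-1"),
                                key.decode("ascii"),
                                value.decode("latin-1")))
    return results


# Hypothetical form with one filled field and one leftover default value.
sample = (b"7 0 obj\n<< /FT /Tx /T (ssn) /V (123-45-6789) /DV () >>\nendobj\n"
          b"8 0 obj\n<< /FT /Tx /T (email) /DV (jane@example.com) >>\nendobj\n")

for field in form_field_values(sample):
    print(field)
# ('ssn', 'V', '123-45-6789')
# ('email', 'DV', 'jane@example.com')
```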

### Digital signatures and certificates

Signed PDFs contain certificate information that identifies the signer: name, email address, organization, and sometimes the certificate authority that issued the signing certificate. This is by design (the whole point of a digital signature is identity verification), but it means that a signed PDF inherently cannot be fully anonymized without invalidating the signature.

## Who is at risk from PDF metadata leaks

**Law firms and legal departments.** Legal filings routinely contain author metadata that identifies the drafting attorney, creation dates that reveal when work began, and producer information that shows what software was used. For sealed filings, confidential settlements, or anonymous complaints, this metadata can [undermine the confidentiality](/blog/law-firms-pii-pdf-mistakes) that the document's content was designed to protect.

The [FTC v. Microsoft](https://www.ftc.gov/legal-library/browse/cases-proceedings/2210077-microsoftactivision-blizzard-matter) antitrust trial over the Activision Blizzard acquisition produced a concrete example of what PDF mishandling looks like at scale. In September 2023, trial exhibit PDFs uploaded to the court's public server contained confidential material that had not been properly removed, exposing internal email exchanges, PowerPoint decks, and business strategy documents. The court was forced to pull all trial exhibits from its public server twice during the proceedings. Microsoft was identified as responsible for uploading the problematic files. If it can happen to a company with dedicated legal and compliance teams in a federal antitrust case, it can happen to any firm.

**Healthcare organizations.** PDFs containing patient records, lab results, or insurance documents carry metadata that may identify the creating clinician, the medical records system, and the facility. Under [HIPAA](/blog/hipaa-redaction-requirements-healthcare), metadata that can be tied to an identifiable patient is part of the protected health information that must be safeguarded.

**Government agencies.** FOIA responses, published reports, and policy documents carry metadata that reveals authorship, drafting timelines, and the software environment of the originating agency. The UK Iraq dossier and SISMI cases demonstrate the political consequences. For agencies processing [FOIA requests](/blog/foia-redaction-guide-government-agencies), metadata stripping is a standard requirement.

**Real estate and financial services.** Closing documents, appraisals, and financial reports in PDF format carry author, organization, and timestamp metadata. For [real estate transactions](/blog/redact-documents-real-estate) involving confidential parties or non-disclosure agreements, this metadata can reveal the identity of parties who intended to remain anonymous.

**Journalism and activism.** Whistleblowers and sources who submit documents as PDFs may be identifiable through the author field, creation date, or producer metadata, even if the document content has been sanitized. Press organizations receiving leaked documents should strip metadata before publication.

## How to check what metadata your PDFs contain

### Using Adobe Acrobat

Open the PDF in Adobe Acrobat Pro or Reader. Go to File > Properties. The Description tab shows the Document Information Dictionary fields. The Custom tab shows any custom metadata fields. For XMP metadata, Acrobat Pro offers a fuller view through the Additional Metadata button on the Description tab (the exact location varies between Acrobat versions).

### Using free command-line tools

**[ExifTool](https://exiftool.org/)** (by Phil Harvey) is the most comprehensive metadata reading tool for PDFs. Running `exiftool document.pdf` displays the metadata from the Document Information Dictionary and XMP. Adding the `-ee` (extract embedded) option also reads metadata from JPEG images embedded within the PDF, which most PDF viewers don't expose.

**pdfinfo** (part of the Poppler/Xpdf utilities) displays the Document Information Dictionary fields. It's simpler than ExifTool but covers the most common metadata exposure points.

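**QPDF** can rewrite a PDF into an uncompressed, human-readable form (its QDF mode, `qpdf --qdf`), which makes metadata objects, form fields, and embedded file structures inspectable in a text editor. Because QPDF writes a fresh file from the current revision, look at the original file's bytes (for example, by counting `%%EOF` markers) when checking for incremental save data.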

### What to look for

When inspecting your PDFs for metadata privacy risks, focus on:

1. **Author and Creator fields**: do they reveal an individual's name or username?
2. **Creation and modification dates**: do they reveal timeline information that should be confidential?
3. **Producer field**: does it reveal your software environment (specific versions of commercial software, internal tools)?
4. **Embedded image EXIF**: do any embedded images contain GPS coordinates, camera identifiers, or timestamps?
5. **Incremental save data**: has the PDF been edited in a way that preserves previous versions of the content?
6. **Form field values**: do any form fields retain submitted data?
7. **Digital signature certificates**: do embedded certificates reveal identity information beyond what was intended?
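Item 2 deserves a closer look, because PDF dates carry more than a day and time. They use the form `D:YYYYMMDDHHmmSS` followed by an optional UTC offset, so a timestamp can also reveal the author's time zone and working hours. The Python sketch below parses that format; the function name `parse_pdf_date` is hypothetical, and the pattern covers the common variants rather than every producer quirk.

```python
import re
from datetime import datetime, timedelta, timezone

PDF_DATE_RE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)(?:00'?00'?)?|([+-])(\d{2})'?(\d{2})?'?)?"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string.

    Returns an aware datetime when the string carries a UTC offset,
    and a naive datetime otherwise.
    """
    match = PDF_DATE_RE.fullmatch(value.strip())
    if not match:
        raise ValueError(f"not a PDF date: {value!r}")
    year, month, day, hour, minute, second, zulu, sign, off_h, off_m = match.groups()
    tz = None
    if zulu:
        tz = timezone.utc
    elif sign:
        delta = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)
    return datetime(int(year), int(month or 1), int(day or 1),
                    int(hour or 0), int(minute or 0), int(second or 0), tzinfo=tz)


created = parse_pdf_date("D:20240115233000-05'00'")
print(created.isoformat())  # 2024-01-15T23:30:00-05:00
```

In this made-up example, the value shows a document created at 11:30 pm in a UTC-5 time zone, which is the kind of detail a "confidential timeline" check should catch.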

## How to remove PDF metadata

The approach depends on the level of assurance you need.

### Basic: Adobe Acrobat Pro

Acrobat Pro's "Remove Hidden Information" feature (Protection > Remove Hidden Information) scans for and removes metadata, comments, embedded files, hidden layers, and other non-visible content. It's effective for the Document Information Dictionary and most XMP metadata, but it may not catch all embedded image EXIF data or incremental save remnants.

Acrobat Pro's "Sanitize Document" feature goes further, performing a Save As that rebuilds the file and strips incremental save data. For most publishing use cases, this combination is adequate.

### Intermediate: ExifTool

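ExifTool can remove metadata from a PDF, but on its own the change is reversible: ExifTool edits PDFs with an incremental update, so the old values stay in the file until it is rewritten. Following up with a rewrite in QPDF drops the superseded objects:

```bash
exiftool -all= document.pdf
qpdf --linearize document.pdf cleaned.pdf
```

The first command clears the Document Information Dictionary and XMP metadata. It also leaves a backup of the untouched original named `document.pdf_original`, which should be deleted or kept out of anything you share. The second command writes a fresh file, `cleaned.pdf`, without the earlier revision. Neither step addresses embedded image EXIF or form field values. Those require additional processing.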

### Thorough: dedicated redaction tools

For documents where metadata removal is a compliance requirement rather than a convenience, a dedicated tool that addresses all metadata layers (Document Information Dictionary, XMP, embedded image EXIF, incremental save data, form fields, and comments) in a single pass provides the assurance that a multi-tool manual process cannot.

The distinction matters when the consequences of missed metadata are regulatory fines or identity exposure rather than minor embarrassment. [Proper PDF redaction](/blog/how-to-redact-pdf-complete-guide) handles both the visible content and the invisible metadata simultaneously.

## The metadata problem in redaction workflows

Here's the irony: many redaction tools focus entirely on the visible content (the names, SSNs, and account numbers that appear on the page) while leaving the document's metadata untouched. You end up with a PDF where every personal identifier in the body text has been carefully removed, but the Author field still says "Jane Smith, Esq." and the creation date reveals exactly when the document was drafted.

A complete redaction workflow builds metadata removal in from the start, not as a separate step at the end. When [RedactifyAI](/) processes a document, it handles both the visible PII (names, identifiers, financial data in the document text) and the invisible metadata (author fields, timestamps, embedded image data) in the same pass. The output is a document that's clean at both levels.

For [organizations building AI pipelines](/blog/redact-documents-before-ai-llm), metadata is an additional concern: document metadata fed into RAG systems or fine-tuning datasets can expose organizational information and individual identities even when the document body has been sanitized.

## A metadata removal checklist

Before publishing, sharing, or archiving any PDF:

- [ ] Check File > Properties for author, creator, and title fields
- [ ] Verify creation and modification dates don't reveal confidential timing
- [ ] Inspect embedded images for EXIF/GPS data
- [ ] Confirm the file has been rebuilt (Save As, not Save) to eliminate incremental save history
- [ ] Check for form fields retaining submitted values
- [ ] Strip comments and annotations that may contain PII
- [ ] Verify custom metadata and XMP properties have been cleared
- [ ] For signed documents, confirm that signature certificate details are appropriate for the audience

This checklist addresses the metadata dimension. For visible content redaction (names, numbers, and identifiers in the document body), see our guide to [redacting documents safely](/blog/how-to-redact-documents-safely).
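Parts of this checklist can be spot-checked automatically before a file goes out. The Python sketch below counts a handful of raw-byte markers and flags the ones worth a manual look. The marker list, function names, and sample bytes are illustrative assumptions rather than a complete audit: a zero count does not prove the file is clean, because metadata can sit inside compressed streams.

```python
def residual_metadata_report(pdf_bytes: bytes) -> dict:
    """Count raw-byte markers that suggest metadata survived a cleaning pass."""
    markers = {
        "info_author": b"/Author",
        "info_creator": b"/Creator",
        "xmp_packet": b"<?xpacket begin",
        "exif_segment": b"Exif\x00\x00",
        "eof_markers": b"%%EOF",
        "signature": b"/ByteRange",
    }
    return {name: pdf_bytes.count(needle) for name, needle in markers.items()}


def needs_review(report: dict) -> list:
    """Return the names of checks that deserve a manual look."""
    flagged = [name for name, count in report.items()
               if name != "eof_markers" and count > 0]
    # One %%EOF is normal; more than one may indicate incremental saves.
    if report["eof_markers"] > 1:
        flagged.append("eof_markers")
    return flagged


# Hypothetical, heavily abbreviated files for illustration.
cleaned = b"%PDF-1.7\n1 0 obj\n<< /Producer (x) >>\nendobj\ntrailer\n%%EOF\n"
leaky = cleaned + b"2 0 obj\n<< /Author (jsmith) >>\nendobj\ntrailer\n%%EOF\n"

print(needs_review(residual_metadata_report(cleaned)))  # []
print(needs_review(residual_metadata_report(leaky)))    # ['info_author', 'eof_markers']
```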

[Upload a PDF](/tools/redact-pdf-free/) and see what metadata is hiding in your files alongside any personal information in the content. For ongoing metadata management across your organization, [sign up free](https://app.redactifyai.com/auth/signup) or [book a demo](/support).

## Frequently asked questions

### Can PDF metadata reveal who authored an anonymous document?

Yes. The Author field in a PDF's Document Information Dictionary typically auto-populates with the creating user's name or username. If the PDF was generated from a Word document, it inherits Word's Author property, which defaults to the Microsoft 365 account name. Even if the author field is cleared, the XMP metadata may contain a separate dc:creator field with the same information. Multiple metadata locations must be checked and cleared to achieve true anonymity.

### Do screenshots avoid PDF metadata risks?

Partially. Converting a document to a screenshot (image) and then to a PDF eliminates the original document's metadata. However, the image itself may contain EXIF metadata including the device used to take the screenshot, the date and time, and the user's operating system. The resulting PDF will also have its own metadata (author, creation date, producer) populated by whatever tool created the PDF from the image. Screenshots reduce but don't eliminate metadata risk.

### Does emailing a PDF strip its metadata?

No. Email transmission does not modify the contents of attached files. A PDF sent as an email attachment arrives at the recipient with all of its metadata intact. Email servers and clients may add metadata to the email message itself, but they don't alter the attachments. The PDF's author field, creation date, revision history, and embedded image EXIF data all survive email transmission unchanged.

### Can incremental saves really expose redacted content?

Yes. When a PDF is edited and saved incrementally (the default in most editors), the new content is appended to the file while the original content remains in the file's byte stream. If someone "redacts" a PDF by drawing black rectangles over text and then saves the file with an incremental save, the original text is still present in the file. Each saved revision ends with its own `%%EOF` marker, so truncating the file at an earlier marker yields the previous revision, and forensic PDF tools automate this recovery. This is why proper redaction tools perform a "Save As" that rebuilds the file from scratch.

### What metadata do scanned PDFs contain?

Scanned PDFs (created by scanning a physical document) contain the scanner or device metadata (manufacturer, model, sometimes serial number), scan date and time, scanning software name and version, image resolution and color space, and EXIF data if the scan was performed with a camera or phone. They typically don't contain author or revision metadata since there was no digital document creation process. However, if the scanned PDF is subsequently edited (OCR layer added, annotations applied), the editing tool will add its own metadata to the file: author, producer, modification date.

### Is metadata removal legally required?

Several regulatory frameworks implicitly or explicitly call for metadata removal. [HIPAA's de-identification standard](https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html) requires removing 18 categories of identifiers under its Safe Harbor method, and identifiers left in document metadata count just as much as those in the body. [GDPR's data minimization principle](/blog/redact-documents-gdpr-hipaa-compliance) requires that personal data processing be limited to what's necessary, so publishing a document with author metadata that serves no purpose for the recipient is hard to reconcile with it. In US federal courts, rules such as Federal Rule of Civil Procedure 5.2 require redacting certain personal identifiers from filings, and some courts' e-filing guidance additionally warns filers to remove metadata. Many agencies also treat metadata removal as a standard step in FOIA processing.