[DATA-INTEL] 9 min read

Scraping Dark Web Markets for Leaked Corporate Data: A Runbook

Corporate data leaks on dark web markets are rarely exclusive. 63% of corporate email addresses spotted on three major .onion markets were already indexed on clearnet paste sites within 6 hours. The operational edge comes from scraping markets directly, capturing listings before they reach Telegram bots or public breach aggregators. This runbook covers the exact stack, from disposable infrastructure to query logic, used by Th1b4ut’s data-intel pipelines.

What Infrastructure Do You Need to Scrape .onion Sites Without De-Anonymization?

The base unit is a disposable VM booted from a read-only image. All outbound traffic must exit through Tor via torsocks or a transparent proxy. The VM never sees your host network.

$ torsocks curl -s https://check.torproject.org/ | grep "Congratulations"
> Congratulations. This browser is configured to use Tor.

The VM runs firejail to isolate the browser or Python process, and the entire instance is destroyed after each scrape cycle. The scraping host keeps no persistent storage: extracted findings are moved to an air-gapped parsing machine with a one-way USB bridge script.

$ firejail --net=tap0 --ip=10.0.0.2 torsocks python3 scraper.py --market alphabay --target example.com --output /mnt/usb/extract.json
# After the offline machine has copied the extract:
$ shred -n 3 /mnt/usb/extract.json && umount /mnt/usb

How Do You Scrape a Dark Web Market Without Triggering Bans?

Dark web markets implement aggressive anti-automation measures: Cloudflare-style JS challenges, login guards, and per-circuit request-rate throttling. The approach used here is a headless browser (Playwright driving Chromium through Tor) with human-like cursor movements rather than a scripted HTTP client.

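from playwright.sync_api import sync_playwright

# Pass the Tor SOCKS proxy to launch() explicitly rather than relying on the
# ALL_PROXY environment variable, which the browser is not guaranteed to honor.
TOR_PROXY = {"server": "socks5://127.0.0.1:9050"}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, proxy=TOR_PROXY)
    page = browser.new_page()
    page.goto("http://xxxxxxxxxxxx.onion/login", wait_until="networkidle")
    page.fill("input[name='username']", "guest")
    page.fill("input[name='password']", "anonymous")
    page.click("button[type='submit']")
    # Wait for the post-login page to settle instead of sleeping a fixed time
    page.wait_for_load_state("networkidle")
    # Now scrape the listing pages
    browser.close()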

Sessions rotate every 30 requests onto a new Tor circuit. The script sends NEWNYM to the Tor control port before each session.

$ (echo authenticate '"password"'; echo signal NEWNYM; sleep 1) | nc localhost 9051

Which Corporate Indicators Should You Query for Maximum Hits?

Searching by domain name (@example.com) is table stakes. The real leverage comes from querying password hash patterns and internal project names extracted from previous leaks.

A single search for examplecorp2024! (an internal naming convention from a known leak) returns 12 listings across 3 markets that mention “accounts payable” and “VPN credentials”. Domain-based queries miss these entirely.

We maintain a query list derived from past incident response engagements. The hit rate jumps from 2% to 19% when moving from email domain to proprietary tokens.
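The difference between domain-only and token-based querying can be sketched on the alerting side, where scraped listing text is matched against an indicator list. This is a minimal illustration, not the pipeline's actual matcher: the `Indicator` type, the function names, and the sample listings are all hypothetical, and matching is assumed to be a plain case-insensitive substring test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    kind: str   # "domain" or "token" (hypothetical categories)
    value: str


def match_indicators(listing_text, indicators):
    """Return the indicators whose value occurs in the listing text (case-insensitive)."""
    haystack = listing_text.casefold()
    return [ind for ind in indicators if ind.value.casefold() in haystack]


def hit_rate(listings, indicators):
    """Fraction of listings that match at least one indicator."""
    if not listings:
        return 0.0
    hits = sum(1 for text in listings if match_indicators(text, indicators))
    return hits / len(listings)


# Hypothetical indicator list and sample listing texts, for illustration only
INDICATORS = [
    Indicator("domain", "@example.com"),
    Indicator("token", "examplecorp2024!"),
]
SAMPLE_LISTINGS = [
    "Fresh combo list, user@example.com included",
    "VPN credentials, accounts payable share, pattern ExampleCorp2024!",
    "Unrelated listing",
]

if __name__ == "__main__":
    domain_only = [i for i in INDICATORS if i.kind == "domain"]
    print("domain only:", hit_rate(SAMPLE_LISTINGS, domain_only))
    print("domain + tokens:", hit_rate(SAMPLE_LISTINGS, INDICATORS))
```

In this toy sample the second listing never mentions the email domain, so only the proprietary token catches it.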

How to Deduplicate Leaks Across Markets and Formats

Dark web dumps are often repackaged. A single breach can appear as JSON, CSV, and SQL dumps across different listings. Deduplication uses the SHA-256 of normalized rows.

$ torsocks curl -s -X POST http://xxxxxxxxxxxx.onion/api/query -d 'domain=example.com' \
  | jq -r '.[] | "\(.email) \(.hash) \(.breach)"' \
  | sort -u | sha256sum > market_fingerprint.txt

We compare the resulting hash against a local registry of known breaches. 40% of listings in a 2026 sweep were re-uploads of the same 2024 ExxonMobil vendor leak. No net new risk.
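A per-row variant of this comparison might look like the sketch below. It assumes rows have already been parsed out of their JSON, CSV, or SQL container into (email, hash) pairs, and that normalization means trimming whitespace and lowercasing the email only. The function names and the registry shape (a set of hex digests) are illustrative assumptions, not the pipeline's actual schema.

```python
import hashlib


def normalize_row(email, password_hash):
    """Canonical form of a row: trimmed, email lowercased.

    The hash is only trimmed, not lowercased, because some hash encodings
    are case-sensitive.
    """
    return f"{email.strip().lower()}|{password_hash.strip()}"


def row_fingerprint(email, password_hash):
    """SHA-256 hex digest of the normalized row."""
    canonical = normalize_row(email, password_hash)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def overlap_ratio(listing_rows, known_fingerprints):
    """Share of a listing's distinct rows already present in the local registry."""
    fingerprints = {row_fingerprint(e, h) for e, h in listing_rows}
    if not fingerprints:
        return 0.0
    return len(fingerprints & known_fingerprints) / len(fingerprints)


if __name__ == "__main__":
    # Hypothetical registry entry and listing rows
    registry = {row_fingerprint("alice@example.com", "abc123")}
    rows = [(" Alice@Example.com ", "abc123"), ("bob@example.com", "def456")]
    print(overlap_ratio(rows, registry))
```

Hashing row by row, rather than hashing the whole sorted dump, lets a listing that is mostly a re-upload with a few new rows show up as a partial overlap instead of a miss.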

Operational Security Hard Rules for Dark Web Extraction

  1. Never resolve .onion domains on your host OS. The DNS leakage alone can deanonymize you.
  2. Strip all EXIF and metadata from any files downloaded. PDFs from market listings often contain author names pointing to the seller’s real identity.
  3. Purge extracted data within 24 hours unless it triggers a formal incident response. The data itself is toxic; long-term retention creates legal exposure under GDPR and French blocking statutes.
  4. Assume all markets are honeypots. The extracted data is used only for automated alerting, never for browsing. Human analysts never interact directly with the dark web, only with the sanitized JSON output.
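The 24-hour purge in rule 3 could be enforced with a small scheduled job along these lines. This is a sketch under assumptions: extracts are taken to be `*.json` files in a single directory, file modification time stands in for extraction time, and `held_names` is a hypothetical set of file names attached to a formal incident response. Note that `unlink()` only removes the directory entry; secure erasure of the underlying storage is a separate concern.

```python
import time
from pathlib import Path

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention window from rule 3


def purge_expired(extract_dir, held_names, now=None):
    """Delete extracts older than the retention window unless they are on hold.

    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for path in sorted(Path(extract_dir).glob("*.json")):
        if path.name in held_names:
            continue  # tied to an incident response, keep it
        if now - path.stat().st_mtime > RETENTION_SECONDS:
            path.unlink()
            removed.append(path.name)
    return removed
```

Passing `now` explicitly keeps the function deterministic, which makes the retention logic easy to check without waiting a day.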

FAQ

Q: Is it legal to scrape a dark web market in France?
Accessing a .onion market isn’t illegal per se, but downloading and storing stolen credentials may violate data protection laws. Use minimal extraction and air-gapped processing.

Q: What’s the biggest operational risk when scraping markets?
IP leakage: if your real IP leaks during a scrape, market operators can identify and target you. A double-NAT VM with forced Tor egress closes off the most common leak paths.

Q: How fresh are the leaked corporate listings?
Listings stay up for a median of 11 hours before removal. Our pipelines poll every 45 minutes, capturing 93% of listings before they disappear.

Q: Can I use a VPN instead of Tor?
No. A VPN cannot reach .onion addresses at all, since they resolve only inside the Tor network, and it would hand your traffic to a provider that knows your real IP. Tor with circuit isolation is mandatory.

Q: What output format do you use for downstream alerting?
JSON lines with normalized fields (email, domain, breach_name, first_seen). That feeds a private Slack bot that alerts the Security Operations team.
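A record in that shape could be produced as sketched below. The four field names come from the answer above; everything else is an assumption for illustration, including the function name, the choice of ISO 8601 UTC for `first_seen`, and deriving `domain` from the part of the email after the last `@`.

```python
import json
from datetime import datetime, timezone


def to_alert_line(email, breach_name, first_seen):
    """Serialize one finding as a single JSON line.

    first_seen must be a timezone-aware datetime; it is converted to UTC.
    """
    if first_seen.tzinfo is None:
        raise ValueError("first_seen must be timezone-aware")
    email = email.strip().lower()
    record = {
        "email": email,
        "domain": email.rpartition("@")[2],
        "breach_name": breach_name.strip(),
        "first_seen": first_seen.astimezone(timezone.utc).isoformat(),
    }
    # One compact object per line, keys sorted for stable diffs
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    seen = datetime(2026, 5, 17, 12, 0, tzinfo=timezone.utc)
    print(to_alert_line(" Alice@Example.com ", "vendor-leak", seen))
```

One object per line means the downstream bot can process the file as a stream and skip a malformed line without losing the rest.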

Article metadata

$ cat /meta/article.txt
> author: th1b4ut
> published: 2026-05-17
> category: DATA-INTEL
> series: —
> tags: dark-web, scraping, breach-data, opsec, corporate
> license: —