discover_all_pages

Comprehensive page discovery using sitemaps, crawling, and intelligent mapping

Overview

Before you can clone a site, you need to know every page it has. This tool combines three discovery methods (XML sitemap parsing, recursive HTML crawling, and intelligent page mapping) to find the pages on a WordPress site, including hidden ones that sitemaps miss.

How It Works

  1. Parses XML sitemaps (sitemap.xml, sitemap_index.xml, wp-sitemap.xml) for known pages.
  2. Crawls the site recursively, following internal links to discover pages not in the sitemap.
  3. Uses intelligent mapping to find JavaScript-rendered routes and orphaned pages.
  4. Normalizes all URLs to prevent duplicates (trailing slashes, query params).
  5. Classifies each page by type (homepage, product, blog post, category, etc.).
  6. Creates a cloning job in the database for tracking progress.
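
Steps 1, 4, and 5 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function names (parse_sitemap, normalize_url, classify, merge_discovered), the choice to drop query strings entirely, and the WooCommerce-style path prefixes used for classification are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, strip the trailing slash, drop query and fragment.

    Assumption: dropping the whole query string is one possible policy;
    the real tool may keep some parameters.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def classify(url: str) -> str:
    """Guess a page type from the path (hypothetical prefixes)."""
    path = urlsplit(url).path
    if path == "/":
        return "homepage"
    if path.startswith("/product/"):
        return "product"
    if path.startswith(("/product-category/", "/category/")):
        return "category"
    return "other"


def merge_discovered(*url_lists: list[str]) -> dict[str, str]:
    """Merge URLs from several discovery methods into {normalized_url: page_type}."""
    pages: dict[str, str] = {}
    for urls in url_lists:
        for url in urls:
            key = normalize_url(url)
            pages.setdefault(key, classify(key))
    return pages


if __name__ == "__main__":
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/</loc></url>"
        "<url><loc>https://example.com/product/robot-a/</loc></url>"
        "</urlset>"
    )
    crawled = ["https://example.com/product/robot-a?ref=nav"]
    for url, page_type in merge_discovered(parse_sitemap(sitemap), crawled).items():
        print(page_type, url)
```

Keying the merged result on the normalized URL is what lets a page found by both the sitemap and the crawler count once.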

Input Parameters

Parameter     Type    Required  Description
source_url    string  required  Homepage URL of the WordPress site
clone_domain  string  optional  Domain of the clone site (for automatic URL mapping)
max_pages     number  optional  Maximum pages to discover (default: 5000)
job_id        string  optional  Existing job ID to update instead of creating a new job
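
As a rough illustration of how these parameters fit together, the sketch below assembles and validates an argument dictionary before it is handed to the tool. The helper name build_discovery_args and its validation rules are assumptions made for this example; only the four parameter names come from the table above.

```python
from typing import Optional
from urllib.parse import urlsplit


def build_discovery_args(
    source_url: str,
    clone_domain: Optional[str] = None,
    max_pages: Optional[int] = None,
    job_id: Optional[str] = None,
) -> dict:
    """Build the arguments for discover_all_pages (hypothetical helper).

    Optional values that are not set are omitted so the tool's own
    defaults apply (for example, max_pages falls back to 5000).
    """
    parts = urlsplit(source_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("source_url must be an absolute http(s) URL")
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be a positive number")

    args: dict = {"source_url": source_url}
    if clone_domain is not None:
        args["clone_domain"] = clone_domain
    if max_pages is not None:
        args["max_pages"] = max_pages
    if job_id is not None:
        args["job_id"] = job_id
    return args


if __name__ == "__main__":
    # "staging.example.com" is a placeholder clone domain.
    print(build_discovery_args(
        "https://telepresencerobots.com",
        clone_domain="staging.example.com",
        max_pages=2000,
    ))
```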

What You Get Back

Example Use Case

Run this on telepresencerobots.com and it finds 983 unique pages by combining its discovery methods, including product pages, category archives, and blog posts that the sitemap didn't list.

Tips

Always provide clone_domain if you know it — this pre-maps source URLs to clone URLs for compare_page_pair.
Discovery runs are deduplicated, so re-running on the same domain won't create duplicate entries.
The job ID returned here is used by compare_page_pair, run_full_comparison, and get_migration_status to track your cloning progress.
URL normalization automatically strips trailing slashes and collapses duplicates.
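
To make the first tip concrete, here is one way source URLs could be pre-mapped to clone URLs once clone_domain is known. It is a sketch only: the function name map_to_clone is hypothetical, and it assumes the clone keeps the same paths as the source and differs only in host.

```python
from urllib.parse import urlsplit, urlunsplit


def map_to_clone(source_url: str, clone_domain: str) -> str:
    """Swap the host of source_url for clone_domain, keeping everything else.

    Assumption: the clone mirrors the source's path structure exactly.
    """
    parts = urlsplit(source_url)
    return urlunsplit(
        (parts.scheme, clone_domain, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # "staging.example.com" is a placeholder clone domain.
    pairs = [
        (url, map_to_clone(url, "staging.example.com"))
        for url in (
            "https://telepresencerobots.com/",
            "https://telepresencerobots.com/product/sample",
        )
    ]
    for source, clone in pairs:
        print(source, "->", clone)
```

Each resulting pair is the kind of input a page-by-page comparison such as compare_page_pair would need.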

Related Tools