discover_all_pages
Comprehensive page discovery using sitemaps, crawling, and intelligent mapping
Overview
Before you can clone a site, you need to know every page it has. This tool combines multiple discovery methods — XML sitemap parsing, recursive HTML crawling, and intelligent page mapping — to find every page on a WordPress site, including hidden ones that sitemaps miss.
How It Works
- Parses XML sitemaps (sitemap.xml, sitemap_index.xml, wp-sitemap.xml) for known pages.
- Crawls the site recursively, following internal links to discover pages not in the sitemap.
- Uses intelligent mapping to find JavaScript-rendered routes and orphaned pages.
- Normalizes all URLs to prevent duplicates (trailing slashes, query params).
- Classifies each page by type (homepage, product, blog post, category, etc.).
- Creates a cloning job in the database for tracking progress.
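The sitemap step above can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation: the function name `parse_sitemap` is hypothetical, and it assumes the sitemap follows the standard sitemaps.org schema. It separates page URLs from child sitemap URLs, so an index file such as sitemap_index.xml can be followed one level at a time.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (page_urls, child_sitemap_urls) found in one sitemap document.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    pages: list[str] = []
    children: list[str] = []

    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # An index file lists other sitemaps rather than pages.
        for entry in root.findall(f"{SITEMAP_NS}sitemap"):
            loc = entry.find(f"{SITEMAP_NS}loc")
            if loc is not None and loc.text:
                children.append(loc.text.strip())
    elif root.tag == f"{SITEMAP_NS}urlset":
        # A regular sitemap lists page URLs directly.
        for entry in root.findall(f"{SITEMAP_NS}url"):
            loc = entry.find(f"{SITEMAP_NS}loc")
            if loc is not None and loc.text:
                pages.append(loc.text.strip())

    return pages, children
```

A caller would fetch each child sitemap in turn and merge the resulting page URLs with those found by crawling.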
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_url | string | required | Homepage URL of the WordPress site |
| clone_domain | string | optional | Domain of the clone site (for automatic URL mapping) |
| max_pages | number | optional | Maximum pages to discover (default: 5000) |
| job_id | string | optional | Existing job ID to update instead of creating new |
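As a rough sketch of how these parameters fit together, the helper below assembles an argument payload. The function `build_discover_args` and its validation rules are assumptions made for illustration. Only the four parameter names and the default of 5000 come from the table above; how the payload is actually sent to the tool depends on your client.

```python
from typing import Optional
from urllib.parse import urlparse


def build_discover_args(
    source_url: str,
    clone_domain: Optional[str] = None,
    max_pages: Optional[int] = None,
    job_id: Optional[str] = None,
) -> dict:
    """Build an argument dict for discover_all_pages (hypothetical helper)."""
    parsed = urlparse(source_url)
    # Assumed check: the tool expects an absolute homepage URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("source_url must be an absolute http(s) URL")

    args: dict = {"source_url": source_url}
    # Optional parameters are only included when set, so the tool's own
    # defaults (for example max_pages = 5000) apply otherwise.
    if clone_domain is not None:
        args["clone_domain"] = clone_domain
    if max_pages is not None:
        if max_pages <= 0:
            raise ValueError("max_pages must be a positive number")
        args["max_pages"] = max_pages
    if job_id is not None:
        args["job_id"] = job_id
    return args
```

For example, `build_discover_args("https://example.com", clone_domain="clone.example.net")` yields a payload with just those two keys.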
What You Get Back
- Job ID for tracking the cloning job
- Total pages discovered
- Page list with URLs, titles, and detected types
- Source-to-clone URL mappings (when clone_domain provided)
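To make the source-to-clone mapping concrete, here is one plausible way such a mapping could be derived. It assumes the clone keeps the same paths and only the host changes. The names `map_to_clone` and `build_mappings` are hypothetical, and the tool's real mapping logic may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def map_to_clone(source_url: str, clone_domain: str) -> str:
    """Swap the host of source_url for clone_domain, keeping everything else.

    Assumption: the clone mirrors the source's path structure exactly.
    """
    parts = urlsplit(source_url)
    return urlunsplit(
        (parts.scheme, clone_domain, parts.path, parts.query, parts.fragment)
    )


def build_mappings(source_urls: list[str], clone_domain: str) -> dict[str, str]:
    """Return a {source_url: clone_url} dict for a list of discovered pages."""
    return {url: map_to_clone(url, clone_domain) for url in source_urls}
```

Each pair in the resulting dict is the kind of input a page-by-page comparison would need.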
Example Use Case
Run this on telepresencerobots.com and it finds 983 unique pages across multiple discovery methods — including product pages, category archives, and blog posts that the sitemap didn't list.
Tips
- Always provide clone_domain if you know it; this pre-maps source URLs to clone URLs for compare_page_pair.
- Discovery runs are deduplicated, so re-running on the same domain won't create duplicate entries.
- The job ID returned here is used by compare_page_pair, run_full_comparison, and get_migration_status to track your cloning progress.
- URL normalization automatically strips trailing slashes and collapses duplicates.
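The normalization described in the last tip might look roughly like the sketch below. Only the trailing-slash stripping and duplicate collapsing come from this page. Lowercasing the scheme and host, sorting query parameters, and dropping fragments are assumptions added for illustration, and `normalize_url` and `dedupe` are hypothetical names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of url so equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    # Strip trailing slashes, but keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # Assumption: parameter order is irrelevant, so sort for stability.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Assumption: fragments never identify a separate page, so drop them.
    return urlunsplit((scheme, host, path, query, ""))


def dedupe(urls: list[str]) -> list[str]:
    """Normalize every URL and drop repeats, preserving first-seen order."""
    return list(dict.fromkeys(normalize_url(u) for u in urls))
```

With this scheme, `https://example.com/blog/` and `https://example.com/blog` collapse into a single entry.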
