get_media_list
Deep media asset discovery with Zyte rendering, variant grouping, and size estimation
Overview
Finds every image, video, font, PDF, and downloadable file on a WordPress site using JavaScript-rendered crawling via Zyte. Extracts media from CSS background-image, srcset, picture elements, OG/Twitter meta tags, favicons, and @font-face declarations. Groups WordPress thumbnail variants to their originals, preserves WP upload directory structure, and estimates total storage size.
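As a rough illustration of two of the extraction sources named above, the sketch below pulls URLs out of a `srcset` attribute value and out of CSS `url(...)` references. The function names are hypothetical and this is not the tool's actual extractor; the `srcset` parsing is a simplification that assumes URLs contain no commas.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the value.
_CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def urls_from_srcset(srcset: str, page_url: str) -> list[str]:
    """Return absolute URLs for each candidate in a srcset value.

    Simplification: splits on commas, so URLs containing commas
    (for example data: URIs) are not handled.
    """
    found = []
    for candidate in srcset.split(","):
        tokens = candidate.split()
        if tokens:
            # The first token is the URL; the rest is a width/density descriptor.
            found.append(urljoin(page_url, tokens[0]))
    return found


def urls_from_css(css: str, base_url: str) -> list[str]:
    """Return absolute URLs referenced by url(...) in a block of CSS."""
    found = []
    for match in _CSS_URL.finditer(css):
        raw = match.group(2).strip()
        if raw and not raw.startswith("data:"):
            found.append(urljoin(base_url, raw))
    return found
```

Relative references are resolved against the page or stylesheet URL, so the same asset found on different pages ends up with a single absolute URL.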
How It Works
- Discovers pages via XML sitemap (sitemap.xml, sitemap_index.xml, wp-sitemap.xml) before crawling.
- Fetches each page with Zyte (JavaScript-rendered) when available, falling back to static fetch.
- Extracts media from img, picture, srcset, video, audio, iframe, CSS background-image, OG/Twitter meta, favicons, and @font-face.
- Normalizes WordPress thumbnail suffixes (-300x200, -1024x768) to identify base images and groups variants.
- Preserves wp-content/uploads/ directory structure in suggested paths (e.g., /public/uploads/2024/03/photo.jpg).
- Estimates total storage by sampling file sizes via HEAD requests.
- Optionally scans WP export files for media referenced in content and meta fields.
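The thumbnail normalization and path-preservation steps above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the helper names and the `/public/uploads` destination root (taken from the example path above) are assumptions.

```python
import re
from collections import defaultdict
from typing import Optional
from urllib.parse import urlparse

# WordPress-generated sizes end in -<width>x<height> just before the extension.
_THUMB_SUFFIX = re.compile(r"-\d+x\d+(?=\.[A-Za-z0-9]+$)")
_UPLOADS_MARKER = "/wp-content/uploads/"


def base_image_url(url: str) -> str:
    """Strip a trailing -WxH size suffix from the URL path, if present."""
    parts = urlparse(url)
    return parts._replace(path=_THUMB_SUFFIX.sub("", parts.path)).geturl()


def group_variants(urls: list[str]) -> dict[str, list[str]]:
    """Map each base image URL to the thumbnail variants that reduce to it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        base = base_image_url(url)
        variants = groups[base]  # creates the entry even when there are no variants
        if base != url:
            variants.append(url)
    return dict(groups)


def suggested_path(url: str, dest_root: str = "/public/uploads") -> Optional[str]:
    """Keep the year/month structure found under wp-content/uploads/."""
    path = urlparse(url).path
    idx = path.find(_UPLOADS_MARKER)
    if idx == -1:
        return None  # not a WordPress upload; the caller chooses a path
    return f"{dest_root}/{path[idx + len(_UPLOADS_MARKER):]}"
```

One caveat with a purely pattern-based approach: an original file that is itself named like `banner-1920x1080.jpg` would be treated as a variant, so a real implementation may need to confirm that the base URL actually exists.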
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| site_url | string | required | URL of the WordPress site to scan |
| export_path | string | optional | Path to a WP export file for additional media discovery |
| max_pages | number | optional | Maximum pages to scan for media (default: 10) |
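To make the parameters concrete, here is a small sketch that assembles an arguments payload for get_media_list. The `build_arguments` helper and its validation rules are hypothetical; only the parameter names and the default of 10 come from the table above, and how the payload is sent depends on the client you use.

```python
from typing import Any, Optional


def build_arguments(
    site_url: str,
    export_path: Optional[str] = None,
    max_pages: int = 10,
) -> dict[str, Any]:
    """Assemble the arguments for a get_media_list call.

    The checks below are assumptions for illustration, not documented
    behavior of the tool.
    """
    if not site_url.startswith(("http://", "https://")):
        raise ValueError("site_url should be an absolute http(s) URL")
    if max_pages < 1:
        raise ValueError("max_pages should be at least 1")

    args: dict[str, Any] = {"site_url": site_url, "max_pages": max_pages}
    if export_path is not None:
        # Only include the optional parameter when it is actually provided.
        args["export_path"] = export_path
    return args
```

For example, `build_arguments("https://example.com", max_pages=50)` produces a payload that scans more pages than the default.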
What You Get Back
- Complete media inventory with source URLs and pages where each asset was found
- File type classification (image, video, audio, font, file)
- Suggested destination paths preserving WordPress upload structure
- Variant grouping — thumbnail variants linked to base image URL
- Estimated total storage size based on sampled HEAD requests
- Crawl method used (zyte_rendered or fetch_static)
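One way to picture the result is as a typed record. The sketch below is an assumed shape for illustration only: the field names are hypothetical and the real response may be structured differently. The type and crawl-method values are the ones listed above. It assumes Python 3.9 or later.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Literal, Optional

MediaType = Literal["image", "video", "audio", "font", "file"]
CrawlMethod = Literal["zyte_rendered", "fetch_static"]


@dataclass
class MediaAsset:
    url: str                              # source URL of the asset
    media_type: MediaType                 # file type classification
    found_on: list[str]                   # pages where the asset was found
    suggested_path: Optional[str] = None  # destination path, if one applies
    variants: list[str] = field(default_factory=list)  # grouped thumbnail URLs


@dataclass
class MediaListResult:
    assets: list[MediaAsset]
    estimated_total_bytes: Optional[int]  # based on sampled HEAD requests
    crawl_method: CrawlMethod


def count_by_type(result: MediaListResult) -> Counter:
    """Count assets per media type, e.g. to summarize an inventory."""
    return Counter(asset.media_type for asset in result.assets)
```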
Example Use Case
Run this on a WooCommerce site before cloning it. In this example, it discovers 450 product images (1,200 thumbnail variants grouped under 450 originals), 12 PDFs, 3 videos, and 8 web fonts, and estimates 2.1 GB of total storage, all with paths matching the original wp-content/uploads/ structure.
Tips
- Increase max_pages for sites with lots of media spread across many pages.
- Combine with the export file scan to catch media referenced in post content that might not appear on the live site anymore.
- The suggested paths preserve WordPress upload directory structure (year/month) for easy migration.
- Variant grouping helps you download only original images and regenerate thumbnails on the clone.
- Storage estimation samples up to 20 files; accuracy improves with larger image sets.
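To show how a sampled estimate of this kind can work, here is a minimal sketch that reads `Content-Length` from HEAD requests for up to 20 randomly chosen files and scales the mean size to the full inventory. The extrapolation formula and function names are assumptions; the source only states that sizes are sampled via HEAD requests.

```python
import random
import urllib.request
from typing import Callable, Optional


def head_content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the Content-Length reported for a HEAD request, or None."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            value = response.headers.get("Content-Length")
    except (OSError, ValueError):
        return None  # network error, HTTP error, or malformed URL
    return int(value) if value and value.isdigit() else None


def estimate_total_bytes(
    urls: list[str],
    sample_size: int = 20,
    size_of: Callable[[str], Optional[int]] = head_content_length,
    seed: Optional[int] = None,
) -> Optional[int]:
    """Estimate total storage as (mean sampled size) x (number of files)."""
    if not urls:
        return 0
    rng = random.Random(seed)
    sample = rng.sample(urls, min(sample_size, len(urls)))
    sizes = [s for s in (size_of(u) for u in sample) if s is not None]
    if not sizes:
        return None  # no server reported a usable size
    return round(sum(sizes) / len(sizes) * len(urls))
```

Passing `size_of` as a parameter keeps the network call replaceable, which makes the arithmetic easy to check offline. Because a mean-based estimate is sensitive to a few very large files such as videos, estimating each file type separately may give a tighter figure.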
