get_media_list
Read persisted media inventory; trigger a re-extraction pass over stored page HTML
Overview
Returns the media asset inventory that the post-scan pipeline has already extracted from stored HTML and persisted to the database for a clone job. This tool does not live-fetch the source site. Set `force_crawl=true` to enqueue the same `media_extraction` pipeline step that the dashboard "Rescan Media" button uses. That step re-extracts from stored page HTML; it never re-crawls the source.
How It Works
- Resolves a clone job for the caller from `job_id` (preferred) or by matching `site_url`'s hostname against `clone_job.source_domain` for the current user.
- Default mode (no `force_crawl`): reads the persisted media inventory for the resolved job from the `media_asset` table via `getPersistedMediaForJob`.
- If no job is resolved, returns a clear "no clone job found" response — no network request is made to the site.
- If a job exists but extraction has not produced any rows yet, returns a structured `no_media_yet` response telling you to call again with `force_crawl: true` (or wait for the post-comparison pipeline to run).
- When `force_crawl=true`, enqueues the `media_extraction` pipeline step for the resolved job (same call path as the dashboard rescan button) and returns the queue status. Re-extraction reads stored page HTML — it does NOT re-crawl the source site.
- Once the queued step completes, call this tool again without `force_crawl` to read the refreshed persisted results.
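
The branching described above can be sketched as a small in-memory simulation. This is an illustration only, not the tool's actual implementation: the `CloneJob` class, the `queue` list, and the `no_job_found`, `queued`, and `ok` status strings are hypothetical stand-ins, and the behavior when `job_id` is given but unknown (no fallback to `site_url`) is an assumption. Only `no_media_yet` and `status_field: "media_extraction"` come from the description itself.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CloneJob:
    # Hypothetical in-memory stand-in for a clone_job row.
    job_id: str
    user_id: str
    source_domain: str
    media_rows: list = field(default_factory=list)  # stand-in for media_asset rows


def resolve_job(jobs, user_id, site_url, job_id=None):
    """Prefer job_id; otherwise match site_url's hostname to source_domain."""
    owned = [j for j in jobs if j.user_id == user_id]
    if job_id is not None:
        # Assumption: an unknown job_id does not fall back to site_url.
        return next((j for j in owned if j.job_id == job_id), None)
    host = urlparse(site_url).hostname
    matches = [j for j in owned if j.source_domain == host]
    # Assumes `jobs` is ordered oldest to newest, so the last match is the most recent.
    return matches[-1] if matches else None


def get_media_list(jobs, queue, user_id, site_url, job_id=None, force_crawl=False):
    job = resolve_job(jobs, user_id, site_url, job_id)
    if job is None:
        return {"status": "no_job_found"}  # no request is made to the site
    if force_crawl:
        # Enqueue re-extraction from stored HTML; nothing is re-crawled.
        queue.append(("media_extraction", job.job_id))
        return {"status": "queued", "status_field": "media_extraction"}
    if not job.media_rows:
        return {"status": "no_media_yet", "hint": "call again with force_crawl: true"}
    return {"status": "ok", "media": list(job.media_rows)}
```

Note that a `force_crawl` call in this sketch returns only the queue status; the refreshed inventory is read by a later call without `force_crawl`.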
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `site_url` | string | required | URL of a clone job's source site. Used to look up the most recent clone job for the current user when `job_id` is not provided. No network request is made to this URL. |
| `job_id` | string | optional | Clone job ID to read persisted media results from. Takes precedence over `site_url`-based lookup. |
| `force_crawl` | boolean | optional | If true, enqueues the `media_extraction` pipeline step for the resolved job. Requires a resolvable clone job. Re-extracts from stored page HTML; does not re-crawl the source site. |
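
As a rough illustration of how the three parameters combine, the helper below assembles an argument object for a call. The `build_args` function is hypothetical and only mirrors the table above; how the arguments are actually transmitted depends on the client you use.

```python
def build_args(site_url, job_id=None, force_crawl=False):
    """Build the argument object for a get_media_list call (illustrative helper)."""
    if not site_url:
        # site_url is listed as required, even when job_id is supplied.
        raise ValueError("site_url is required")
    args = {"site_url": site_url}
    if job_id is not None:
        args["job_id"] = job_id  # takes precedence over the site_url lookup
    if force_crawl:
        args["force_crawl"] = True  # enqueue media_extraction instead of reading
    return args
```

For example, `build_args("https://example.com", job_id="job_123")` yields a read request for that job (the ID is a placeholder), while adding `force_crawl=True` turns the same call into a re-extraction request.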
What You Get Back
- Persisted media inventory: source URLs, pages where each asset was found, file names, and suggested destination paths.
- File type classification (image, video, audio, font, file).
- Variant grouping — WordPress thumbnail variants linked to their base image URL.
- Crawl timestamp from the last successful comparison and a `data_source_note` reflecting the persisted nature of the data.
- When `force_crawl=true`: queue status with `status_field: "media_extraction"`, `run_number`, and any warning (e.g. stale prereq).
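
To make the type classification and variant grouping concrete, here is a minimal sketch that organizes an inventory on the client side. The field names `source_url`, `file_type`, and `variant_of` are assumptions made for illustration; the actual response keys are not specified in this section.

```python
from collections import defaultdict


def group_inventory(assets):
    """Group base assets by file type and collect variants under their base URL."""
    by_type = defaultdict(list)
    variants = defaultdict(list)
    for asset in assets:
        base = asset.get("variant_of")  # hypothetical link to the base image URL
        if base:
            variants[base].append(asset["source_url"])
        else:
            by_type[asset["file_type"]].append(asset["source_url"])
    return dict(by_type), dict(variants)
```

Keeping variants in a separate mapping means each base image appears once in the per-type listing, with its thumbnail sizes available on lookup.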
Example Use Case
After a clone comparison finishes, call `get_media_list` with the source `site_url` to see every image, PDF, video, and font the pipeline catalogued, grouped by type, with thumbnail variants linked to their originals. If the inventory is missing newly added assets, call again with `force_crawl: true` to re-extract from stored HTML, then re-read the persisted results once the queued step completes.
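
From the caller's side, the refresh workflow might look like the sketch below. Both `call_tool` and `wait_until_done` are hypothetical callables supplied by the caller: the first stands in for whatever client invokes the tool, and the second for however you learn that the queued step has finished, which this section does not specify.

```python
def refresh_media(call_tool, site_url, wait_until_done):
    """Queue a re-extraction, wait for it, then re-read the persisted inventory."""
    queued = call_tool("get_media_list", {"site_url": site_url, "force_crawl": True})
    wait_until_done(queued)  # completion detection is left to the caller
    # Second call omits force_crawl so it reads the refreshed persisted rows.
    return call_tool("get_media_list", {"site_url": site_url})
```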
