get_media_list

Read persisted media inventory; trigger a re-extraction pass over stored page HTML

Overview

Returns the media asset inventory that the post-scan pipeline has already extracted from stored HTML and persisted to the database for a clone job. This tool never fetches the source site live. Setting `force_crawl=true` enqueues the same `media_extraction` pipeline step that the dashboard "Rescan Media" button uses: it re-extracts from stored page HTML and never re-crawls the source.

How It Works

Resolves a clone job for the caller from `job_id` (preferred) or by matching `site_url`'s hostname against `clone_job.source_domain` for the current user.
Default mode (no `force_crawl`): reads the persisted media inventory for the resolved job from the `media_asset` table via `getPersistedMediaForJob`.
If no job is resolved, returns a clear "no clone job found" response — no network request is made to the site.
If a job exists but extraction has not produced any rows yet, returns a structured `no_media_yet` response telling you to call again with `force_crawl: true` (or wait for the post-comparison pipeline to run).
When `force_crawl=true`, enqueues the `media_extraction` pipeline step for the resolved job (same call path as the dashboard rescan button) and returns the queue status. Re-extraction reads stored page HTML — it does NOT re-crawl the source site.
Once the queued step completes, call this tool again without `force_crawl` to read the refreshed persisted results.
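
The job resolution order described above can be sketched roughly as follows. This is an illustration only, not the tool's actual implementation: the `CloneJob` class, the `resolve_job` function, and the `created_at` field are hypothetical, and the sketch assumes that a `job_id` lookup is scoped to the current user and does not fall back to `site_url` when it misses.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class CloneJob:
    # Hypothetical, simplified view of a clone_job row.
    id: str
    user_id: str
    source_domain: str
    created_at: int  # assumed ordering field: larger means more recent


def resolve_job(
    jobs: List[CloneJob],
    user_id: str,
    site_url: Optional[str] = None,
    job_id: Optional[str] = None,
) -> Optional[CloneJob]:
    """Pick a job by job_id first, else by site_url hostname for this user."""
    mine = [j for j in jobs if j.user_id == user_id]
    if job_id is not None:
        # job_id takes precedence over the site_url lookup.
        return next((j for j in mine if j.id == job_id), None)
    if site_url is None:
        return None
    # The URL is only parsed; no network request is made to it.
    host = urlparse(site_url).hostname
    matches = [j for j in mine if j.source_domain == host]
    # Most recent matching job, or None when nothing matches.
    return max(matches, key=lambda j: j.created_at, default=None)
```

Note that `urlparse` only yields a hostname when the URL includes a scheme (for example `https://`), so a bare `example.com` would not match in this sketch.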

Input Parameters

Parameter	Type	Required	Description
`site_url`	`string`	required	URL of a clone job's source site. Used to look up the most recent clone job for the current user when job_id is not provided. No network request is made to this URL.
`job_id`	`string`	optional	Clone job ID to read persisted media results from. Takes precedence over site_url-based lookup.
`force_crawl`	`boolean`	optional	If true, enqueues the media_extraction pipeline step for the resolved job. Requires a resolvable clone job. Re-extracts from stored page HTML — does not re-crawl the source site.
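
As a small illustration of the three parameters, the helper below builds an argument payload for a call. The function name `build_arguments` is hypothetical, and how the payload is actually sent depends on your client; only the parameter names come from the table above.

```python
from typing import Any, Dict, Optional


def build_arguments(
    site_url: str,
    job_id: Optional[str] = None,
    force_crawl: bool = False,
) -> Dict[str, Any]:
    """Assemble get_media_list arguments, omitting optional values left unset."""
    args: Dict[str, Any] = {"site_url": site_url}
    if job_id is not None:
        args["job_id"] = job_id  # takes precedence over the site_url lookup
    if force_crawl:
        args["force_crawl"] = True  # enqueue re-extraction from stored HTML
    return args
```

For example, `build_arguments("https://example.com")` produces a default read, while adding `force_crawl=True` produces the payload that enqueues the `media_extraction` step.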

What You Get Back

Persisted media inventory: source URLs, pages where each asset was found, file names, and suggested destination paths.
File type classification (image, video, audio, font, file).
Variant grouping — WordPress thumbnail variants linked to their base image URL.
Crawl timestamp from the last successful comparison and a `data_source_note` reflecting the persisted nature of the data.
When `force_crawl=true`: queue status with `status_field: "media_extraction"`, `run_number`, and any warning (e.g. stale prereq).
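
To make the type classification and variant grouping concrete, here is a minimal sketch that groups inventory rows on the client side. The response schema is not specified in this page, so every key name used here (`source_url`, `type`, `variant_of`) is an assumption, and the sample rows are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Made-up sample rows; the key names are assumptions, not the real schema.
assets = [
    {
        "source_url": "https://example.com/wp-content/uploads/hero.jpg",
        "type": "image",
        "variant_of": None,
    },
    {
        "source_url": "https://example.com/wp-content/uploads/hero-150x150.jpg",
        "type": "image",
        "variant_of": "https://example.com/wp-content/uploads/hero.jpg",
    },
    {
        "source_url": "https://example.com/files/brochure.pdf",
        "type": "file",
        "variant_of": None,
    },
]


def group_by_type(rows: List[dict]) -> Dict[str, List[str]]:
    """Map each file type (image, video, audio, font, file) to its URLs."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        grouped[row["type"]].append(row["source_url"])
    return dict(grouped)


def variants_by_base(rows: List[dict]) -> Dict[str, List[str]]:
    """Map each base image URL to the thumbnail variants linked to it."""
    linked: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        if row["variant_of"]:
            linked[row["variant_of"]].append(row["source_url"])
    return dict(linked)
```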

Example Use Case

After a clone comparison finishes, call `get_media_list` with the source `site_url` to see every image, PDF, video, and font the pipeline catalogued, grouped by type, with thumbnail variants linked to their originals. If the inventory is empty or looks incomplete relative to the stored pages, call again with `force_crawl: true` to re-extract from stored HTML, then re-read the persisted results. Assets added to the source site after the last comparison will not appear until a fresh comparison has run.
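
The read, re-extract, re-read sequence can be sketched as a small client-side helper. This is one possible approach under stated assumptions: `call_tool` stands in for however your client invokes `get_media_list`, the function name `read_media` and the retry settings are hypothetical, and only the `status: "no_media_yet"` value is taken from this page.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical client hook: takes tool arguments, returns the parsed response.
ToolCall = Callable[[Dict[str, Any]], Dict[str, Any]]


def read_media(
    call_tool: ToolCall,
    site_url: str,
    attempts: int = 5,
    delay_s: float = 2.0,
) -> Dict[str, Any]:
    """Read the inventory; if it is empty, enqueue re-extraction once and re-read."""
    result = call_tool({"site_url": site_url})
    if result.get("status") != "no_media_yet":
        return result

    # Enqueue the media_extraction step. This only queues work; it returns
    # queue status, not the refreshed inventory.
    call_tool({"site_url": site_url, "force_crawl": True})

    # Re-read without force_crawl until rows appear or attempts run out.
    for _ in range(attempts):
        time.sleep(delay_s)
        result = call_tool({"site_url": site_url})
        if result.get("status") != "no_media_yet":
            break
    return result
```

The helper enqueues re-extraction at most once and then only performs plain reads, which mirrors the guidance to call the tool again without `force_crawl` once the queued step completes.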

Tips

✓ Media extraction runs automatically as a post-scan stage after page comparisons complete, so you usually do not need `force_crawl`.

✓ `force_crawl` re-extracts from stored page HTML; it does not re-crawl the source site. To pick up newly added assets, run a fresh comparison first.

✓ Pass `job_id` to skip the `site_url` → job lookup and target a specific clone job.

✓ The empty-inventory response (`status: "no_media_yet"`) means the job exists but extraction has not produced data yet. It is distinct from the "no clone job found" response.
