get_media_list

Read persisted media inventory; trigger a re-extraction pass over stored page HTML

Overview

Returns the media asset inventory that the post-scan pipeline has already extracted from stored HTML and persisted to the database for a clone job. This tool does not live-fetch the source site. Set `force_crawl=true` to enqueue the same `media_extraction` pipeline step that the dashboard "Rescan Media" button uses. That step re-extracts from stored page HTML; it never re-crawls the source.

How It Works

  1. Resolves a clone job for the caller from `job_id` (preferred) or by matching `site_url`'s hostname against `clone_job.source_domain` for the current user.
  2. Default mode (no `force_crawl`): reads the persisted media inventory for the resolved job from the `media_asset` table via `getPersistedMediaForJob`.
  3. If no job is resolved, returns a clear "no clone job found" response — no network request is made to the site.
  4. If a job exists but extraction has not produced any rows yet, returns a structured `no_media_yet` response telling you to call again with `force_crawl: true` (or wait for the post-comparison pipeline to run).
  5. When `force_crawl=true`, enqueues the `media_extraction` pipeline step for the resolved job (same call path as the dashboard rescan button) and returns the queue status. Re-extraction reads stored page HTML — it does NOT re-crawl the source site.
  6. Once the queued step completes, call this tool again without `force_crawl` to read the refreshed persisted results.
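
The steps above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the tool's implementation: `CloneJob`, `resolve_job`, the `queue` list, and the `"no_job_found"`, `"queued"`, and `"ok"` status strings are hypothetical stand-ins. Only `no_media_yet`, `media_extraction`, and `source_domain` come from the description above. The sketch also assumes that an unmatched `job_id` does not fall back to the `site_url` lookup.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse


@dataclass
class CloneJob:
    job_id: str
    user_id: str
    source_domain: str
    media: list = field(default_factory=list)  # stand-in for media_asset rows


def resolve_job(jobs, user_id, site_url=None, job_id=None) -> Optional[CloneJob]:
    """job_id takes precedence; otherwise match the site_url hostname."""
    mine = [j for j in jobs if j.user_id == user_id]
    if job_id is not None:
        # Assumption: no fallback to site_url when job_id does not match.
        return next((j for j in mine if j.job_id == job_id), None)
    host = urlparse(site_url or "").hostname  # parsed locally, no network request
    matches = [j for j in mine if j.source_domain == host]
    # Assumption: jobs are ordered oldest -> newest, so the last match is most recent.
    return matches[-1] if matches else None


def get_media_list(jobs, queue, user_id, site_url=None, job_id=None, force_crawl=False):
    job = resolve_job(jobs, user_id, site_url, job_id)
    if job is None:
        return {"status": "no_job_found"}  # hypothetical status name
    if force_crawl:
        # Enqueue re-extraction over stored HTML; results are read by a later call.
        queue.append(("media_extraction", job.job_id))
        return {"status": "queued", "job_id": job.job_id}  # hypothetical shape
    if not job.media:
        return {"status": "no_media_yet", "job_id": job.job_id}
    return {"status": "ok", "job_id": job.job_id, "media": job.media}
```

Note that the `force_crawl` branch returns only a queue status; reading the refreshed inventory always takes a second call without `force_crawl`.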

Input Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `site_url` | string | required | URL of a clone job's source site. Used to look up the most recent clone job for the current user when `job_id` is not provided. No network request is made to this URL. |
| `job_id` | string | optional | Clone job ID to read persisted media results from. Takes precedence over `site_url`-based lookup. |
| `force_crawl` | boolean | optional | If true, enqueues the `media_extraction` pipeline step for the resolved job. Requires a resolvable clone job. Re-extracts from stored page HTML; does not re-crawl the source site. |
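
As a rough illustration of the parameters above, the helper below assembles an argument payload for a call. The function name `build_get_media_list_args` is hypothetical, and how the payload is actually sent depends on your client. Only the three parameter names and their required/optional status are taken from the table.

```python
from typing import Any, Optional


def build_get_media_list_args(
    site_url: str,
    job_id: Optional[str] = None,
    force_crawl: bool = False,
) -> dict[str, Any]:
    """Build the argument object for get_media_list (hypothetical helper)."""
    if not isinstance(site_url, str) or not site_url.strip():
        raise ValueError("site_url is required")
    args: dict[str, Any] = {"site_url": site_url.strip()}
    if job_id:
        args["job_id"] = job_id  # takes precedence over the site_url lookup
    if force_crawl:
        args["force_crawl"] = True  # omit entirely for the default read mode
    return args
```

For example, `build_get_media_list_args("https://example.com", job_id="job_123")` targets one specific job, where `job_123` is a made-up ID.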

What You Get Back

Example Use Case

After a clone comparison finishes, call get_media_list with the source site_url to see every image, PDF, video, and font the pipeline catalogued, grouped by type, with thumbnail variants linked to their originals. If the inventory looks incomplete or stale, call again with force_crawl: true to re-extract from stored HTML, then call once more without force_crawl to read the refreshed results. Assets added to the source site after the last scan appear only after a fresh comparison, because re-extraction never re-crawls the source.
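
The read, rescan, re-read sequence might look like the sketch below. It assumes a caller-supplied `call_tool(name, args)` function that invokes the tool and returns its response as a dict; that function, the retry count, and the delay are all hypothetical. The only response field relied on is `status: "no_media_yet"`, as described in the Tips.

```python
import time
from typing import Callable

ToolCall = Callable[[str, dict], dict]  # hypothetical client signature


def read_media_with_rescan(
    call_tool: ToolCall,
    site_url: str,
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Read the inventory; if it is empty, queue one rescan and poll for results."""
    result = call_tool("get_media_list", {"site_url": site_url})
    if result.get("status") != "no_media_yet":
        return result  # inventory present, or no clone job found

    # Queue re-extraction once. This call returns queue status, not media.
    call_tool("get_media_list", {"site_url": site_url, "force_crawl": True})

    for _ in range(attempts):
        sleep(delay_s)
        result = call_tool("get_media_list", {"site_url": site_url})
        if result.get("status") != "no_media_yet":
            break
    return result  # may still be no_media_yet if the step has not finished
```

Polling is a design choice of this sketch; the source only says to call again once the queued step completes.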

Tips

  - Media extraction runs automatically as a post-scan stage after page comparisons complete, so you usually do not need `force_crawl`.
  - `force_crawl` re-extracts from stored page HTML; it does not re-crawl the source site. To pick up newly added assets, run a fresh comparison first.
  - Pass `job_id` to skip the site_url → job lookup and target a specific clone job.
  - The empty-inventory response (`status: "no_media_yet"`) means the job exists but extraction has not produced data. It is distinct from the "no clone job found" response.

Related Tools