Technical Spec
Fish City Visual Catalog – Catalog v1 Specification
1. Purpose
Catalog v1 defines the minimal integration between the Jon Sarkin Omeka-S catalog (https://catalog.jonsarkin.com) and the Qdrant vector database on hyphae. The goal is to enable GPU-backed visual similarity search over artworks using only:
- One image per Omeka item
- Title and identifier metadata
2. Scope
- Omeka is the canonical data store.
- Qdrant holds CLIP image embeddings plus minimal metadata.
- No inferred themes, OCR, or rich text metadata (reserved for Catalog v2+).
3. Components
- Omeka-S (Artwork resource template, class VisualArtwork).
- Qdrant – collection omeka_items.
- omeka-sync job – periodically syncs Omeka → Qdrant.
4. Omeka Requirements
Each item included in Catalog v1 MUST satisfy:
- Resource template: Artwork (Jon Sarkin), class VisualArtwork.
- Fields:
  - Title (string)
  - At least one image media marked as the primary representation.
5. Qdrant Collection Schema
Collection name: omeka_items
- Point ID: omeka_item_id (integer, from the Omeka API).
- Vectors:
  - visual_vec – CLIP image embedding, 512 dimensions, cosine distance.
- Payload fields (JSON object):
  - omeka_item_id (int)
  - title (string)
  - omeka_url (string; public item URL)
  - thumb_url (string; chosen thumbnail/derivative URL)
  - catalog_version (int; constant 1 in this spec)
Additional experimental fields (e.g. year, subjects, ocr_text, text_blob, dominant_color, curator_notes) MAY be present in payloads but are not part of the Catalog v1 contract.
Example payload
{
"omeka_item_id": 1234,
"title": "Untitled blue head",
"omeka_url": "https://catalog.jonsarkin.com/s/item1234",
"thumb_url": "https://catalog.jonsarkin.com/files/large/1234.jpg",
"catalog_version": 1
}
6. Sync Job Behaviour (omeka-sync)
- Fetch all Omeka items using the Artwork template.
- For each item:
  - Resolve omeka_item_id, Title, and the main image URL.
  - Skip items with no valid image.
  - Download the image and run it through the CLIP encoder → 512-dim visual_vec.
  - Upsert into Qdrant:
    - point_id = omeka_item_id
    - Set visual_vec and the payload as above.
- Optionally: mark deleted/hidden Omeka items as soft-deleted.
Design Decisions: Omeka–Qdrant Integration
This section records the current architectural decisions for the Fish City visual catalog integration between Omeka-S and Qdrant.
ADR-001: Point identity and versioning
- Decision: The Qdrant point ID for collection omeka_items is always the Omeka item ID.
  - Qdrant id = Omeka internal item ID (e.g. 1810).
  - Payload duplicates this as omeka_item_id (or item_id) for readability and future migrations.
- Decision: URLs stored in Qdrant are opaque and owned by Omeka.
  - omeka_url and thumb_url are treated purely as references; Qdrant does not host images.
  - Any changes to URL structure are handled by re-syncing from Omeka, not by editing Qdrant directly.
- Decision: catalog_version is monotone.
  - Version values: 1 for Catalog v1 (image-only), 2 for Catalog v2 (human metadata + text embeddings), higher integers for future versions.
  - Items may move from version 1 → 2 → 3, but are never downgraded to a lower version number.
Clients SHOULD rely only on the fields listed in the Catalog v1/v2 specifications; other payload keys are considered internal and may change.
These decisions ensure that Qdrant remains a derived, replaceable index whose records can always be regenerated from Omeka-S as the canonical source.
Fish City Visual Catalog – Catalog v2 Specification
1. Purpose
Catalog v2 extends Catalog v1 by incorporating a small, stable set of human-entered metadata fields from Omeka-S and adding CLIP text embeddings. The goal is to support semantic text queries (e.g., “blue abstract on cardboard”) and simple faceted filters, while keeping metadata requirements realistic for day-to-day cataloging.
2. Scope
- Omeka remains the canonical record.
- Qdrant stores:
  - Image embeddings (visual_vec)
  - Text embeddings (text_vec_clip)
  - A limited, normalized subset of Omeka metadata.
- No LLM-generated themes or OCR yet (reserved for Catalog v3).
3. Additional Omeka Requirements
For items included in Catalog v2, the following fields SHOULD be populated where known:
- Description (free text)
- Subject (keywords / concepts)
- Medium or artMedium
- dateCreated (at least year)
- height and width (with a consistent unit, e.g., cm)
4. Qdrant Collection Schema Changes
Same collection: omeka_items. Existing points from v1 are upgraded in place.
- Vectors:
  - visual_vec – unchanged.
  - text_vec_clip – CLIP text embedding, 512 dimensions, cosine distance.
- Payload additions:
  - year (int, derived from dateCreated where possible)
  - subjects (array of strings; Omeka Subject labels)
  - mediums (array of strings; from Medium / artMedium)
  - dimensions_cm (object, e.g. {"height": 30.5, "width": 22.9})
  - catalog_version (int; set to 2 once enriched)
Example payload (v2)
{
"omeka_item_id": 1234,
"inventory_id": "JS-2025-001",
"title": "Untitled blue head",
"omeka_url": "https://catalog.jonsarkin.com/s/item1234",
"thumb_url": "https://catalog.jonsarkin.com/files/large/1234.jpg",
"year": 1994,
"subjects": ["portrait", "blue", "abstract"],
"mediums": ["acrylic", "cardboard"],
"dimensions_cm": {"height": 30.5, "width": 22.9},
"catalog_version": 2
}
5. Text Embedding Construction
For each item, text_vec_clip is computed from a single concatenated string:
TEXT_INPUT =
Title + ". " +
Description + " " +
"Subjects: " + Subject-list + ". " +
"Medium: " + Medium-list + ". " +
"Year: " + Year
6. Sync / Upgrade Behaviour
- Run the Catalog v1 sync if needed (ensuring all items exist in Qdrant).
- For each eligible item:
  - Fetch the extended metadata fields from Omeka.
  - Normalize year and dimensions into the v2 payload format.
  - Build TEXT_INPUT as above and encode it via the CLIP text encoder.
  - Upsert into Qdrant:
    - Update the payload with the new fields and set catalog_version = 2.
    - Set text_vec_clip for that point.
7. Query Semantics
- Text query: encode the user text with the CLIP text encoder and search text_vec_clip (cosine) in omeka_items.
- Image query: unchanged from Catalog v1 (search visual_vec).
- Optional filters:
  - By year or year ranges
  - By mediums (exact-match string filters)
  - By subjects (exact-match string filters)
Catalog v2: CLIP Search API
Catalog v2 introduces a GPU-backed CLIP search sidecar (clip-api) that provides read-only semantic search over the Omeka-S catalog using the Qdrant omeka_items collection.
Base URL
All endpoints are served from the internal CLIP API service and reverse-proxied into the Omeka site:
BASE_URL = https://catalog.jonsarkin.com/clip-api
Authentication
The API is read-only and intended to be consumed by Omeka theme/plugins and internal tools. Public clients should call Omeka endpoints that wrap this API, not clip-api directly.
GET /healthz
Lightweight health check for the CLIP API and Qdrant.
GET ${BASE_URL}/healthz
Response 200 OK
{
"status": "ok",
"qdrant": "ok",
"model": "ViT-B-32 laion2b_s34b_b79k"
}
POST /search/text
Semantic text search over artworks. The query string is encoded with the CLIP text encoder and searched against the text_vec_clip vector in omeka_items, per the Catalog v2 query semantics.
POST ${BASE_URL}/search/text
Content-Type: application/json
Request body
{
"query": "blue abstract",
"limit": 20,
"filters": {
"year": { "gte": 1990, "lte": 2025 },
"subjects": ["cactus", "portrait"]
}
}
- query (string, required): free-form text.
- limit (int, optional; default 20, max 100).
- filters (object, optional): maps directly to Qdrant payload filters over fields such as year, subjects, collection, etc.
Response 200 OK
{
"results": [
{
"omeka_item_id": 71,
"score": 0.923,
"payload": {
"title": "Super artist 33",
"thumb_url": "https://catalog.jonsarkin.com/files/original/...",
"collection": "omeka",
"year": 2025
}
}
]
}
- score: similarity score in [0, 1]; higher = more similar.
- payload: a subset of the Qdrant payload, returned for convenience; Omeka remains the canonical source of metadata.
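One plausible translation of the filters object into a Qdrant filter body is sketched below. The actual clip-api mapping is not specified here, so treat the rules (dict value → range condition, list → match-any, scalar → exact match) as illustrative.

```python
def filters_to_qdrant(filters: dict) -> dict:
    """Translate the API 'filters' object into a Qdrant filter JSON body.

    {"gte": ..., "lte": ...} dicts become range conditions, lists become
    match-any conditions, and scalars become exact-match conditions.
    """
    must = []
    for key, value in filters.items():
        if isinstance(value, dict):
            must.append({"key": key, "range": value})
        elif isinstance(value, list):
            must.append({"key": key, "match": {"any": value}})
        else:
            must.append({"key": key, "match": {"value": value}})
    return {"must": must}
```

For the request body shown above, this yields a year range condition plus a match-any condition over subjects.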
POST /search/similar
“More like this” search using an existing artwork as the anchor. Uses Qdrant recommend on visual_vec.
POST ${BASE_URL}/search/similar
Content-Type: application/json
Request body
{
"omeka_item_id": 71,
"limit": 20,
"filters": {
"collection":