Skip to content

Embedding

Structural contract for embedders (FR-14, FR-15, design ยง3.1). Concrete in-tree provider: MiniLMEmbedder.

Protocol surface

from typing import Literal, Protocol, runtime_checkable

@runtime_checkable
class Embedding(Protocol):
    @property
    def model_id(self) -> str: ...
    @property
    def revision(self) -> str: ...
    @property
    def content_hash(self) -> str: ...
    @property
    def ndims(self) -> int: ...

    async def embed(
        self,
        texts: list[str],
        *,
        kind: Literal["query", "document"],
    ) -> list[list[float]]: ...

The four identity properties feed the FR-8 5-tuple drift gate (model_id, revision, content_hash, ndims, schema_v) written into LanceDB schema metadata at VectorStore.bootstrap() time. Mismatch on re-entry raises IncompatibleEmbeddingHashError.

kind is required by the Protocol for forward-compat with asymmetric models (bge / e5 v5+) -- symmetric models (MiniLM) are free to ignore it, but the parameter is mandatory day one.

MiniLMEmbedder

POC reference implementation backed by sentence-transformers/all-MiniLM-L6-v2. Lives at stargraph.stores.embeddings.

Constants

Constant Value Purpose
MINILM_MODEL_ID sentence-transformers/all-MiniLM-L6-v2 HF model id pin.
MINILM_NDIMS 384 Embedding dimensionality (from model card).
MINILM_SHA256 53aa51172d142c89d9012cce15ae4d6cc0ca6895895114379cacb4fab128d9db Pinned safetensors sha256 (FR-15).
MINILM_MAX_TOKENS 256 Positional-window cap; longer inputs are clipped with a structlog warning.

Constructor

from pathlib import Path
from stargraph.stores import MiniLMEmbedder

embedder = MiniLMEmbedder(
    model_path=None,                    # mode 1 if set
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
    allow_download=True,                # mode 3 if False -> mode 2 semantics
    expected_sha256=MINILM_SHA256,
)
Param Type Default Notes
model_path Path \| None None Mode 1 if set: explicit local directory.
model_id str MINILM_MODEL_ID HF repo id.
revision str "main" HF revision / commit sha.
allow_download bool True False forces cache-only (mode 2 semantics).
expected_sha256 str MINILM_SHA256 Override only if pinning a different revision.

Dependencies

Optional extra: stargraph[skills-rag] (sentence-transformers>=5.0,<6, huggingface_hub>=0.25). The sentence-transformers import is lazy -- it only fires inside __init__ after the safetensors hash check passes.

Three load modes

Mode Trigger Behaviour
1. Explicit path model_path=Path(...) Load directly from the local directory. Used by tests and fully-offline operators who pre-stage the directory. No HF Hub call.
2. HF_HUB_OFFLINE=1 env var set Cache-only via huggingface_hub.snapshot_download(local_files_only=True). No network. Raises if cache is empty.
3. allow_download flag model_path is None allow_download=False matches mode 2 semantics; True (default) permits a network fetch into the HF cache on first use.

SHA-256 verification (FR-15)

After the model directory is resolved, the safetensors weights file (model.safetensors) is hashed and compared against expected_sha256. Drift raises stargraph.errors.EmbeddingModelHashMismatch carrying:

Context key Value
model_id The configured HF model id.
expected_sha256 The pinned hash.
actual_sha256 What we just hashed.
model_path The full safetensors file path.

Token-clip behaviour (MINILM_MAX_TOKENS)

embed() clips inputs to 256 tokens before encoding. Tokens are counted via self._model.tokenizer.encode(text, add_special_tokens=False). Truncation logs a structlog minilm.input_clipped warning with the original token count -- a long document will not silently degrade retrieval quality past the model's positional window.

Async wrapping

SentenceTransformer.encode is sync; MiniLMEmbedder.embed wraps it through asyncio.to_thread so callers stay non-blocking. normalize_embeddings=True is set so the returned vectors are unit-length.

MiniLMEmbedder.fake() / FakeEmbedder

Deterministic, dependency-free POC test embedder. Hashes each input string with sha256, seeds numpy.random with the digest, and emits an L2-normalised vector. content_hash is a stable sentinel ("fake-" + "0" * 59) so the FR-8 drift gate still round-trips through bootstrap. Not for production -- smoke tests only.

embedder = MiniLMEmbedder.fake()  # returns FakeEmbedder

YAML wiring

Embedders are not declared at the IR top level today; the LanceDBVectorStore constructor takes an Embedding instance directly.

Errors raised

Error Raised when
EmbeddingModelHashMismatch safetensors sha256 does not match expected_sha256.
OSError / FileNotFoundError mode 2 cache empty, or mode 1 model_path missing.
(Bubbling) huggingface_hub errors mode 3 network fetch failed.