Skip to content

Embeddings API

Semantic helpers for search, reconciliation, and similarity over text - model-free by default (a hashing embedder), with an opt-in transformers backend (MiniLM via WebGPU) for true synonym matching. For the story and labs, see the Insights guide.

Import

ts
import {
  findSimilar, reconcileLabels, createEmbedder,
  cosineSimilarity, hashEmbed,
} from "@michi-vz/insights/embeddings";
// also re-exported from the package root: import { findSimilar } from "@michi-vz/insights";

findSimilar - rank items by meaning

Try it - type a term and the labels rank by meaning (model-free by default):

A dashboard with 8 KPIs. Don't remember the exact name? Ask in plain English - embeddings rank every series by what your words mean, then highlight the best match.

Model-free ranks by shared letters, so customer finds the customer KPIs - but money coming in can't reach Revenue (no letters in common). Load a model (top-right) to search by meaning.

ts
const ranked = await findSimilar("revenue", labels, (s) => s, { backend: "hash" });
// → [{ item, score }, ...] sorted by descending cosine similarity
ParamTypeWhat it does
querystringThe text to match against.
itemsT[]Candidates to rank.
text(item: T) => stringExtracts the comparable string from each item.
optionsEmbedOptions{ backend?, model?, dim? } - see below.

reconcileLabels - merge messy labels that mean the same thing

Try it - messy, differently-spelled labels collapse into clean groups:

Three countries reported sales, but three data sources spelled them 10 different ways. Reconcile merges by similarity (the embedding model); Certify adds a second specialist - a small LLM that confirms each merge and names it. The result below is shown instantly; switch to Real model to download the model and run it yourself.

10 raw labels — messy, duplicated, wrong totals

Charted raw, each spelling is its own bar - the totals are wrong and split. Step through Reconcile and Certify to fix them.

The same entity often arrives spelled many ways ("United States" / "usa" / "United States"); grouping by exact match splits it into buckets with wrong totals. This embeds each label and greedily clusters by cosine similarity (single-linkage) with a confidence gate, so distinct entities never collapse just by being near. Sum your series by each group's name (the cluster medoid) for clean totals.

ts
const groups = await reconcileLabels(labels, { threshold: 0.7, margin: 0.05 });
// → [{ name, members: [...] }, ...]
OptionTypeDefaultWhat it does
thresholdnumber0.7 (transformers) / 0.6 (hash)Minimum cosine to merge into a group.
marginnumber0.05Confidence gate: a label merges only when it is at least this much closer to its best group than the next-best. 0 disables.
embedderEmbedderoptionalReuse a prebuilt embedder instead of creating one.
backend / model / dimEmbedOptionshashInherited embedder options.

The model-free default merges spelling/case/typos offline; { backend: "transformers" } also merges synonyms, abbreviations, and translations. For authoritative canonical names (USA -> United States), pair it with an alias list or an LLM (see the guide's "Certify" recipe).

createEmbedder / cosineSimilarity / hashEmbed - the primitives

ts
const embedder = await createEmbedder({ backend: "transformers" }); // falls back to hash if unavailable
const [a, b] = await embedder.embed(["customer", "customers"]);
cosineSimilarity(a, b); // 0..1
hashEmbed("customer", 128); // deterministic char-ngram vector, no model
FunctionSignatureNotes
createEmbedder(options?: EmbedOptions) => Promise<Embedder>Embedder is { backend, embed(texts) }. backend: "transformers" lazy-loads MiniLM and falls back to hash if the dep/model is missing.
cosineSimilarity(a: number[], b: number[]) => numberStandard cosine; 0 for a zero vector.
hashEmbed(text: string, dim?: number) => number[]Model-free fuzzy char-ngram embedding (default dim 128); makes customer ~ customers without any model.

EmbedOptions = { backend?: "hash" | "transformers"; model?: string; dim?: number }.

Insights guide

Free and open source. MIT licensed.