Embeddings API

Semantic helpers for search, reconciliation, and similarity over text - model-free by default (a hashing embedder), with an opt-in transformers backend (MiniLM via WebGPU) for true synonym matching. For the story and labs, see the Insights guide.

Import

import {
  findSimilar, reconcileLabels, createEmbedder,
  cosineSimilarity, hashEmbed,
} from "@michi-vz/insights/embeddings";
// also re-exported from the package root: import { findSimilar } from "@michi-vz/insights";

`findSimilar` - rank items by meaning

Suppose a dashboard has 8 KPIs and you don't remember the exact name of a series. `findSimilar` ranks every label against a plain-English query so the best match can be highlighted.

The model-free default ranks by shared character n-grams, not by meaning: `customer` finds the customer KPIs, but `money coming in` can't reach `Revenue`, because the two strings have almost no character sequences in common. Use `{ backend: "transformers" }` to search by meaning.

const ranked = await findSimilar("revenue", labels, (s) => s, { backend: "hash" });
// → [{ item, score }, ...] sorted by descending cosine similarity

| Param | Type | What it does |
| --- | --- | --- |
| `query` | `string` | The text to match against. |
| `items` | `T[]` | Candidates to rank. |
| `text` | `(item: T) => string` | Extracts the comparable string from each item. |
| `options` | `EmbedOptions` | `{ backend?, model?, dim? }` - see below. |
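The ranking step can be sketched without the library. The block below is a rough illustration, not the package's implementation: `sketchFindSimilar`, `bigramCounts`, and `sparseCosine` are hypothetical names, and the character-bigram count vector is an assumed stand-in for the real hashing embedder. It mirrors the `(query, items, text)` shape above and shows the `text` extractor working on objects rather than plain strings.

```typescript
type Scored<T> = { item: T; score: number };

// Toy embedding (assumption): counts of lowercase character bigrams.
function bigramCounts(s: string): Record<string, number> {
  const counts: Record<string, number> = {};
  const t = s.toLowerCase();
  for (let i = 0; i + 2 <= t.length; i++) {
    const gram = t.slice(i, i + 2);
    counts[gram] = (counts[gram] ?? 0) + 1;
  }
  return counts;
}

// Cosine similarity over sparse count vectors; 0 when either vector is empty.
function sparseCosine(a: Record<string, number>, b: Record<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const gram of Object.keys(a)) {
    normA += a[gram] * a[gram];
    dot += a[gram] * (b[gram] ?? 0);
  }
  for (const gram of Object.keys(b)) {
    normB += b[gram] * b[gram];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Hypothetical stand-in for findSimilar: score every item, sort descending.
function sketchFindSimilar<T>(
  query: string,
  items: T[],
  text: (item: T) => string
): Scored<T>[] {
  const q = bigramCounts(query);
  return items
    .map((item) => ({ item, score: sparseCosine(q, bigramCounts(text(item))) }))
    .sort((x, y) => y.score - x.score);
}

// Items are objects; the extractor picks the string to compare.
const kpis = [
  { id: "rev", label: "Revenue" },
  { id: "cust", label: "Active customers" },
  { id: "churn", label: "Churn rate" },
];
const rankedKpis = sketchFindSimilar("customer", kpis, (k) => k.label);
```

In this sketch, the query `customer` puts "Active customers" first, while the query `money coming in` shares no bigram with "Revenue" and scores it at zero. That is the limitation of character-overlap matching a model-backed embedder is meant to address.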

`reconcileLabels` - merge messy labels that mean the same thing

Example: three countries reported sales, but three data sources spelled them 10 different ways. Charted raw, each spelling is its own bar, so the totals are split and wrong. Reconciling merges the labels by embedding similarity; the guide's "Certify" step adds a second specialist, a small LLM that confirms each merge and names it.

The same entity often arrives spelled several ways ("United States" / "usa"); grouping by exact match splits it into buckets with wrong totals. `reconcileLabels` embeds each label and greedily clusters by cosine similarity (single-linkage) with a confidence gate, which keeps distinct entities from collapsing just because they are near each other. Sum your series by each group's `name` (the cluster medoid) for clean totals.

const groups = await reconcileLabels(labels, { threshold: 0.7, margin: 0.05 });
// → [{ name, members: [...] }, ...]

| Option | Type | Default | What it does |
| --- | --- | --- | --- |
| `threshold` | `number` | `0.7` (transformers) / `0.6` (hash) | Minimum cosine to merge into a group. |
| `margin` | `number` | `0.05` | Confidence gate: a label merges only when it is at least this much closer to its best group than the next-best. `0` disables. |
| `embedder` | `Embedder` | optional | Reuse a prebuilt embedder instead of creating one. |
| `backend` / `model` / `dim` | `EmbedOptions` | hash | Inherited embedder options. |

The model-free default merges spelling, case, and typo variants offline; `{ backend: "transformers" }` also merges synonyms, abbreviations, and translations. For authoritative canonical names (USA -> United States), pair it with an alias list or an LLM (see the guide's "Certify" recipe).
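The greedy pass with a threshold and a margin gate can be sketched as follows. This is one plausible reading of the description, not the library's code: `sketchReconcile` and `letterCounts` are hypothetical names, the embedding is a toy 26-dimensional letter count, and the sketch only assigns each label to an existing group or opens a new one (it never merges two groups afterwards, which a full single-linkage clustering could do).

```typescript
type Group = { name: string; members: string[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Toy embedding (assumption): counts of the letters a-z, case-insensitive.
function letterCounts(s: string): number[] {
  const v: number[] = [];
  for (let i = 0; i < 26; i++) v.push(0);
  const t = s.toLowerCase();
  for (let i = 0; i < t.length; i++) {
    const k = t.charCodeAt(i) - 97; // 'a' has char code 97
    if (k >= 0 && k < 26) v[k] += 1;
  }
  return v;
}

function sketchReconcile(
  labels: string[],
  embed: (s: string) => number[],
  opts: { threshold: number; margin: number }
): Group[] {
  const clusters: { labels: string[]; vecs: number[][] }[] = [];
  for (const label of labels) {
    const v = embed(label);
    let best = -1;
    let bestSim = -Infinity;
    let secondSim = -Infinity;
    for (let c = 0; c < clusters.length; c++) {
      // Single-linkage: similarity to a group is the max over its members.
      let sim = -Infinity;
      for (const m of clusters[c].vecs) sim = Math.max(sim, cosine(v, m));
      if (sim > bestSim) {
        secondSim = bestSim;
        bestSim = sim;
        best = c;
      } else if (sim > secondSim) {
        secondSim = sim;
      }
    }
    // Margin gate: best group must beat the runner-up by at least `margin`.
    const confident = bestSim - secondSim >= opts.margin;
    if (best >= 0 && bestSim >= opts.threshold && confident) {
      clusters[best].labels.push(label);
      clusters[best].vecs.push(v);
    } else {
      clusters.push({ labels: [label], vecs: [v] });
    }
  }
  // Name each group after its medoid: the member most similar to all members.
  return clusters.map((c) => {
    let name = c.labels[0];
    let bestTotal = -Infinity;
    for (let i = 0; i < c.vecs.length; i++) {
      let total = 0;
      for (const other of c.vecs) total += cosine(c.vecs[i], other);
      if (total > bestTotal) {
        bestTotal = total;
        name = c.labels[i];
      }
    }
    return { name, members: c.labels };
  });
}

const rawLabels = ["Germany", "germany", "Grmany", "France", "FRANCE", "Japan"];
const sketchGroups = sketchReconcile(rawLabels, letterCounts, { threshold: 0.7, margin: 0.05 });
```

With the toy embedding, the six labels fall into three groups. A label that sits exactly between two groups shows the gate at work: with a positive `margin` it opens its own group, and with `margin: 0` it joins the first best match.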

`createEmbedder` / `cosineSimilarity` / `hashEmbed` - the primitives

const embedder = await createEmbedder({ backend: "transformers" }); // falls back to hash if unavailable
const [a, b] = await embedder.embed(["customer", "customers"]);
cosineSimilarity(a, b); // 0..1
hashEmbed("customer", 128); // deterministic char-ngram vector, no model

| Function | Signature | Notes |
| --- | --- | --- |
| `createEmbedder` | `(options?: EmbedOptions) => Promise<Embedder>` | `Embedder` is `{ backend, embed(texts) }`. `backend: "transformers"` lazy-loads MiniLM and falls back to hash if the dep/model is missing. |
| `cosineSimilarity` | `(a: number[], b: number[]) => number` | Standard cosine; `0` for a zero vector. |
| `hashEmbed` | `(text: string, dim?: number) => number[]` | Model-free fuzzy char-ngram embedding (default `dim` 128); makes `customer ~ customers` without any model. |
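To make the idea behind `hashEmbed` and `cosineSimilarity` concrete, here is a minimal sketch of a character n-gram hashing embedder and a cosine function. It is not the library's implementation: the names `sketchHashEmbed` and `sketchCosine`, the trigram size, the space padding, and the rolling hash are all assumptions chosen for illustration.

```typescript
// Illustrative only: n-gram size, padding, and hash function are assumptions.
function sketchHashEmbed(text: string, dim: number = 128, n: number = 3): number[] {
  const vec: number[] = [];
  for (let i = 0; i < dim; i++) vec.push(0);

  // Pad with spaces so word boundaries contribute their own n-grams.
  const padded = " " + text.toLowerCase().trim() + " ";
  for (let i = 0; i + n <= padded.length; i++) {
    const gram = padded.slice(i, i + n);
    // Simple polynomial rolling hash, kept in unsigned 32-bit range.
    let h = 0;
    for (let j = 0; j < gram.length; j++) {
      h = (h * 31 + gram.charCodeAt(j)) >>> 0;
    }
    vec[h % dim] += 1; // each n-gram increments one bucket
  }
  return vec;
}

// Standard cosine; returns 0 when either vector is all zeros.
function sketchCosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

const nearScore = sketchCosine(sketchHashEmbed("customer"), sketchHashEmbed("customers"));
const farScore = sketchCosine(sketchHashEmbed("customer"), sketchHashEmbed("revenue"));
```

Under these assumptions `customer` and `customers` share most of their trigrams, so `nearScore` comes out well above `farScore`. The same input always produces the same vector, and an empty string yields a zero vector with a cosine of `0`.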

`EmbedOptions` = `{ backend?: "hash" | "transformers"; model?: string; dim?: number }`.
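The "lazy-load, then fall back" behaviour described for `backend: "transformers"` follows a common pattern, sketched below. Everything here is hypothetical: `SketchEmbedder`, `ModelLoader`, `sketchCreateEmbedder`, and the trivial fallback vectors are stand-ins, and the real package may structure this differently.

```typescript
// Assumed shape, modelled on the `{ backend, embed(texts) }` description.
type SketchEmbedder = {
  backend: "hash" | "transformers";
  embed(texts: string[]): Promise<number[][]>;
};

// Hypothetical loader: stands in for lazily importing a model runtime.
type ModelLoader = () => Promise<SketchEmbedder>;

// Trivial stand-in for a model-free backend: [length, vowel count].
const fallbackEmbedder: SketchEmbedder = {
  backend: "hash",
  embed: async (texts) =>
    texts.map((t) => [t.length, (t.match(/[aeiou]/gi) ?? []).length]),
};

async function sketchCreateEmbedder(
  backend: "hash" | "transformers",
  loadModel: ModelLoader
): Promise<SketchEmbedder> {
  if (backend === "hash") return fallbackEmbedder;
  try {
    // The loader only runs when the model backend is requested.
    return await loadModel();
  } catch {
    // Dependency or model missing: degrade instead of throwing.
    return fallbackEmbedder;
  }
}

// A loader that always fails, simulating a missing dependency.
const brokenLoader: ModelLoader = async () => {
  throw new Error("model runtime not installed");
};
```

Because a fallback is silent by design, a caller who needs synonym matching can check the returned `backend` field to see which embedder was actually obtained.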
