Migrating from Fastembed ONNX to Hugging Face TEI
The specific trade-offs, the retrieval quality delta on an institutional corpus, and why the infra complexity was worth it.
Fastembed ONNX on CPU is a reasonable starting point for a RAG system. Easy to install, model catalog covers the obvious choices, and you can have a working embedding pipeline in an afternoon without standing up extra infrastructure. For VeriCite's early proof-of-concept it was the right call.
It stopped being the right call when retrieval quality became the constraint. Fastembed on CPU could not serve paraphrase-multilingual-MiniLM-L12-v2 at the throughput we needed on an institutional corpus, and the model catalog that is practical to serve through it doesn't include a cross-encoder reranker like BAAI/bge-reranker-v2-m3 at all. Reranking matters disproportionately on the kind of corpus VeriCite deals with: documents where lexical overlap between a query and the relevant passage is low, and a bi-encoder's cosine similarity is a noisy signal.
Hugging Face TEI gives you both. Purpose-built inference server for text embeddings and rerankers, batching tuned for GPU throughput. The cost is operational: GPU nodes to run, batching configuration to manage, k8s manifests to ship for a TEI sidecar alongside Qdrant. Real infra complexity, and it doesn't pay for itself unless retrieval quality is actually on the critical path.
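For a sense of what "both" looks like from the application side, here is a minimal client sketch, not VeriCite's actual code. It assumes two TEI instances, since a TEI process serves a single model: one for the embedding model and one for the reranker. The addresses, ports, and function names are hypothetical; the /embed and /rerank routes and their JSON shapes are the ones TEI exposes. The post parameter exists only so the HTTP call can be swapped out in tests.

```python
import json
from typing import Callable, List, Tuple
from urllib import request

# Hypothetical addresses: one TEI instance per model (embedder and reranker).
EMBED_URL = "http://localhost:8080"
RERANK_URL = "http://localhost:8081"


def post_json(url: str, payload: dict):
    """POST a JSON payload and decode the JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def embed(
    texts: List[str],
    base_url: str = EMBED_URL,
    post: Callable = post_json,
) -> List[List[float]]:
    """Call the /embed route; the response holds one vector per input text."""
    return post(f"{base_url}/embed", {"inputs": texts})


def rerank(
    query: str,
    passages: List[str],
    base_url: str = RERANK_URL,
    post: Callable = post_json,
) -> List[Tuple[str, float]]:
    """Call the /rerank route and return (passage, score) pairs, best first."""
    results = post(f"{base_url}/rerank", {"query": query, "texts": passages})
    # Each result carries the index of the passage it scores; sort explicitly
    # rather than relying on the order of the response.
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(passages[r["index"]], r["score"]) for r in ranked]
```

Batching happens on the server, so the client can send a list of texts per request and leave the GPU-side batch sizing to the TEI configuration.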
In this case it was. Retrieval quality on the institutional corpus improved measurably after the migration, most visibly on queries whose vocabulary diverged from the relevant document's. That is the problem the cross-encoder solves: it sees the full (query, passage) pair instead of comparing independent embeddings. Bi-encoder for the neighbourhood, reranker for the door.
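The two-stage shape can be sketched in a few lines. This is an illustration under stated assumptions rather than the production pipeline: in practice the first stage is a vector search in Qdrant, not an in-memory sort, and pair_score stands in for a call to the reranker. The function names and the k and top_n parameters are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    passages: List[str],
    passage_vecs: List[Sequence[float]],
    pair_score: Callable[[str, str], float],
    k: int = 3,
    top_n: int = 1,
) -> List[str]:
    # Stage 1 (bi-encoder): query and passages were embedded independently,
    # so comparing against the whole corpus is cheap. Keep the k nearest.
    by_cosine = sorted(
        range(len(passages)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )
    candidates = by_cosine[:k]

    # Stage 2 (cross-encoder): score each (query, passage) pair jointly,
    # but only for the k candidates, because this step is the expensive one.
    reranked = sorted(
        candidates,
        key=lambda i: pair_score(query, passages[i]),
        reverse=True,
    )
    return [passages[i] for i in reranked[:top_n]]
```

The reranker can only reorder what the first stage hands it, so k bounds recall: a relevant passage that falls outside the top k by cosine similarity never reaches the cross-encoder.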
Infra overhead is now a fixed cost. The model options going forward are wide open.