# Abhishek Kaushik — akaushik.org (full content)

> Single-file concatenation of all portfolio content. Generated 2026-04-20T20:41:56.006Z.

This file follows the [llms-full.txt](https://llmstxt.org) convention. Section wrappers (`<about>`, `<services>`, `<case-study>`, `<post>`) let agents slice by section without parsing HTML.

---

<about>

> The short version

I'm Abhishek — an AI engineer who builds agent systems that businesses can *actually* run.

For the last six years I've been shipping software — AI and platform engineering for the past stretch of it, most recently on the agents framework behind Bluehost's AI products. Outside of that, I'm building [Neev](https://akaushik.org/work/neev), a modular operations platform for Indian MSMEs starting with textile distribution — because the most exciting place for AI right now isn't another consumer chatbot. It's the **63 million businesses** still running on WhatsApp messages and paper ledgers.

My way into AI was Andrej Karpathy's *Zero to Hero* series. I didn't just watch it — I built micrograd and makemore from scratch to understand what I was watching. That habit, going to the foundations rather than the abstractions, is how I work on most things. Including this site.

- **Now** — Bluehost · agents framework backend
- **Building** — Neev · MSME operations platform
- **Co-founder / CTO** — VeriCite · curat.money
- **Writes** — agent systems · AI for traditional business

</about>

---

<services>

## Services

Three engagement shapes. In / Out / Fit called out explicitly so you can tell whether a conversation is worth starting.

### S/01 · Agent MVP — 4–6 weeks

A working agent system, in production, doing one thing your business needs done. Tool-use, memory, observability from day one.

- **In** — Problem framing, agent design, production deploy, handoff docs
- **Out** — Brand work, non-AI product surface area
- **Fit** — Teams with a clear repeated workflow and real data

### S/02 · AI enablement for an MSME operation — 8–12 weeks

For businesses that run on WhatsApp and spreadsheets. A layer that removes typing and memory load — not one that replaces judgment.

- **In** — Operator interviews, a narrow AI surface, training & rollout
- **Out** — Full ERP replacement, ledger migration
- **Fit** — Owner-operators who want software that survives the day

### S/03 · Production-hardening a POC — 3–6 weeks

You have a prototype that works in a demo. It needs to work on a Tuesday at 3pm with 200 users. I take it the rest of the way.

- **In** — Observability, eval harness, rate limits, error budgets, on-call runbook
- **Out** — New features during hardening
- **Fit** — Teams who've shipped a POC and lost a night's sleep to it

</services>

---

# Case studies

<case-study slug="neev">

# Neev

> Bringing AI to an industry that still runs on WhatsApp.

## Context

India has roughly 63 million MSMEs. Most of them (textile traders, regional distributors, small manufacturing units) operate on WhatsApp conversations, paper ledgers, a Tally install, and a few shared spreadsheets. This isn't backwardness. It's a pragmatic equilibrium built on trust relationships, cash-flow timing, and a low tolerance for software that doesn't survive contact with the day.

Any AI product aimed at this audience has to earn its place under that daily reality, not above it. That means the platform has to be useful before it's smart — and the smart parts have to be invisible enough that an operator who has never touched AI before doesn't have to think about them.

## Problem

Distributors in the textile supply chain lose time and money at very specific seams: order capture (still mostly over WhatsApp), ledger reconciliation (still mostly by hand), GST-compliant invoicing (patchy), and memory of who owes whom what (fragile, relationship-dependent). Generic ERPs don't fit because they assume a workflow that doesn't exist in this business yet. Pure chatbot layers don't fit because they solve a symptom, not the operations below it.

The actual problem is building a platform with enough structural honesty that a real MSME can adopt it incrementally, then layering AI in where it removes friction rather than demanding new behavior.

## Approach

The architecture reflects the operating environment, not a tech-stack wishlist. Four principles shaped it.

- **Modular monolith with multi-tenant discipline.** One deployable, clean module boundaries, `tenant_id` enforced at every persistence boundary (sketched after this list). This trades early microservice flexibility for operational sanity at the scale MSMEs actually operate. The wrong distributed system on a two-person team costs far more than modularizing later.
- **Textile distribution as beachhead, not as a limit.** The data models are designed so the platform generalizes to adjacent MSME verticals. The first release narrows to where there is direct operator input and real feedback.
- **AI where it reduces typing and memory load, not where it replaces judgment.** Order capture, reconciliation assistance, and narrative summarization over the operator's own ledger — not autonomous decisioning. The operator stays in control of the outcome.
- **Engineering process as proof of intent.** PRD, ADRs, roadmap, changelog, process-gate scripts. Not process theatre. Process that earns trust from any future technical buyer who looks under the hood.
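
A minimal sketch of that first principle, with invented names (the `orders` table, `listOrders`) rather than Neev's actual schema; the point is that the persistence layer simply has no entry point that omits the tenant:

```ts
// Sketch: tenant_id enforced at the persistence boundary.
// All names here are illustrative, not Neev's real schema.
import { Pool } from "pg";

const pool = new Pool(); // connection config from environment

interface OrderRow {
  id: string;
  tenant_id: string;
  party_id: string;
  status: string;
}

// The only way to read orders. There is deliberately no variant
// that accepts "all tenants"; the WHERE clause is not optional.
async function listOrders(tenantId: string): Promise<OrderRow[]> {
  const { rows } = await pool.query<OrderRow>(
    "SELECT id, tenant_id, party_id, status FROM orders WHERE tenant_id = $1",
    [tenantId],
  );
  return rows;
}

// Writes stamp the tenant explicitly rather than trusting the caller's row.
async function createOrder(tenantId: string, partyId: string): Promise<string> {
  const { rows } = await pool.query<{ id: string }>(
    "INSERT INTO orders (tenant_id, party_id, status) VALUES ($1, $2, 'draft') RETURNING id",
    [tenantId, partyId],
  );
  return rows[0].id;
}
```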

## What shipped

The first deployable version of the Neev platform is live. It ships a multi-tenant foundation with strict `tenant_id` invariants throughout the persistence layer, core domain models for the textile distribution workflow (orders, ledger, parties, items), and the first AI-assisted surfaces in the operator workflow. Public engineering artifacts (README, PRD, architectural decision records) are available alongside the build.

The stack is Next.js on the front end against a Postgres-backed monolith, with Neev's own process-gate scripts keeping the repo honest across builds.

*(Abhishek to fill: first tenant onboarded, measured effect on at least one workflow, specific milestones from PRs shipped to date.)*

## Trade-offs

**Monolith over microservices.** Deliberately. A small team debugging a distributed tracing problem while onboarding a first paying customer is a bad place to be. The module boundaries are clean; the leap to separate services, if it ever makes sense, is a refactor, not a rebuild.

**Vertical-first over horizontal-first.** Textile distribution is the wedge. A horizontal "MSME platform" from day one would have drifted into genericism and served no vertical well. The constraint is the product.

**Owner-operator UX over admin-console UX.** The primary user sits on a chair in a shop, often on a phone. Desktop-dense dashboards aren't the frame. The UI is built around how a distributor actually moves through a working day.

## Honest scope

*(Abhishek to fill: name what's live vs. what's in progress, and where real operators are vs. aren't using it yet. Resist the urge to overstate — understated and true beats polished and inflated for this audience.)*

> The engineering artifacts are real and reviewable. The operator validation is in progress. That's the honest state of a system built correctly from the start, not retrofitted after the fact.

## What I'd do next

*(Placeholder — fill after v1 stabilizes. Candidates: vernacular-language input for WhatsApp order capture; offline-first mobile operator app; reconciliation assistant as the first full agent surface with audit-grade output.)*

- **Role** — Co-founder & CTO — product, architecture, build
- **Year** — 2026–now
- **Stack** — Next.js, Postgres, multi-tenant monolith
- **Evidence of** — MSME depth · systems & product discipline
- **Canonical** — https://akaushik.org/work/neev

</case-study>

---

<case-study slug="vericite">

# VeriCite

> A retrieval stack an institution can actually trust with its own words.

## Context

Institutional RAG is a different problem from consumer RAG. A consumer chatbot can miss the right passage and the user shrugs and rephrases. An institution can't. When a knowledge system surfaces a passage, the institution will be asked why that passage appeared: by an auditor, a regulator, or a faculty committee. The answer has to hold. The embedding model, the reranker, the chunking strategy, the tenancy model — every layer is load-bearing in a way it simply isn't when the stakes are a search result.

VeriCite is a multi-tenant institutional RAG platform. The product surface runs on Vercel. The retrieval pipeline behind it has to earn the trust that the product surface asks for.

## Problem

I started the pipeline on Fastembed running ONNX models on CPU. That's a reasonable place to start: low infrastructure overhead, quick to iterate, and good enough to validate the retrieval idea. But it imposed real limits that compounded as the product matured.

First, the embedding-model catalog available through Fastembed's ONNX path is narrower than what HF TEI can serve. I needed `paraphrase-multilingual-MiniLM-L12-v2` as the primary embedder and `BAAI/bge-reranker-v2-m3` as the cross-encoder reranker — both worth running correctly, and the throughput curve on CPU-ONNX gets expensive before it gets good. Second, batching behavior and GPU-aware scheduling are first-class in TEI; bolting them onto a CPU-ONNX path is the wrong direction. Third, the system needed to stay defensibly multi-tenant from the start, with identity enforced at the query path, not added as a filter at read time.

## Approach

**Move embedding and reranking to Hugging Face Text Embeddings Inference (TEI).** TEI is purpose-built for serving transformer-based embedders and rerankers: it handles dynamic batching, GPU allocation, and model-specific optimizations that Fastembed on CPU can't match. I migrated the pipeline to TEI — it wasn't a drop-in swap. I re-validated retrieval quality on the institution's corpus, reconciled output format differences, and updated the upstream chunking pipeline to feed TEI's batch interface correctly.
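
For context, TEI's HTTP surface is small: embedding is a single `/embed` route that accepts batched inputs. A minimal client sketch, with the service URL as a deployment assumption rather than VeriCite's actual topology:

```ts
// Sketch: batched embedding against a TEI instance.
// TEI_URL is an assumption about deployment, not VeriCite's endpoint.
const TEI_URL = "http://tei-embed:8080";

async function embed(texts: string[]): Promise<number[][]> {
  const res = await fetch(`${TEI_URL}/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ inputs: texts }), // TEI batches server-side
  });
  if (!res.ok) throw new Error(`TEI embed failed: ${res.status}`);
  return res.json(); // one vector per input text
}
```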

**Separate the embedding and reranking paths.** `paraphrase-multilingual-MiniLM-L12-v2` answers the first question: is this chunk plausibly relevant? `BAAI/bge-reranker-v2-m3` answers the harder one: of the plausibly relevant chunks, which ones actually are? Keeping these as independent services means each can scale, be swapped, or be versioned without destabilizing the other.
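
The reranking path, sketched the same way: a separate TEI deployment (the `tei-rerank` hostname is hypothetical) exposing `/rerank`, which scores full (query, passage) pairs:

```ts
// Sketch: reranking as a separate TEI service from the embedder,
// so each can be scaled or versioned independently.
const RERANK_URL = "http://tei-rerank:8080"; // hypothetical service name

interface RerankResult {
  index: number; // position in the input texts array
  score: number; // cross-encoder relevance score
}

async function rerank(query: string, texts: string[]): Promise<RerankResult[]> {
  const res = await fetch(`${RERANK_URL}/rerank`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, texts }),
  });
  if (!res.ok) throw new Error(`TEI rerank failed: ${res.status}`);
  return res.json(); // scored (index, score) pairs, sorted by the caller
}
```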

**Qdrant as the vector store.** Per-tenant filtering is enforced at the retrieval layer via Qdrant's payload filters, not added on top afterward. This is the correct place to enforce tenancy — letting vectors cross tenant boundaries at retrieval and filtering later is a design flaw, not a trade-off.
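
A sketch of what enforcing tenancy at the retrieval layer looks like with the Qdrant JS client; collection and payload field names are illustrative:

```ts
// Sketch: tenancy enforced inside the vector search itself.
// Collection and payload field names are illustrative.
import { QdrantClient } from "@qdrant/js-client-rest";

const qdrant = new QdrantClient({ url: "http://qdrant:6333" });

async function searchChunks(tenantId: string, queryVector: number[]) {
  return qdrant.search("chunks", {
    vector: queryVector,
    limit: 20,
    // The filter travels with the query; vectors from other tenants
    // are never candidates, rather than being filtered after retrieval.
    filter: {
      must: [{ key: "tenant_id", match: { value: tenantId } }],
    },
  });
}
```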

**Ory for identity.** Institutional tenants arrive with non-trivial auth requirements: SSO, provisioned users, role hierarchies. Ory handles that surface area properly. A lightweight JWT library would have required rebuilding it.

**Monorepo discipline.** The codebase lives in a monorepo with four top-level namespaces: `apps/` for product surfaces, `packages/` for shared retrieval libraries and utilities, `infra/` for Terraform and cloud configuration, and `k8s/` for Kubernetes manifests. This layout lets the serving infrastructure, the retrieval library, and the Vercel-deployed product surface evolve on independent cadences without accumulating drift between them.

## What shipped

- Production-ready institutional RAG stack running `paraphrase-multilingual-MiniLM-L12-v2` for embedding and `BAAI/bge-reranker-v2-m3` for reranking, served via HF TEI.
- Multi-tenant retrieval with tenant-scoped Qdrant collections and payload-level filters enforced at query time.
- Identity layer built on Ory, covering SSO and provisioned-user workflows for institutional customers.
- Monorepo with CI/CD pipelines, Kubernetes manifests under `k8s/`, and the product web surface deployed on Vercel.
- Completed migration from Fastembed ONNX to TEI, including corpus re-validation and batch pipeline updates.

*(Abhishek: confirm current deployment status and any institution-specific retrieval quality numbers you can share publicly.)*

## Trade-offs

**TEI over Fastembed.** I chose more infrastructure to operate — a running TEI service with GPU access rather than an in-process ONNX runtime. The payoff is a wider model catalog, correct batching behavior, and a throughput curve that doesn't break before you want it to. For a consumer prototype, Fastembed is fine. For an institutional product where retrieval quality is the core claim, the infrastructure cost is the right trade.

**Qdrant over a Postgres vector extension.** I chose a purpose-built vector store, which adds an extra service to operate and monitor. The operational overhead is real. What it buys is a retrieval layer designed for the access patterns a RAG pipeline actually has — payload filtering, named collections, and a query API that doesn't require thinking around a general-purpose database's constraints.

**Ory over a lightweight auth library.** The surface area is larger and the learning curve is steeper. Institutional identity requirements (provisioned users, SSO, org-level RBAC) make that surface area necessary — not incidental. A lightweight library would have deferred the problem, not solved it.

## Honest scope

*(Abhishek: confirm how much of this is your individual design/implementation vs. team/contracted work; list explicitly what's yours to claim.)*

- **Role** — Co-founder & CTO — retrieval pipeline
- **Year** — 2026–now
- **Stack** — HF TEI, Qdrant, Ory, Kubernetes, Vercel
- **Evidence of** — Institutional AI-systems sophistication
- **Canonical** — https://akaushik.org/work/vericite

</case-study>

---

<case-study slug="bluehost-agents">

# Bluehost · agents framework

> The foundational platform behind Bluehost's agentic AI products.

Employer work under scope review. The shape of the contribution — a hand in maintaining and continuously improving the foundational platform behind Bluehost's agentic AI products — is real; the specifics are confidential. Available to discuss in conversation: [hello@akaushik.org](mailto:hello@akaushik.org).

- **Role** — Platform engineer · ongoing
- **Year** — 2025–now
- **Stack** — Agent runtime, tool-calling, observability
- **Evidence of** — Operating at scale · team context
- **Canonical** — https://akaushik.org/work/bluehost-agents

</case-study>

---

<case-study slug="curat-money">

# curat.money

> A fair-comparison tool for crypto cards, built like a real product.

## Context

Consumer comparison sites for niche financial products tend to fail in the same direction: they publish data that is shallow, stale, and optimized for affiliate conversion rather than accuracy. The problem is structural. Provider data is scattered across marketing pages with inconsistent vocabulary, buried disclosures, and a strong incentive to make trade-offs opaque. A comparison tool that doesn't account for this isn't neutral — it inherits the provider's framing.

curat.money is built on a different premise: that a rigorous, source-checked comparison platform, with custody status made explicit and country coverage reflected honestly, is worth more than yet another shallow aggregator. The product serves people evaluating crypto cards — cards that allow holders to spend against a crypto balance or crypto collateral. The engineering problem is how to keep the data accurate and the platform operationally real at a scale where manual curation breaks down.

## Problem

Card data is spread across provider sites that each use their own vocabulary and their own level of willingness to surface trade-offs. Users need to filter by country availability, custody model, supported assets, fees, and rewards — and they need to trust the filter. A comparison that silently lags a provider update or omits a custody distinction is worse than no comparison, because it gives false confidence.

Beyond the data quality problem, the platform has to be operationally real: multi-environment hygiene so local development doesn't corrupt production state, a K8s deployment that matches where the product is heading rather than where it started, role-based access so internal operators and end users see the right surfaces, and a build pipeline that doesn't rot between runs.

## Approach

The data spine is a scrape-normalize-verify loop. `scrape_and_update_prod.py` pulls provider data on a repeatable schedule; `match_cards.py` normalizes the raw results to a canonical schema; `custody_scrape_results_local.json` captures the local scrape state used for development and diff verification. Custody status is not derived from marketing copy — it is checked explicitly before a record is marked production-ready. The pipeline is designed to be re-run cleanly, so a failed scrape degrades gracefully rather than publishing partial data.
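
The production scripts are Python; here is the shape of the normalize-and-verify step sketched in TypeScript with an invented canonical schema, to make the "custody is never derived from marketing copy" rule concrete:

```ts
// Sketch of the normalize step, in TypeScript for illustration;
// the production pipeline is Python. Schema fields are invented.
type Custody = "custodial" | "non-custodial" | "unverified";

interface CanonicalCard {
  provider: string;
  name: string;
  countries: string[];
  custody: Custody;
  productionReady: boolean; // only true once custody is explicitly checked
}

function normalize(raw: Record<string, unknown>): CanonicalCard {
  return {
    provider: String(raw["provider"] ?? "").trim().toLowerCase(),
    name: String(raw["card_name"] ?? "").trim(),
    countries: Array.isArray(raw["countries"]) ? raw["countries"].map(String) : [],
    custody: "unverified", // never derived from marketing copy
    productionReady: false,
  };
}

// A record only becomes production-ready through an explicit custody check,
// so a failed or partial scrape degrades to "unverified", never to a guess.
function markVerified(
  card: CanonicalCard,
  custody: Exclude<Custody, "unverified">,
): CanonicalCard {
  return { ...card, custody, productionReady: true };
}
```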

Multi-role RBAC gates what internal operators, external contributors, and end users can do. This matters more than it sounds: a comparison platform that accepts external data contributions without access control is one bad merge away from compromised trust signals.
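
A toy version of that gate, with roles and permission names invented for illustration:

```ts
// Sketch: role-gated actions. Roles and permissions are illustrative.
type Role = "operator" | "contributor" | "viewer";

const PERMISSIONS: Record<Role, ReadonlySet<string>> = {
  operator: new Set(["card:read", "card:write", "card:publish"]),
  contributor: new Set(["card:read", "card:write"]), // writes land as drafts
  viewer: new Set(["card:read"]),
};

function can(role: Role, action: string): boolean {
  return PERMISSIONS[role].has(action);
}

// e.g. a contributor's edit never reaches the public table directly:
// can("contributor", "card:publish") === false
```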

The production deployment runs on Kubernetes. `cloudbuild-web.yaml` drives the Cloud Build CI/CD path — the build is observable, auditable, and not a one-off shell script that only the person who wrote it knows how to run. Developer experience is organized with Make-based targets: the kind of DX investment that reduces cognitive overhead across months of iteration, not just the first week.

## What shipped

A data pipeline producing normalized, verified card records from live provider sources. A multi-environment web product with role-based access across internal and public surfaces. A production K8s deployment with a Cloud Build pipeline and environment parity between local and production. The comparison table reflects custody status, country coverage, and card features from a source-checked record set rather than manually maintained copy.

## Trade-offs

**Product rigor over content-farm velocity.** Fewer listings, each defensible, rather than a large catalog of unverified entries — the comparison is only valuable if users can trust the filter.

**K8s over simpler hosting.** Deliberate over-engineering for a v0, chosen because the deployment model should match where the product is heading, not minimize the setup cost at launch.

## Honest scope

*(Abhishek: confirm live status, exact ownership boundaries, and whether to publish product metrics before this section is finalized.)* Some readers will bring strong priors about the asset class; the case study's claim is about product engineering discipline and data pipeline rigor, not about the category itself.

- **Role** — CTO · Tech Lead
- **Year** — 2026–now
- **Stack** — K8s, RBAC, CI/CD, high-throughput data pipeline
- **Evidence of** — Data pipeline to web product
- **Canonical** — https://akaushik.org/work/curat-money

</case-study>

---

# Writing

<post slug="building-this-portfolio">

# Building this portfolio

> Why the site ships with the same process I'd bring to a client engagement — PRD, ADRs, ROADMAP, process-gate, agent-readiness, the whole thing in the open.

A portfolio is supposed to show the work. The tension is that *showing* the work usually means showing the surface — the typography, the case studies, the tagline. The decisions underneath are what matter, and they don't fit in a screenshot.

So this one is built a different way: the same process I'd use on a paid client engagement, left in public view, with every decision and trade-off legible to anyone who reads the repo.

## The receipts

- [`docs/PRD.md`](https://github.com/Zireael26/developerabhishek.live/blob/main/docs/PRD.md) — product requirements. What the site has to do, who it's for, what a good reading experience looks like for the MSME owner *and* the senior engineer.
- [`docs/ROADMAP.md`](https://github.com/Zireael26/developerabhishek.live/blob/main/docs/ROADMAP.md) — phased delivery plan. Six phases from scaffold to launch; each phase is a set of one-PR slices.
- [`docs/adr/`](https://github.com/Zireael26/developerabhishek.live/tree/main/docs/adr) — architecture decisions, numbered and dated. Seven so far. Each captures what was decided, what was considered, and why.
- [`docs/AGENT_READINESS.md`](https://github.com/Zireael26/developerabhishek.live/blob/main/docs/AGENT_READINESS.md) — the contract for agent crawlers. llms.txt, sitemap, RFC 8288 Link headers, content negotiation on `Accept: text/markdown`, api-catalog, Agent Skills. Implementation in Phase 4.
- [`docs/CHANGELOG.md`](https://github.com/Zireael26/developerabhishek.live/blob/main/docs/CHANGELOG.md) — every shipped change, categorised, referencing the PR that landed it.
- [`scripts/process-gate.mjs`](https://github.com/Zireael26/developerabhishek.live/blob/main/scripts/process-gate.mjs) — the pre-commit hook that refuses to let me commit code without a CHANGELOG entry (R1), structural changes without an ADR (R2), or epic changes without a ROADMAP update (R3). It ran on every one of the 30+ PRs that built the site.

## Why this way

There are two readers I want to give full signal to.

The **MSME owner** I might work with on Neev is looking for clarity. Can this person explain what they've built? Will it survive contact with the day? The home page and the case studies answer that — in plain English, with honest scope on what worked and what didn't.

The **senior engineer** I might work with on a platform team is looking for rigor. How do they handle change? How do they treat deprecations? Are they the kind of person who ships and forgets, or the kind who leaves a trail? The repo answers that.

Both readers can get what they need without the other's material getting in the way. The portfolio is the entry point; everything under `docs/` is the depth.

## The parts that were load-bearing

A handful of choices shaped the rest:

**Next.js 16 over SvelteKit** (ADR-0001). The Next + Turbopack + R3F ecosystem is where the Vercel-adjacent toolchain lives, and the agent-readiness work in Phase 4 needed Route Handlers + middleware in the shape Next provides. SvelteKit is a fine framework; this wasn't the site for it.

**Process-gate as pre-commit** (ADR-0002). Three rules: code needs CHANGELOG, structural changes need ADR, epic changes need ROADMAP. Enforced by a Node script wired in through `simple-git-hooks`. Cheap to implement, load-bearing against drift.
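
A paraphrased sketch of those three rules (the real logic lives in `scripts/process-gate.mjs`; the file patterns here are illustrative):

```ts
// Paraphrased sketch of the three gate rules. The real implementation
// is scripts/process-gate.mjs; these file patterns are illustrative.
function gate(stagedFiles: string[]): string[] {
  const errors: string[] = [];
  const touches = (re: RegExp) => stagedFiles.some((f) => re.test(f));

  // R1: code changes require a CHANGELOG entry in the same commit.
  if (touches(/^src\//) && !touches(/^docs\/CHANGELOG\.md$/)) {
    errors.push("R1: code change without a CHANGELOG entry");
  }
  // R2: structural changes require an ADR.
  if (touches(/^(next\.config|middleware)/) && !touches(/^docs\/adr\//)) {
    errors.push("R2: structural change without an ADR");
  }
  // R3: epic-level changes require a ROADMAP update.
  if (touches(/^docs\/PRD\.md$/) && !touches(/^docs\/ROADMAP\.md$/)) {
    errors.push("R3: epic change without a ROADMAP update");
  }
  return errors; // non-empty => the commit is refused
}
```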

**MDX with server-only compilation + Shiki bundle isolation** (ADR-0004). `next-mdx-remote@6` compiles MDX inside React Server Components; Shiki stays out of the client bundle. Every PR re-verifies the isolation with `pnpm analyze`.

**Content negotiation, Pattern A and Pattern B** (ADR-0006). Every page has a `.md` alternate at `/page.md` (Pattern B, load-bearing) and responds to `Accept: text/markdown` at the canonical URL (Pattern A, additive). The site passes isitagentready.com's content-negotiation check on both axes, and if one pattern misbehaves the other still holds.
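
A sketch of how Pattern A can sit on top of Pattern B, in Next.js middleware terms; illustrative, not the site's exact implementation:

```ts
// Sketch of Pattern A layered on Pattern B. Route details are
// illustrative, not the site's exact middleware.
import { NextRequest, NextResponse } from "next/server";

export function middleware(request: NextRequest) {
  const accept = request.headers.get("accept") ?? "";
  const { pathname } = request.nextUrl;

  // Pattern A: an agent asking the canonical URL for markdown gets
  // rewritten to the Pattern B alternate that already exists at /page.md.
  if (accept.includes("text/markdown") && !pathname.endsWith(".md")) {
    return NextResponse.rewrite(new URL(`${pathname}.md`, request.url));
  }
  return NextResponse.next();
}
```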

**The Wanderer crane port** (slice 5.1c). 221 lines of vanilla Three.js in the reference design, ported directly into a `useEffect`-driven scene (not R3F — the crane is a fixed-position full-document scene driven by document scroll, and wrapping that through R3F primitives reads worse than the direct port). Eight named POSES, IntersectionObserver-driven pose dispatch, damp lerp, scroll-velocity rotation + wing flap, pointer parallax. Bail-out to SVG fallback if first frame takes > 80ms.
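
The pose-dispatch idea, reduced to a sketch; pose names and values are invented, not the eight POSES of the real port:

```ts
// Sketch of pose dispatch: IntersectionObserver picks the target pose,
// the render loop damps toward it. Poses here are invented examples.
import * as THREE from "three";

const POSES: Record<string, { y: number; wing: number }> = {
  rest: { y: 0, wing: 0 },
  soar: { y: 1.2, wing: 0.8 },
};
let target = POSES.rest;
const current = { y: 0, wing: 0 };

// Each observed section carries a data-pose attribute naming its pose.
const observer = new IntersectionObserver((entries) => {
  for (const e of entries) {
    if (e.isIntersecting) {
      const name = (e.target as HTMLElement).dataset.pose ?? "rest";
      target = POSES[name] ?? POSES.rest;
    }
  }
});
document.querySelectorAll("[data-pose]").forEach((el) => observer.observe(el));

// Inside the render loop: frame-rate-independent damping toward the target.
function tick(dt: number) {
  current.y = THREE.MathUtils.damp(current.y, target.y, 4, dt);
  current.wing = THREE.MathUtils.damp(current.wing, target.wing, 4, dt);
}
```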

## What's still to do

Launch isn't done; it's done enough to ship. The open follow-ups live in `docs/ROADMAP.md` — the honest list, including things like "tighten the JS bundle back toward 150 KiB once the bundle-analyzer audit identifies which preloaded chunk is currently blowing it past target," "write a real `/api/docs` page," and "run the isitagentready.com scan against prod and persist the screenshot to `docs/agent-readiness-snapshots/`."

None of those are in the "you shouldn't launch without this" bucket. They're in the "you should keep shipping past launch" bucket, which is where most real engagements live.

## If any of this looks useful

Every file I've referenced is in the public repo. Clone it, read it, take what's useful. If you're starting a portfolio or a product and you want the same process on it, [hello@akaushik.org](mailto:hello@akaushik.org).

- **Date** — 2026-04-21
- **Canonical** — https://akaushik.org/writing/building-this-portfolio

</post>

---

<post slug="micrograd-makemore">

# What I learned building micrograd and makemore from scratch

> A foundations-first reading of Karpathy's Zero to Hero — why re-implementing the thing is the only way to understand the thing.

There is a version of learning where you watch a video, nod along, feel the concept land, and move on. And there is a version where you close the video and type the thing yourself. They feel the same in the moment. They aren't.

Karpathy's *Zero to Hero* series runs from scalar-valued autograd to a transformer. I did both versions: first I watched, then I rebuilt. The second pass was slower, more frustrating, and the only one that counted.

Micrograd is the smaller of the two projects — a tiny reverse-mode autodiff engine that operates on scalars. Every value is a node; every operation builds the graph; `backward()` walks the graph in topological order and accumulates gradients. The implementation fits in a few hundred lines. What doesn't fit in a few hundred lines is the understanding — specifically, why broadcasting in manual backprop feels different once you've traced a gradient through it yourself rather than assumed PyTorch handles it. It does handle it. You just don't own that fact until you've been the one handling it.
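
The core of that engine translates to any language. A compressed restatement in TypeScript (micrograd itself is Python): `Value` nodes, ops that record a local backward rule, and a topological walk:

```ts
// The micrograd idea restated in TypeScript: every Value is a graph node,
// every op records its inputs and a local backward rule, and backward()
// walks the graph once in reverse topological order.
class Value {
  grad = 0;
  private backwardFn: () => void = () => {};
  constructor(public data: number, private prev: Value[] = []) {}

  add(other: Value): Value {
    const out = new Value(this.data + other.data, [this, other]);
    out.backwardFn = () => {
      this.grad += out.grad;  // d(a+b)/da = 1
      other.grad += out.grad; // d(a+b)/db = 1
    };
    return out;
  }

  mul(other: Value): Value {
    const out = new Value(this.data * other.data, [this, other]);
    out.backwardFn = () => {
      this.grad += other.data * out.grad; // d(a*b)/da = b
      other.grad += this.data * out.grad; // d(a*b)/db = a
    };
    return out;
  }

  backward(): void {
    // Build topological order, then accumulate gradients output-to-input.
    const topo: Value[] = [];
    const visited = new Set<Value>();
    const build = (v: Value) => {
      if (visited.has(v)) return;
      visited.add(v);
      v.prev.forEach(build);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse()) v.backwardFn();
  }
}

// const [a, b] = [new Value(2), new Value(3)];
// const c = a.mul(b).add(a); c.backward(); // a.grad === 4, b.grad === 2
```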

Makemore extends the lesson across a progression: bigram counts, MLP, BatchNorm, WaveNet-style dilated convolutions, and finally a character-level transformer. Each step adds one idea. The exercise isn't to implement all the ideas at once — it's to hold the previous one in your head while you add the next. That sequencing is the pedagogy.

What actually changed after I built them: I stopped treating gradient flow as a magic substrate and started treating it as a data structure I can read. That's not a small thing when you're debugging a production model that's not converging.

The rest of the writing here applies the same habit — go to the foundations before you go to the abstraction — to production systems. Agent frameworks, embedding pipelines, operational AI for businesses that haven't deployed it before. Same method, different stack.

- **Date** — 2026-04-15
- **Canonical** — https://akaushik.org/writing/micrograd-makemore

</post>

---

<post slug="ai-for-msme">

# Notes on bringing AI to an MSME

> WhatsApp, paper ledgers, Tally, and a few spreadsheets. What actually moves the needle, and what doesn't.

The textile distributor I spent time with last year wasn't running his business badly. He was running it the way it works — WhatsApp threads with buyers and suppliers, paper ledgers for stock, Tally for the accountant, a few shared spreadsheets for the delivery schedule. Every piece of that system represents a decision that worked well enough to survive. The implicit argument for AI has to reckon with that.

The most common mistake in bringing AI to a business like this is treating the current system as a problem to be replaced. It isn't. It's a pragmatic equilibrium built on trust relationships, cash-flow timing, and muscle memory. An AI that demands new behavior in exchange for its benefits doesn't get adopted — the cost of the behavior change is too high and too immediate, while the benefit is too abstract and too uncertain.

What does earn its keep: removing typing, removing memory load, and removing the gap between what happened and what got recorded. Order capture by voice or photo, where the AI handles transcription and structuring, fits that pattern. Reconciliation assistance — "here's what the ledger says, here's what WhatsApp says, here are the two discrepancies" — fits it too. Summarization over the operator's own data, not over some generic knowledge base, fits it most naturally of all.
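
To make the reconciliation example concrete: the assistant's whole job can be as small as a diff the operator reads. An illustrative sketch with invented field names:

```ts
// Illustrative sketch of reconciliation assistance: surface discrepancies,
// don't resolve them. Field names are invented for the example.
interface Entry { orderId: string; qty: number; }

function discrepancies(ledger: Entry[], whatsapp: Entry[]): string[] {
  const byOrder = new Map(ledger.map((e) => [e.orderId, e.qty]));
  const out: string[] = [];
  for (const msg of whatsapp) {
    const recorded = byOrder.get(msg.orderId);
    if (recorded === undefined) {
      out.push(`${msg.orderId}: in WhatsApp, missing from ledger`);
    } else if (recorded !== msg.qty) {
      out.push(`${msg.orderId}: ledger says ${recorded}, WhatsApp says ${msg.qty}`);
    }
  }
  return out; // shown to the operator; the resolution stays with them
}
```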

What doesn't earn its keep, at least not yet: autonomous decisioning on contested transactions. When a buyer disputes a quantity or a supplier adjusts an invoice, the resolution is relational, not informational. The operator needs to be in that conversation. An AI that tries to close it will get turned off.

Neev is where this thinking is landing — a modular operations platform for MSMEs, starting with textile distribution. The design constraint is that nothing in the workflow should require the operator to learn a new mental model. That's harder than it sounds.

- **Date** — 2026-03-20
- **Canonical** — https://akaushik.org/writing/ai-for-msme

</post>

---

<post slug="fastembed-to-tei">

# Migrating from Fastembed ONNX to Hugging Face TEI

> The specific trade-offs, the retrieval quality delta on an institutional corpus, and why the infra complexity was worth it.

Fastembed ONNX on CPU is a reasonable place to start a RAG system. The library is easy to install, the model catalog covers the obvious choices, and you can get a working embedding pipeline in an afternoon without standing up any additional infrastructure. For VeriCite's early proof-of-concept, that was the right call.

It stopped being the right call when retrieval quality became the constraint. The model catalog that's practical to serve through Fastembed doesn't include `paraphrase-multilingual-MiniLM-L12-v2` with the throughput we needed on an institutional corpus, and it doesn't include a cross-encoder reranker like `BAAI/bge-reranker-v2-m3` at all. Reranking matters disproportionately on the kind of corpus VeriCite deals with — documents where the lexical overlap between a query and the relevant passage is low, and where a bi-encoder's cosine similarity is a noisy signal.

Hugging Face TEI gives you both. It's a purpose-built inference server for text embeddings and rerankers, with batching tuned for GPU throughput. The trade-off is operational: you're now running GPU nodes, managing batching configuration, and shipping k8s manifests for a TEI sidecar alongside Qdrant. That's real infra complexity, and it doesn't pay for itself unless retrieval quality is actually on the critical path.

In this case it was. After the migration, retrieval quality on the institutional corpus improved measurably — specifically on queries where the relevant document used different vocabulary than the query itself. That's the problem the cross-encoder is solving: it sees the full (query, passage) pair rather than comparing independent embeddings. The bi-encoder gets you to the right neighborhood; the reranker gets you to the right door.
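
"Improved measurably" implies a harness. The usual shape is a labeled query set scored at hit@k, run once per pipeline; sketched here with a hypothetical `retrieve` standing in for the pipeline under test:

```ts
// Sketch of how a retrieval delta gets measured: hit@k over a labeled set.
// `retrieve` (the pipeline under test) and the labeled pairs are assumptions.
interface Labeled { query: string; relevantDocId: string; }

async function hitAtK(
  retrieve: (q: string, k: number) => Promise<string[]>, // returns doc ids
  labeled: Labeled[],
  k = 5,
): Promise<number> {
  let hits = 0;
  for (const { query, relevantDocId } of labeled) {
    const ids = await retrieve(query, k);
    if (ids.includes(relevantDocId)) hits++;
  }
  return hits / labeled.length; // run per pipeline, compare the two scores
}
```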

The infra overhead is now just a fixed cost. The model options going forward are much wider.

- **Date** — 2026-02-18
- **Canonical** — https://akaushik.org/writing/fastembed-to-tei

</post>
