Introduction
In the rapidly evolving landscape of Large Language Model (LLM) applications, vector embeddings have become a crucial component for enhancing AI capabilities. These mathematical representations of text enable machines to understand semantic relationships and power features like intelligent search and contextual memory. Transformers.js, developed by Hugging Face, brings this powerful technology directly to JavaScript environments, allowing developers to generate embeddings locally without relying on external services.
Local embedding generation offers several advantages, including improved privacy, reduced latency, and lower operational costs. Whether you’re building an AI agent with memory capabilities or implementing Retrieval Augmented Generation (RAG), having control over your embedding pipeline is invaluable.
Understanding the Context
What are Vector Embeddings?
Vector embeddings transform text into high-dimensional numerical arrays that capture semantic meaning. When text is converted into these vector representations, similar concepts cluster together in the vector space, enabling mathematical comparisons of textual content. For example, the phrases “I love programming” and “I enjoy coding” would have similar vector representations despite using different words.
These embeddings typically consist of arrays with hundreds of dimensions (usually 384-768), each dimension contributing to the overall semantic representation of the text.
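To make the idea of a mathematical comparison concrete, here's a tiny self-contained sketch using made-up four-dimensional vectors; real embeddings have hundreds of dimensions, and the numbers below are invented purely for illustration:
// Cosine similarity: 1 means pointing the same way, 0 means unrelated, -1 means opposite
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const magB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (magA * magB);
}

// Invented toy vectors standing in for "I love programming", "I enjoy coding", and "It is raining"
const loveProgramming = [0.8, 0.1, 0.5, 0.2];
const enjoyCoding     = [0.7, 0.2, 0.6, 0.1];
const raining         = [0.1, 0.9, 0.0, 0.7];

console.log(cosineSimilarity(loveProgramming, enjoyCoding).toFixed(3)); // high (~0.98)
console.log(cosineSimilarity(loveProgramming, raining).toFixed(3));     // much lower (~0.28)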
Common Use Cases
Vector embeddings are fundamental to several modern LLM applications:
- LLM Agent Memory Systems: Agents use embeddings to store and retrieve experiences and knowledge, enabling more contextual and informed responses.
- Retrieval Augmented Generation (RAG): Systems use embeddings to find relevant documents or context before generating responses, improving accuracy and relevance.
- Semantic Search: Embeddings enable searching by meaning rather than just keywords, delivering more intuitive results.
Getting Started with Transformers.js
Installation and Setup
To begin using Transformers.js for embedding generation, first install the package:
npm install @xenova/transformers
The library requires a modern JavaScript environment with support for ES modules. It works in both browser and Node.js contexts without additional dependencies.
For TypeScript users, you’ll want to set up a proper environment:
npm install typescript tsx @types/node
Using tsx makes it easy to run TypeScript files directly:
npx tsx your-script.ts
Key Features for Embeddings
Transformers.js offers several important features:
- Supported Models: Compatible with popular embedding models like all-MiniLM-L6-v2 and all-mpnet-base-v2
- Environment Flexibility: Runs in browsers and Node.js environments
- Hardware Acceleration: Runs on ONNX Runtime, using WebAssembly in the browser, with WebGPU support in newer releases
- Quantized Models: Supports compressed models for efficient deployment (see the configuration sketch below)
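As a quick illustration of the quantization and caching knobs, here's a minimal configuration sketch. The cacheDir path and the shape of the progress callback's payload are assumptions for this sketch, not fixed parts of the API:
import { pipeline, env } from '@xenova/transformers';

// Optional (Node.js): choose where downloaded model files are cached (path is an arbitrary example)
env.cacheDir = './.model-cache';

async function loadQuantizedEmbedder() {
  // quantized: true loads the smaller int8 ONNX weights (the default in this library version);
  // progress_callback reports download and initialization progress
  const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
    quantized: true,
    progress_callback: (progress: any) => console.log(progress.status, progress.file ?? ''),
  });
  return embedder;
}

loadQuantizedEmbedder().catch(console.error);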
Code Example: Basic Setup
Here’s how to get started with basic embedding generation in TypeScript (embedding-example.ts):
import { pipeline } from '@xenova/transformers';
// Basic TypeScript type for the embedding output (the pipeline returns a Tensor-like object)
type EmbeddingOutput = {
data: Float32Array;
dims: number[];
};
// Async function to run the example
async function generateEmbedding() {
// Initialize the embedding pipeline
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
// Generate your first embedding
const text = "Hello, world!";
const output = await embedder(text, {
pooling: 'mean',
normalize: true
}) as EmbeddingOutput;
console.log(`Generated embedding with ${output.data.length} dimensions`);
console.log("First few values:", output.data.slice(0, 5));
}
// Execute the function
generateEmbedding().catch(console.error);
Code Explanation:
- We start by importing the pipeline function from Transformers.js, which provides a high-level API for various NLP tasks.
- We define a TypeScript type EmbeddingOutput to describe the structure of the result, which includes the embedding values (data) and the tensor's dimension information (dims).
- Inside our generateEmbedding function, we initialize the pipeline with the task 'feature-extraction' and the model 'Xenova/all-MiniLM-L6-v2', a compact but capable embedding model.
- We then pass a simple text string to the model along with configuration options:
  - pooling: 'mean' averages all token embeddings to create a single vector for the entire text
  - normalize: true scales the vector to unit length, which is important for similarity comparisons (demonstrated in the sketch after the run command below)
- Finally, we log the embedding dimension (384 for this model) and show the first few values of the embedding vector.
To run this example:
npx tsx embedding-example.ts
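Because normalize: true produces unit-length vectors, the dot product of two embeddings is exactly their cosine similarity. Here's a quick sketch using the same model; the dot helper and the two sample phrases from earlier are just for illustration:
import { pipeline } from '@xenova/transformers';

// Dot product of two equal-length vectors; with normalized embeddings this equals cosine similarity
function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

async function compareSentences() {
  const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

  const a = await embedder("I love programming", { pooling: 'mean', normalize: true });
  const b = await embedder("I enjoy coding", { pooling: 'mean', normalize: true });

  // Expect a high similarity for these two semantically close sentences
  console.log("Similarity:", dot(a.data, b.data).toFixed(4));
}

compareSentences().catch(console.error);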
Implementing Vector Embeddings
Core Embedding Functionality
When implementing vector embeddings, several key aspects need consideration:
Model Selection:
- all-MiniLM-L6-v2: Balanced performance and size (384 dimensions)
- all-mpnet-base-v2: Higher accuracy but larger size (768 dimensions)
Configuration Options:
- Pooling strategy (mean, cls, max; see the sketch after this list)
- Normalization settings
- Batch size for multiple inputs
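To see how these options are passed in practice, here's a small sketch comparing mean and cls pooling on the same sentence; the dims values in the final comment assume the all-MiniLM-L6-v2 model used throughout this article:
import { pipeline } from '@xenova/transformers';

async function comparePooling() {
  const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
  const text = "Pooling determines how token vectors are combined into one sentence vector.";

  // Mean pooling: average all token embeddings (a common default for sentence similarity)
  const meanPooled = await embedder(text, { pooling: 'mean', normalize: true });

  // CLS pooling: use the [CLS] token's embedding as the sentence representation
  const clsPooled = await embedder(text, { pooling: 'cls', normalize: true });

  console.log(meanPooled.dims, clsPooled.dims); // e.g. [1, 384] for both with this model
}

comparePooling().catch(console.error);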
Code Walkthrough
Here’s a comprehensive example of embedding generation in TypeScript (embedding-generator.ts):
import { pipeline } from '@xenova/transformers';
class EmbeddingGenerator {
private embedder: any;
constructor() {
this.embedder = null;
}
async initialize() {
console.log("Initializing embedding model...");
this.embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
console.log("Model loaded successfully");
}
async generateEmbedding(text: string): Promise<number[]> {
if (!this.embedder) {
throw new Error("Embedder not initialized. Call initialize() first.");
}
const output = await this.embedder(text, {
pooling: 'mean',
normalize: true
});
// output.data is a Float32Array; convert to a plain array to match the declared return type
return Array.from(output.data);
}
async batchProcess(texts: string[]): Promise<number[][]> {
return await Promise.all(texts.map(text => this.generateEmbedding(text)));
}
}
// Example usage
async function runExample() {
const generator = new EmbeddingGenerator();
await generator.initialize();
const embedding = await generator.generateEmbedding("This is a test sentence.");
console.log(`Embedding length: ${embedding.length}`);
const batchResults = await generator.batchProcess([
"First example text",
"Second different example"
]);
console.log(`Processed ${batchResults.length} embeddings`);
// Get first few embeddings for each result
batchResults.forEach((data) => {
console.log("First few values:", data.slice(0, 5));
});
}
// Run the example
runExample().catch(err => console.error("Error:", err));
Code Explanation:
- This example encapsulates embedding functionality in a reusable EmbeddingGenerator class for better structure and maintainability.
- The class follows a two-step initialization pattern:
  - The constructor creates an instance but doesn't load the model right away
  - The initialize() method loads the model asynchronously, allowing for controlled startup
  - This approach helps manage resource usage and application flow
- The generateEmbedding() method includes error handling to ensure the model is initialized before use.
- The batchProcess() method demonstrates how to process multiple texts concurrently with Promise.all.
- In the example usage, we:
  - Create and initialize the generator
  - Process a single text to demonstrate basic usage
  - Process multiple texts to demonstrate batch capabilities
  - Display useful information about the results
This class-based design provides a foundation you can extend for more complex applications, with clear separation of concerns and reusable embedding logic.
To run this example:
npx tsx embedding-generator.ts
Best Practices
To optimize your embedding implementation:
- Cache frequently used embeddings
- Process texts in batches when possible
- Implement proper error handling
- Monitor memory usage and implement cleanup strategies
- Use Web Workers for non-blocking operations in browser environments (see the sketch after this list)
- Consider model quantization for reduced memory footprint
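For the Web Worker point, here's a minimal sketch of moving embedding work off the main thread in a browser app. It assumes a bundler that understands module workers (for example Vite or webpack 5); the file name embedding-worker.ts and the message shape are illustrative choices, not part of Transformers.js:
// embedding-worker.ts (runs inside the Web Worker, off the main thread)
import { pipeline } from '@xenova/transformers';

const embedderPromise = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

self.onmessage = async (event: MessageEvent<{ id: number; text: string }>) => {
  const embedder = await embedderPromise;
  const output = await embedder(event.data.text, { pooling: 'mean', normalize: true });
  // Send a plain array back so the main thread doesn't need Tensor types
  (self as any).postMessage({ id: event.data.id, embedding: Array.from(output.data) });
};

// main.ts (main thread): create the worker and request an embedding
const worker = new Worker(new URL('./embedding-worker.ts', import.meta.url), { type: 'module' });
worker.onmessage = (event) => {
  console.log(`Embedding ${event.data.id} ready with ${event.data.embedding.length} dimensions`);
};
worker.postMessage({ id: 1, text: 'Hello from the main thread' });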
Practical Applications
Building a RAG System
Implementing a RAG system with local embeddings involves several components. Here’s a TypeScript implementation (rag-system.ts):
// rag-system.ts
import { pipeline } from '@xenova/transformers';
interface Document {
text: string;
metadata?: Record<string, any>;
}
interface SearchResult {
document: Document;
similarity: number;
}
// Simple in-memory vector store implementation
class SimpleVectorStore {
private vectors: Array<{ id: number, vector: number[] }> = [];
private dimension: number;
constructor(dimension: number) {
this.dimension = dimension;
}
addPoint(vector: number[], id: number): void {
if (vector.length !== this.dimension) {
throw new Error(`Vector dimension mismatch: expected ${this.dimension}, got ${vector.length}`);
}
this.vectors.push({ id, vector });
}
searchKnn(queryVector: number[], k: number): [number[], number[]] {
if (queryVector.length !== this.dimension) {
throw new Error(`Query vector dimension mismatch: expected ${this.dimension}, got ${queryVector.length}`);
}
// Calculate distances and sort
const withDistances = this.vectors.map(item => ({
id: item.id,
distance: this.calculateDistance(queryVector, item.vector)
}));
// Sort by distance (ascending)
const sorted = withDistances.sort((a, b) => a.distance - b.distance);
// Take top k
const topK = sorted.slice(0, k);
// Split into separate arrays for ids and distances
const ids = topK.map(item => item.id);
const distances = topK.map(item => item.distance);
return [ids, distances];
}
// Cosine distance (1 - cosine similarity); smaller means more similar
private calculateDistance(vecA: number[], vecB: number[]): number {
return 1 - this.cosineSimilarity(vecA, vecB);
}
// Cosine similarity
private cosineSimilarity(vecA: number[], vecB: number[]): number {
const dotProduct = vecA.reduce((sum, val, i) => sum + val * vecB[i], 0);
const magA = Math.sqrt(vecA.reduce((sum, val) => sum + val * val, 0));
const magB = Math.sqrt(vecB.reduce((sum, val) => sum + val * val, 0));
return dotProduct / (magA * magB);
}
}
class RAGSystem {
private dimension: number;
private vectorStore: SimpleVectorStore;
private embedder: any;
private documents: Map<number, Document>;
constructor(dimension = 384) {
this.dimension = dimension;
this.vectorStore = new SimpleVectorStore(dimension);
this.embedder = null;
this.documents = new Map();
}
async initialize() {
console.log("Initializing embedding model...");
this.embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
console.log("Model loaded successfully");
}
async addDocument(text: string, metadata: Record<string, any> = {}): Promise<number> {
const embedding = await this.generateEmbedding(text);
const id = Date.now() + Math.floor(Math.random() * 1000); // Simple timestamp-based ID (not collision-proof)
this.vectorStore.addPoint(embedding, id);
this.documents.set(id, { text, metadata });
return id;
}
async query(question: string, k = 5): Promise<SearchResult[]> {
const queryEmbedding = await this.generateEmbedding(question);
const [indices, distances] = this.vectorStore.searchKnn(queryEmbedding, k);
return indices.map((id: number, i: number) => ({
document: this.documents.get(id) as Document,
similarity: 1 - distances[i] // Convert distance to similarity
}));
}
async generateEmbedding(text: string): Promise<number[]> {
const output = await this.embedder(text, {
pooling: 'mean',
normalize: true
});
// output.data is a Float32Array; convert to a plain array to match the declared return type
return Array.from(output.data);
}
}
// Usage example
async function testRAG() {
console.log("Creating RAG system...");
const rag = new RAGSystem();
console.log("Initializing RAG system...");
await rag.initialize();
// Add some documents
console.log("Adding documents to RAG system...");
await rag.addDocument("JavaScript is a programming language often used in web development.",
{ category: "programming" });
await rag.addDocument("Python is known for its simplicity and readability.",
{ category: "programming" });
await rag.addDocument("Climate change is affecting global weather patterns.",
{ category: "environment" });
// Query the system
console.log("\nQuerying the RAG system...");
const results = await rag.query("Tell me about coding languages");
console.log("\nQuery results:");
results.forEach((result, i) => {
console.log(`${i+1}. ${result.document.text}`);
console.log(` Category: ${result.document.metadata?.category}`);
console.log(` Similarity: ${result.similarity.toFixed(4)}`);
});
}
// Run the example
console.log("Starting RAG system example...");
testRAG().catch(error => {
console.error("Error in RAG system:", error);
});
Code Explanation:
- This implementation creates a complete Retrieval Augmented Generation (RAG) system with two main components:
  - SimpleVectorStore: an in-memory vector database that:
    - Stores vectors with associated IDs
    - Validates vector dimensions for consistency
    - Implements k-nearest neighbors (KNN) search using cosine similarity
    - Returns the closest matching documents by similarity score
  - RAGSystem: the main RAG implementation that:
    - Manages both the embedding model and the vector store
    - Provides methods to add documents with metadata
    - Allows querying the system to find relevant documents
    - Handles the conversion between text and embeddings
- The search process works as follows:
  - When adding a document, we generate its embedding and store both the vector and the original text
  - When querying, we convert the question to an embedding vector
  - We use KNN search to find the most similar document vectors
  - Results are returned with similarity scores and the original document content
- Altogether, the example demonstrates a working RAG pipeline that indexes documents with metadata, searches for relevant information, ranks results by semantic similarity, and exposes both the content and the similarity scores.
The example can be extended to handle larger document collections by swapping the in-memory store for a persistent vector database like Pinecone, Milvus, or Qdrant.
To run this RAG example:
npx tsx rag-system.ts
Implementing Agent Memory
Agent memory systems require both short-term and long-term storage strategies. Here’s a TypeScript implementation (agent-memory.ts):
import { pipeline } from '@xenova/transformers';
interface Memory {
text: string;
timestamp: number;
importance: string;
context?: Record<string, any>;
}
class AgentMemory {
private shortTermMemory: Memory[];
private longTermStore: any;
private maxShortTermSize: number;
private embedder: any;
constructor() {
this.shortTermMemory = [];
this.longTermStore = null; // Will be initialized
this.maxShortTermSize = 100;
this.embedder = null;
}
async initialize() {
this.embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
// Initialize vector store or database for long-term memory
this.longTermStore = new VectorStore(this.embedder);
await this.longTermStore.initialize();
}
async addMemory(text: string, importance: string = 'normal', context: Record<string, any> = {}) {
// Create memory object
const memory: Memory = {
text,
timestamp: Date.now(),
importance,
context
};
// Add to short-term memory
this.shortTermMemory.push(memory);
// Manage short-term memory size
if (this.shortTermMemory.length > this.maxShortTermSize) {
await this.consolidateMemory();
}
// Store in long-term memory
await this.longTermStore.addDocument(text, { importance, ...context });
}
async consolidateMemory() {
// Older entries are already stored in long-term memory (see addMemory), so just trim the short-term buffer
this.shortTermMemory = this.shortTermMemory.slice(-this.maxShortTermSize);
}
async recall(query: string, limit: number = 5) {
// For simplicity, recall from long-term memory only; a fuller implementation
// could also merge in recent short-term entries
const longTermResults = await this.longTermStore.query(query, limit);
return longTermResults;
}
}
// Simple Vector Store Implementation for the example
class VectorStore {
private embedder: any;
private documents: Array<{text: string, embedding: number[], metadata: any}>;
constructor(embedder: any) {
this.embedder = embedder;
this.documents = [];
}
async initialize() {
// Setup code if needed
}
async addDocument(text: string, metadata: any = {}) {
const embedding = await this.generateEmbedding(text);
this.documents.push({
text,
embedding,
metadata
});
}
async query(question: string, k: number = 5) {
const queryEmbedding = await this.generateEmbedding(question);
// Compute similarities with all documents
const withSimilarity = this.documents.map(doc => ({
document: { text: doc.text, metadata: doc.metadata },
similarity: this.cosineSimilarity(queryEmbedding, doc.embedding)
}));
// Sort by similarity (descending)
return withSimilarity
.sort((a, b) => b.similarity - a.similarity)
.slice(0, k);
}
async generateEmbedding(text: string) {
const output = await this.embedder(text, {
pooling: 'mean',
normalize: true
});
return output.data;
}
cosineSimilarity(vecA: number[], vecB: number[]): number {
const dotProduct = vecA.reduce((sum, val, i) => sum + val * vecB[i], 0);
const magA = Math.sqrt(vecA.reduce((sum, val) => sum + val * val, 0));
const magB = Math.sqrt(vecB.reduce((sum, val) => sum + val * val, 0));
return dotProduct / (magA * magB);
}
}
// Example usage
async function testAgentMemory() {
const memory = new AgentMemory();
await memory.initialize();
// Add some memories
await memory.addMemory("User asked about JavaScript frameworks", "high",
{ topic: "programming" });
await memory.addMemory("User mentioned they're working on a React project", "normal",
{ topic: "programming", framework: "React" });
await memory.addMemory("User seems frustrated with TypeScript configuration", "high",
{ topic: "programming", sentiment: "negative" });
// Recall relevant information
const results = await memory.recall("What JavaScript frameworks has the user mentioned?");
console.log("Agent memory recall results:");
results.forEach((result: any, i: number) => {
console.log(`${i+1}. ${result.document.text} (Relevance: ${result.similarity.toFixed(4)})`);
});
}
// Run the example
testAgentMemory().catch(console.error);
Code Explanation:
- This implementation models an AI agent’s memory system with both short-term and long-term storage:
  - AgentMemory class: the main memory management system that:
    - Maintains an array of recent memories for quick access (short-term)
    - Uses a vector store for semantic storage and retrieval (long-term)
    - Handles memory consolidation when the short-term buffer reaches capacity
    - Provides context-aware memory recall through semantic search
  - Memory interface: structures memory entries with:
    - The actual text content
    - A timestamp for temporal tracking
    - An importance level for prioritization
    - Additional context as key-value metadata
  - VectorStore class: a simplified vector database that:
    - Stores documents with their embeddings and metadata
    - Provides semantic search using cosine similarity
    - Reuses the embedding model for consistency
- The memory system works through three key operations:
  - addMemory(): stores a new memory in both short-term and long-term memory, with importance and context
  - consolidateMemory(): manages short-term memory size by dropping older entries (which are already persisted long-term)
  - recall(): retrieves relevant memories based on semantic similarity to a query
- The example demonstrates creating an agent memory system, adding several memories with different importance levels and context, retrieving memories relevant to a specific query, and how semantic search provides more intelligent recall than keyword matching.
This pattern is particularly valuable for AI agents that need to maintain conversation context, remember user preferences, or accumulate knowledge over time.
To run this agent memory example:
npx tsx agent-memory.ts
Advanced Topics
Performance Optimization
To optimize performance in production environments:
Implement Caching (embedding-cache.ts):
import { pipeline } from '@xenova/transformers';
class EmbeddingCache {
private cache: Map<string, number[]>;
private maxSize: number;
constructor() {
this.cache = new Map();
this.maxSize = 1000;
}
async getEmbedding(text: string, generator: (text: string) => Promise<number[]>): Promise<number[]> {
const hash = this.hashText(text);
if (this.cache.has(hash)) {
return this.cache.get(hash)!;
}
const embedding = await generator(text);
this.cache.set(hash, embedding);
this.maintainCacheSize();
return embedding;
}
// Creates a simple hash for text to use as cache key
hashText(text: string): string {
let hash = 0;
for (let i = 0; i < text.length; i++) {
hash = ((hash << 5) - hash) + text.charCodeAt(i);
hash |= 0; // Convert to 32bit integer
}
return hash.toString();
}
// Ensures cache doesn't exceed maximum size
maintainCacheSize(): void {
if (this.cache.size > this.maxSize) {
// Remove oldest entries (FIFO)
const keysToDelete = Array.from(this.cache.keys())
.slice(0, this.cache.size - this.maxSize);
keysToDelete.forEach(key => this.cache.delete(key));
}
}
}
// Example usage
async function embeddingWithCache() {
const cache = new EmbeddingCache();
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const generateEmbedding = async (text: string) => {
const output = await embedder(text, {
pooling: 'mean',
normalize: true
});
return output.data;
};
console.time("first-embedding");
const embedding1 = await cache.getEmbedding("Hello world", generateEmbedding);
console.timeEnd("first-embedding");
console.time("cached-embedding");
const embedding2 = await cache.getEmbedding("Hello world", generateEmbedding);
console.timeEnd("cached-embedding");
console.log("Embedding dimensions:", embedding1.length);
}
// Run the example
embeddingWithCache().catch(console.error);
Code Explanation:
- The EmbeddingCache class implements a memory-efficient caching layer for embedding vectors:
  - Core functionality:
    - Stores embedding vectors in a Map keyed by a hash of the input text
    - Enforces a size limit to prevent unbounded memory growth
    - Uses a simple but effective string hashing function
    - Manages the cache automatically
  - Main method, getEmbedding():
    - Takes both the text to embed and a generator function
    - Checks whether the text’s hash is already in the cache before generating
    - Only calls the expensive embedding generation when needed
    - Automatically caches new results
  - Cache management:
    - The maintainCacheSize() method enforces the maximum cache size
    - Uses a FIFO (first-in, first-out) eviction policy for simplicity
    - Removes the oldest entries when the cache exceeds its size limit
- The example demonstrates the cache’s effectiveness:
  - The first call generates and caches an embedding (slower)
  - The second call retrieves it from the cache (nearly instantaneous)
  - The time measurements show the performance difference
- This caching approach reduces redundant computation, speeds up applications with repetitive queries, conserves resources (especially in web environments), improves response times, and scales with user interaction patterns.
As shown in the timing comparison, subsequent embedding requests for the same text will be orders of magnitude faster when served from the cache.
To run the caching example:
npx tsx embedding-cache.ts
Batch Processing Example (batch-processor.ts):
import { pipeline } from '@xenova/transformers';
class BatchProcessor {
private embedder: any;
private batchSize: number;
constructor(batchSize: number = 16) {
this.embedder = null;
this.batchSize = batchSize;
}
async initialize() {
this.embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
}
async processBatch(texts: string[]): Promise<number[][]> {
if (!this.embedder) throw new Error("Embedder not initialized");
const batches = [];
// Create batches of optimal size
for (let i = 0; i < texts.length; i += this.batchSize) {
batches.push(texts.slice(i, i + this.batchSize));
}
const results = [];
// Process each batch
for (const batch of batches) {
console.log(`Processing batch of ${batch.length} texts`);
const batchPromises = batch.map(text => this.generateEmbedding(text));
const batchResults = await Promise.all(batchPromises);
results.push(...batchResults);
}
return results;
}
private async generateEmbedding(text: string): Promise<number[]> {
const output = await this.embedder(text, {
pooling: 'mean',
normalize: true
});
// output.data is a Float32Array; convert to a plain array to match the declared return type
return Array.from(output.data);
}
}
// Example usage
async function testBatchProcessing() {
const processor = new BatchProcessor(8);
await processor.initialize();
// Generate sample texts
const texts = Array.from({length: 25}, (_, i) =>
`This is sample text number ${i+1} for batch processing.`
);
console.time("batch-processing");
const embeddings = await processor.processBatch(texts);
console.timeEnd("batch-processing");
console.log(`Processed ${embeddings.length} embeddings`);
console.log(`Each embedding has ${embeddings[0].length} dimensions`);
}
testBatchProcessing().catch(console.error);
Code Explanation:
- The BatchProcessor class implements efficient batch processing for multiple texts:
  - Chunking strategy:
    - Takes a potentially large array of texts
    - Divides it into smaller batches based on batchSize
    - Prevents overwhelming the system with too many parallel operations
    - Balances memory usage against processing speed
  - Parallel processing:
    - Uses Promise.all() to process the texts within each batch in parallel
    - Maximizes throughput while keeping resource usage manageable
    - Maps each text to its embedding generation call
  - Configuration:
    - The constructor accepts a custom batchSize
    - The default value of 16 works well in most environments
    - Can be tuned based on available memory and CPU resources
- The example processes 25 texts in batches of 8:
  - Creates a processor with the specified batch size
  - Generates sample texts programmatically
  - Processes all texts while measuring the elapsed time
  - Reports the total number of embeddings produced
- Batch processing makes more efficient use of computational resources, manages memory better for large datasets, keeps the browser or application responsive, provides progress feedback for long-running operations, and scales to large document collections.
This pattern is essential when working with large document collections, allowing you to process hundreds or thousands of documents without memory issues.
To run the batch processing example:
npx tsx batch-processor.ts
Scaling Considerations
When scaling your embedding system:
- Implement streaming or chunking for large documents (see the sketch after this list)
- Consider distributed storage solutions
- Monitor and optimize memory usage
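As one concrete starting point for the first item, here's a minimal sketch of splitting a long document into overlapping chunks before embedding. The chunk size, overlap, and helper names are illustrative choices, not library APIs:
// Split text into overlapping character chunks so context isn't lost at the boundaries
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// Embed chunks one at a time to keep memory usage flat; each embedding should be stored
// alongside a reference to the chunk and its source document
async function embedLargeDocument(
  embed: (text: string) => Promise<number[]>,
  text: string
): Promise<{ chunk: string; embedding: number[] }[]> {
  const results: { chunk: string; embedding: number[] }[] = [];
  for (const chunk of chunkText(text)) {
    results.push({ chunk, embedding: await embed(chunk) });
  }
  return results;
}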
Conclusion
Local vector embeddings with Transformers.js provide a powerful solution for building sophisticated LLM applications. By implementing embeddings directly in JavaScript or TypeScript, developers can create privacy-focused, efficient systems for both RAG and agent memory applications.