Indexing Guide
InputLayer provides HNSW (Hierarchical Navigable Small World) indexes for fast approximate nearest neighbor search on vector data.
Why Use Indexes?
Without an index, vector similarity queries perform a linear scan:
- 10K vectors: ~10ms
- 100K vectors: ~100ms
- 1M vectors: ~1s
With an HNSW index:
- 10K vectors: ~1ms
- 100K vectors: ~5ms
- 1M vectors: ~10ms
Trade-off: Indexes use memory and may return approximate (not exact) results.
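To make the linear-scan baseline concrete, here is a minimal Python sketch of an exact nearest-neighbor search. It is an illustration only, not InputLayer's implementation; the function name linear_scan and the choice of Euclidean distance are assumptions made for brevity. Every stored vector is scored against the query, which is why the cost grows in proportion to the number of vectors.

```python
import heapq
import math

def linear_scan(vectors, query, k):
    """Exact k-NN: compute the distance to every vector and keep the k smallest."""
    scored = ((math.dist(vec, query), vec_id) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored)

# Toy data in the shape of the documents example (id -> embedding).
documents = {
    1: [0.1, 0.2, 0.3, 0.4],
    2: [0.15, 0.25, 0.28, 0.42],
    3: [0.8, 0.1, 0.05, 0.05],
}

# Each result is a (distance, id) pair, closest first.
results = linear_scan(documents, [0.1, 0.2, 0.3, 0.4], k=2)
```

An HNSW index avoids visiting every vector, which is where both the speedup and the approximation come from.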
Creating an Index
Basic Syntax
.index create <name> on <relation>(<column>) [options]
Simple Example
// Create a documents table with embeddings
+documents(id: int, title: string, embedding: vector)
// Insert some documents
+documents(1, "Introduction to ML", [0.1, 0.2, 0.3, 0.4])
+documents(2, "Vector Databases", [0.15, 0.25, 0.28, 0.42])
+documents(3, "Graph Theory", [0.8, 0.1, 0.05, 0.05])
.index create doc_emb_idx on documents(embedding)
With Options
.index create doc_emb_idx on documents(embedding) metric cosine m 16 ef_search 50
Index Options
Distance Metrics
| Metric | Aliases | Use Case |
|---|---|---|
| cosine | cos | Text embeddings (most common) |
| euclidean | l2, euclid | Image embeddings |
| dot | dotproduct, inner | When vectors have meaningful magnitude |
| manhattan | l1, taxicab | Sparse vectors |
Default: cosine
.index create my_idx on vectors(embedding) metric l2
.index create my_idx on vectors(embedding) metric cosine
.index create my_idx on vectors(embedding) metric dot
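For reference, the four metrics can be sketched in Python using their textbook definitions. This guide does not state how InputLayer normalizes or signs each score (for example, whether dot is negated so that smaller means closer), so treat the function names and return conventions below as assumptions.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; ignores magnitude, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance."""
    return math.dist(a, b)

def dot_product(a, b):
    """Inner product; larger means more similar, and magnitude matters."""
    return sum(x * y for x, y in zip(a, b))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# b points in the same direction as a but is twice as long.
a = [1.0, 2.0]
b = [2.0, 4.0]
scores = {
    "cosine": cosine_distance(a, b),
    "euclidean": euclidean_distance(a, b),
    "dot": dot_product(a, b),
    "manhattan": manhattan_distance(a, b),
}
```

In this example the cosine distance is effectively zero because the vectors share a direction, while the other three metrics all register the difference in length.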
HNSW Parameters
| Parameter | Default | Description |
|---|---|---|
| m | 16 | Max connections per node (higher = better recall, more memory) |
| ef_construction | 200 | Construction-time ef (higher = better quality, slower build) |
| ef_search | 50 | Search-time ef (higher = better recall, slower search) |
.index create my_idx on vectors(embedding) m 32 ef_search 100
Parameter Tuning
For higher recall (more accurate results):
.index create my_idx on vectors(embedding) m 32 ef_construction 400 ef_search 100
For faster search (lower recall):
.index create my_idx on vectors(embedding) m 8 ef_search 20
For large datasets (millions of vectors):
.index create my_idx on vectors(embedding) m 48 ef_construction 500 ef_search 200
Managing Indexes
List All Indexes
.index
or
.index list
Output:
| Name | Relation | Column | Type | Metric | Valid |
|---|---|---|---|---|---|
| doc_emb_idx | documents | embedding | hnsw | cosine | yes |
View Index Statistics
.index stats doc_emb_idx
Output:
Index: doc_emb_idx
Relation: documents
Column: embedding
Type: hnsw
Metric: cosine
Vectors: 10000
Dimension: 768
Valid: yes
Tombstones: 0
Built: 2024-01-15 10:30:00
Rebuild an Index
After many insertions/deletions, an index may become fragmented. Rebuild to optimize:
.index rebuild doc_emb_idx
Drop an Index
.index drop doc_emb_idx
Using Indexes in Queries
Use the hnsw_nearest() predicate in query bodies to perform fast approximate nearest-neighbor search via an HNSW index.
Syntax
hnsw_nearest("index_name", QueryVec, K, IdVar, DistVar [, EfSearch])
- index_name: String literal naming the HNSW index
- QueryVec: Variable bound to a vector, or a vector literal such as [1.0, 2.0]
- K: Integer, number of nearest neighbors to return
- IdVar: Variable to bind result tuple IDs
- DistVar: Variable to bind distances
- EfSearch: Optional integer that overrides ef_search for this query
Examples
// Find 5 nearest documents to a literal query vector
? hnsw_nearest("doc_emb_idx", [0.11, 0.21, 0.29, 0.41], 5, Id, Dist)
// Use a bound query vector and join with base data
? query_vec(QV), hnsw_nearest("doc_emb_idx", QV, 10, Id, Dist), documents(Id, Title, _)
// Override ef_search for higher recall
? hnsw_nearest("doc_emb_idx", [0.11, 0.21, 0.29, 0.41], 5, Id, Dist, 200)
Note: Indexes are not used automatically — you must explicitly call hnsw_nearest() to use an HNSW index.
Index Lifecycle
Build Phase
When you create an index, vectors are inserted incrementally:
- Index is registered with metadata
- Existing vectors are added to the HNSW structure
- Index is marked as valid
Invalidation
Indexes are automatically invalidated when:
- Base relation is modified (insert/delete)
- Schema changes
.index stats my_idx
Valid: no ← Index needs rebuild
Rebuild
Invalid indexes are rebuilt on:
- Explicit .index rebuild command
- Next query that uses the index
Index Architecture
HNSW Structure
Layer 3: *-------------*
| |
Layer 2: *---*-----*---*---*
| | | | |
Layer 1: *-*-*-*-*-*-*-*-*-*-*
| | | | | | | | | | |
Layer 0: *********************** (all nodes)
Each higher layer contains fewer nodes than the one below it. Search starts at the top layer and descends:
- Find nearest nodes in current layer
- Use those as entry points for next layer
- Repeat until layer 0
- Return k nearest neighbors
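The descent described above can be sketched in Python. This is a simplified illustration under stated assumptions, not InputLayer's code: layers are plain adjacency dictionaries, the upper layers use a greedy walk, and only layer 0 keeps a candidate list of size ef. All names (greedy_descend, search_layer0, hnsw_search) are hypothetical.

```python
import heapq
import math

def greedy_descend(graph, vectors, query, entry):
    """Move to the closest neighbor until no neighbor improves on the current node."""
    current = entry
    while True:
        closest = min(graph.get(current, ()),
                      key=lambda n: math.dist(vectors[n], query), default=None)
        if closest is None or (math.dist(vectors[closest], query)
                               >= math.dist(vectors[current], query)):
            return current
        current = closest

def search_layer0(graph, vectors, query, entry, ef):
    """Best-first search that keeps the ef closest nodes seen so far."""
    start = math.dist(vectors[entry], query)
    visited = {entry}
    candidates = [(start, entry)]   # min-heap: closest unexplored node first
    best = [(-start, entry)]        # max-heap (negated): worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(best) >= ef and dist > -best[0][0]:
            break                   # nothing left can improve the result set
        for neighbor in graph.get(node, ()):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            d = math.dist(vectors[neighbor], query)
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, neighbor))
                heapq.heappush(best, (-d, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)
    return sorted((-neg, node) for neg, node in best)

def hnsw_search(layers, vectors, query, entry, k, ef):
    """layers[0] is the bottom layer; descend from the top, then search layer 0."""
    for graph in reversed(layers[1:]):
        entry = greedy_descend(graph, vectors, query, entry)
    return search_layer0(layers[0], vectors, query, entry, max(ef, k))[:k]

# Toy index: eight points on a line, with sparser links in the upper layers.
vectors = {i: (float(i), 0.0) for i in range(8)}
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
layer1 = {0: [4], 4: [0, 7], 7: [4]}
layer2 = {0: [7], 7: [0]}
hits = hnsw_search([layer0, layer1, layer2], vectors,
                   query=(5.2, 0.0), entry=0, k=2, ef=3)
```

In this toy run the upper layers move the entry point from node 0 to node 4 in two hops, so layer 0 only has to explore the neighborhood around the query. A larger ef widens that final exploration, which is the recall-versus-latency trade-off behind the ef_search parameter.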
Memory Usage
HNSW indexes use approximately:
memory approx n * (d * 4 + m * 8) bytes
Where:
- n = number of vectors
- d = vector dimension
- m = max connections parameter
Example: 1M vectors × 768 dimensions × m=16:
1M × (768 × 4 + 16 × 8) = ~3.2 GB
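The estimate is easy to script when comparing configurations. The helper below simply encodes the formula above; the function name is hypothetical, and reading the constants as 4 bytes per vector component and 8 bytes per connection is an assumption rather than something this guide states.

```python
def estimate_hnsw_memory_bytes(n, d, m):
    """Approximate index size in bytes: n * (d * 4 + m * 8)."""
    return n * (d * 4 + m * 8)

# The worked example: 1M vectors, 768 dimensions, m = 16.
estimate = estimate_hnsw_memory_bytes(1_000_000, 768, 16)
print(f"{estimate / 1e9:.1f} GB")  # 3.2 GB

# Same vector count with 384-dimensional embeddings, for comparison.
smaller = estimate_hnsw_memory_bytes(1_000_000, 384, 16)
```

The vector data dominates the total: at 768 dimensions and m = 16, the connection term accounts for only 128 of every 3,200 bytes per vector.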
Tombstones and Compaction
When vectors are deleted, they're marked with a tombstone rather than removed immediately:
.index stats my_idx
Vectors: 10000
Tombstones: 500 ← Deleted entries not yet cleaned up
Automatic Compaction
When the tombstone ratio exceeds 30%, the index is automatically rebuilt inline during the delete operation.
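A toy Python model of this behavior may help. It is a sketch, not InputLayer's implementation: the ToyIndex class is hypothetical, and it assumes the ratio is computed as tombstones divided by live entries plus tombstones, which this guide does not specify.

```python
class ToyIndex:
    """Tracks live and tombstoned ids and rebuilds once the ratio passes 30%."""

    COMPACTION_THRESHOLD = 0.30

    def __init__(self, ids):
        self.live = set(ids)
        self.tombstones = set()
        self.rebuilds = 0

    def tombstone_ratio(self):
        total = len(self.live) + len(self.tombstones)
        return len(self.tombstones) / total if total else 0.0

    def delete(self, vec_id):
        if vec_id not in self.live:
            return
        # Mark the entry as deleted instead of removing it from the graph.
        self.live.remove(vec_id)
        self.tombstones.add(vec_id)
        # Compaction runs inline, as part of the delete that crosses the threshold.
        if self.tombstone_ratio() > self.COMPACTION_THRESHOLD:
            self.rebuild()

    def rebuild(self):
        # A rebuild keeps only live entries, so all tombstones disappear.
        self.tombstones.clear()
        self.rebuilds += 1

index = ToyIndex(range(10))
for vec_id in (0, 1, 2):
    index.delete(vec_id)        # ratio reaches exactly 30%: no rebuild yet
pending = len(index.tombstones)
index.delete(3)                 # ratio would be 40%: rebuild is triggered
```

Because the rebuild happens inside the delete call, the delete that crosses the threshold can take noticeably longer than the ones before it.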
Manual Compaction
Force a rebuild to remove tombstones:
.index rebuild my_idx
Best Practices
1. Choose the Right Metric
| Embedding Type | Recommended Metric |
|---|---|
| OpenAI embeddings | cosine |
| BERT/Sentence-BERT | cosine |
| Image embeddings (CLIP) | cosine |
| Raw feature vectors | euclidean |
| Pre-normalized vectors | dot (fastest) |
2. Tune Parameters for Your Use Case
Discovery/Exploration (higher recall matters):
.index create my_idx on docs(emb) m 32 ef_search 100
Production/Speed (latency matters):
.index create my_idx on docs(emb) m 16 ef_search 30
3. Monitor Index Health
Regularly check:
.index stats my_idx
Rebuild if:
- Tombstone ratio > 30%
- Search quality degrades
- After bulk insertions
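If you capture the output of .index stats as text, a small script can apply the first of these checks for you. The Python sketch below assumes the "Key: value" layout shown in View Index Statistics and computes the ratio as Tombstones divided by Vectors; both function names are hypothetical and no InputLayer client API is implied.

```python
def parse_stats(text):
    """Turn 'Key: value' lines into a dictionary of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

def rebuild_reasons(stats, max_ratio=0.30):
    """Return the list of reasons a rebuild looks warranted (empty if healthy)."""
    reasons = []
    if stats.get("Valid") != "yes":
        reasons.append("index is invalid")
    vectors = int(stats.get("Vectors", 0))
    tombstones = int(stats.get("Tombstones", 0))
    if vectors and tombstones / vectors > max_ratio:
        reasons.append("tombstone ratio above threshold")
    return reasons

sample = """Index: doc_emb_idx
Vectors: 10000
Dimension: 768
Valid: yes
Tombstones: 0
Built: 2024-01-15 10:30:00"""

healthy = rebuild_reasons(parse_stats(sample))
degraded = rebuild_reasons({"Valid": "no", "Vectors": "10000", "Tombstones": "4000"})
```

Search quality and recent bulk insertions are not visible in the stats output, so those two conditions still need to be tracked separately.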
4. Create Indexes Before Bulk Load
For large initial loads, create the index first:
.index create my_idx on docs(emb)
// Then bulk insert
+docs[(1, "...", [0.1, ]),
(2, "...", [0.2, ]),
]
5. Use Appropriate Vector Dimensions
Common dimensions:
- OpenAI text-embedding-3-small: 1536
- Cohere embed-english-v3: 1024
- all-MiniLM-L6-v2: 384
- CLIP ViT-B/32: 512
Higher dimensions = more memory, slower search.
Troubleshooting
Index Shows "Invalid"
Cause: Base relation was modified.
Solution:
.index rebuild my_idx
Search Returns No Results
Possible causes:
- Index not yet built
- Query vector dimension mismatch
- No vectors in relation
Debug:
.index stats my_idx
? documents(Id, _, V) // Check if data exists
Poor Search Quality
Causes:
- Wrong distance metric for embedding type
- ef_search too low
- Many tombstones
Solutions:
// Check metric matches embedding type
.index stats my_idx
// Increase ef_search
.index create my_idx on docs(emb) ef_search 100
// Rebuild to remove tombstones
.index rebuild my_idx
High Memory Usage
Solutions:
- Reduce the m parameter (trades recall for memory)
- Use vector quantization (see Vectors Guide)
- Consider an embedding model with fewer dimensions
Next Steps
- Vector Search Tutorial - Distance functions and semantic search
- Configuration Guide - Index persistence settings