Semantic Search in AILANG
AILANG provides a two-tier semantic search system for building AI agents with memory and retrieval capabilities:
- Tier 1: SimHash - Fast, deterministic text similarity (no external dependencies)
- Tier 2: Neural Embeddings - True semantic similarity via Ollama + EmbeddingGemma
This guide walks you through setting up both tiers, from basic to advanced configurations.
Quick Start (5 minutes)
Tier 1: SimHash-based Search (No Setup Required)
SimHash provides immediate semantic search without any external dependencies:
module examples/simhash_search
func main() -> string ! {IO, SharedIndex} {
-- Store beliefs with SimHash
let _ = _sharedindex_upsert("beliefs", "b1", _simhash("The sky is blue"), 1, 1000)
in let _ = _sharedindex_upsert("beliefs", "b2", _simhash("Stars shine at night"), 1, 2000)
in let _ = _sharedindex_upsert("beliefs", "b3", _simhash("Clouds contain water"), 1, 3000)
-- Find similar to "What color is the sky?"
in let query_hash = _simhash("What color is the sky?")
in let results = _sharedindex_find_simhash("beliefs", query_hash, 3, 100, true)
in "Found " ++ intToStr(_array_length(results)) ++ " results"
}
Run it:
ailang run --caps IO,SharedIndex --entry main examples/simhash_search.ail
How SimHash Works:
- Converts text to a 64-bit fingerprint
- Similar texts have similar fingerprints
- Similarity = 1.0 - (hamming_distance / 64)
- Fast, deterministic, no external dependencies
- Best for: lexical similarity, near-duplicate detection
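To make the similarity formula concrete, here is a minimal sketch (the module name and sentences are illustrative, not part of the shipped examples) that hashes two near-duplicate sentences and prints their Hamming distance; a distance of 6 bits, for example, works out to a similarity of 1.0 - 6/64 ≈ 0.91:
module examples/simhash_distance
import std/string (intToStr)
func main() -> string ! {IO} {
-- Near-duplicate sentences should produce fingerprints that differ in only a few bits
let a = _simhash("The sky is blue")
in let b = _simhash("The sky was blue")
in let dist = _hamming_distance(a, b)
-- similarity = 1.0 - (dist / 64), so a small distance means a high score
in let _ = _io_println("Hamming distance: " ++ intToStr(dist))
in "Done"
}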
Setting Up Neural Embeddings (Tier 2)
For true semantic understanding ("cat" matches "kitten"), use neural embeddings.
Step 1: Install Ollama
Ollama runs AI models locally on your machine.
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from ollama.com
Step 2: Download an Embedding Model
We recommend EmbeddingGemma as the default:
ollama pull embeddinggemma
Model specs:
- Size: 622MB (or 239MB for quantized q4_0)
- Dimensions: 768
- Parameters: 300M
- Context: 2K tokens (~8000 chars)
- Languages: 100+ (multilingual)
- Latency: ~100ms per embedding (warm)
Quantized variants (same quality, less memory):
| Variant | Size | Use Case |
|---|---|---|
| `embeddinggemma:latest` | 622MB | Best quality, default |
| `embeddinggemma:300m-qat-q8_0` | 338MB | Good balance |
| `embeddinggemma:300m-qat-q4_0` | 239MB | Memory-constrained |
# Pull quantized version for lower memory usage
ollama pull embeddinggemma:300m-qat-q4_0
Benchmark Results (v0.6.0):
All variants have similar warm inference speed (~100ms); the difference is memory footprint:
| Model | Cold Start | Warm | Memory |
|---|---|---|---|
| q4_0 | ~200ms | ~94ms | 239MB |
| q8_0 | ~200ms | ~101ms | 338MB |
| latest | ~280ms | ~102ms | 622MB |
Step 3: Start Ollama
ollama serve
Ollama runs on http://localhost:11434 by default.
Step 4: Verify Setup
Test with curl:
curl http://localhost:11434/api/embed -d '{
"model": "embeddinggemma",
"input": "Hello world"
}'
Or test with AILANG:
module test_embedding
func main() -> string ! {IO} {
let emb = _ollama_embed("embeddinggemma", "Hello world")
in "Got " ++ intToStr(_array_length(emb)) ++ "-dimensional embedding"
}
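Run it the same way as the other examples, e.g. `ailang run --caps IO --entry main test_embedding.ail` (adjust the path to wherever you saved the module).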
Neural Semantic Search Example
Full example using embeddings for semantic search:
module examples/neural_search
import std/string (floatToStr, intToStr)
-- Store a belief with its embedding
func store_belief(ns: string, key: string, text: string, ver: int, ts: int) -> unit ! {IO, SharedIndex} {
let embedding = _ollama_embed("embeddinggemma", text)
in let simhash = _simhash(text)
in _sharedindex_upsert_emb(ns, key, simhash, embedding, ver, ts)
}
-- Search by semantic similarity
func search(ns: string, query: string, top_k: int) -> list[{key: string, score: float, version: int, timestamp: int}] ! {IO, SharedIndex} {
let query_emb = _ollama_embed("embeddinggemma", query)
in _sharedindex_find_by_embedding(ns, query_emb, top_k, 100, true)
}
func main() -> string ! {IO, SharedIndex} {
-- Store beliefs
let _ = store_belief("beliefs", "b1", "The sky is blue during daytime", 1, 1000)
in let _ = store_belief("beliefs", "b2", "Stars are visible at night", 1, 2000)
in let _ = store_belief("beliefs", "b3", "Clouds are made of water vapor", 1, 3000)
in let _ = store_belief("beliefs", "b4", "Rain falls from clouds", 1, 4000)
in let _ = store_belief("beliefs", "b5", "The ocean looks blue", 1, 5000)
-- Query: "What color is the sky?"
in let results = search("beliefs", "What color is the sky?", 3)
in let _ = print_results(results)
in "Demo complete"
}
func print_results(results: list[{key: string, score: float, version: int, timestamp: int}]) -> unit ! {IO} {
match results {
[] => _io_println("No results"),
[first, ...rest] =>
let _ = _io_println(first.key ++ " (score: " ++ floatToStr(first.score) ++ ")")
in print_results(rest)
}
}
Run:
ailang run --caps IO,SharedIndex --entry main examples/neural_search.ail
Recommended Embedding Models
General Purpose
| Model | Dims | Size | Best For |
|---|---|---|---|
| `embeddinggemma` | 768 | 622MB | Default choice - great quality, multilingual |
| `nomic-embed-text` | 768 | 274MB | Smaller, English-focused |
| `mxbai-embed-large` | 1024 | 669MB | Higher quality, larger dims |
Specialized
| Model | Dims | Size | Best For |
|---|---|---|---|
| `all-minilm` | 384 | 46MB | Lightweight, fast, low memory |
| `snowflake-arctic-embed` | 1024 | 669MB | Retrieval-focused |
| `bge-large` | 1024 | 1.3GB | Chinese + English, high quality |
Pull and Try Different Models
# Try different models
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull all-minilm
-- Use a pulled model in AILANG
let emb = _ollama_embed("nomic-embed-text", "Hello world")
Configuration Options
Environment Variables
# Custom Ollama endpoint
export OLLAMA_HOST=http://localhost:11434
# For remote Ollama server
export OLLAMA_HOST=http://your-server:11434
# Override embedding model (for CLI commands like `ailang docs search`)
export AILANG_OLLAMA_MODEL=embeddinggemma:300m-qat-q4_0
Config File (~/.ailang/config.yaml)
For persistent configuration:
embeddings:
  provider: ollama            # or "none" to disable
  ollama:
    model: embeddinggemma     # or embeddinggemma:300m-qat-q4_0
    endpoint: http://localhost:11434
    timeout: 30s

search:
  default_mode: simhash
  simhash_threshold: 0.70
  neural_threshold: 0.75
Timeout Configuration
The `_ollama_embed` builtin uses a 30-second timeout by default. For large batches or slow networks, keep Ollama running locally so individual calls stay well under that limit.
Performance Tips
0. Large Documents Are Automatically Chunked
AILANG automatically chunks documents larger than 6000 characters for embedding. This is important because:
- EmbeddingGemma has a 2K context window (~8000 chars)
- Large documents are split at natural boundaries (headers, paragraphs, sentences)
- Chunk embeddings are averaged to produce a single document embedding
Chunking priority order:
- Markdown headers (`##`, `#`)
- Code blocks (triple-backtick fences)
- Paragraphs (`\n\n`)
- List items (`\n-`)
- Sentences (`.`)
- Line breaks (`\n`)
- Words (spaces)
This means a 50KB design document will be split into ~8 chunks, each embedded separately, then averaged.
1. Batch Your Embeddings
Generate embeddings once at insert time, not at query time:
-- GOOD: Store the embedding together with the data at insert time
func store_item(key: string, text: string, ts: int) -> unit ! {IO, SharedIndex} {
let emb = _ollama_embed("embeddinggemma", text)
in _sharedindex_upsert_emb("items", key, _simhash(text), emb, 1, ts)
}
2. Use SimHash for Pre-filtering
For large datasets, use SimHash as a fast first-pass filter:
-- 1. Fast SimHash pre-filter (no Ollama call)
let candidates = _sharedindex_find_simhash("items", _simhash(query), 100, 1000, true)
-- 2. Re-rank top candidates with embeddings (few Ollama calls)
-- ... load full items and compute embedding similarity
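One way to wire the two passes together is sketched below, assuming you only want to pay for an Ollama call when the SimHash pre-filter actually finds candidates. It is not a full re-rank (that would load the candidate items themselves), but it shows the shape of the idea; the function name and limits are illustrative:
func search_if_candidates(ns: string, query: string, top_k: int) -> list[{key: string, score: float, version: int, timestamp: int}] ! {IO, SharedIndex} {
-- Tier 1: cheap SimHash pass, no Ollama call
let candidates = _sharedindex_find_simhash(ns, _simhash(query), 100, 1000, true)
in match candidates {
-- Nothing even lexically close: skip the embedding call entirely
[] => [],
-- Otherwise pay for one embedding and let cosine similarity do the ranking
[first, ...rest] =>
let query_emb = _ollama_embed("embeddinggemma", query)
in _sharedindex_find_by_embedding(ns, query_emb, top_k, 100, true)
}
}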
3. Keep Ollama Running
Don't restart Ollama between queries - model loading is the slowest part:
# Start once, keep running
ollama serve &
# Models stay loaded in memory
4. Use max_scan to Bound Search Time
-- Limit to scanning 500 entries max
let results = _sharedindex_find_by_embedding("items", query_emb, 10, 500, true)
SimHash vs Embeddings: When to Use Each
| Feature | SimHash (Tier 1) | Embeddings (Tier 2) |
|---|---|---|
| Similarity type | Lexical (word overlap) | Semantic (meaning) |
| Setup | None | Ollama + model |
| Speed | ~1μs | ~160ms |
| Memory | 8 bytes per entry | ~6KB per entry (768 floats) |
| Determinism | Perfect | Model-dependent |
| "cat" ↔ "kitten" | Low score | High score |
| Best for | Near-duplicates, fast search | Semantic understanding |
Use SimHash when:
- You need instant results
- No external dependencies allowed
- Searching for near-duplicates
- Memory is constrained
Use Embeddings when:
- Semantic understanding matters
- "What color is the sky?" should match "The sky appears blue"
- Building conversational agents
- Quality > speed
Hybrid Search Pattern
Combine both tiers for best results:
func hybrid_search(ns: string, query: string, top_k: int) -> list[{key: string, score: float, version: int, timestamp: int}] ! {IO, SharedIndex} {
-- Tier 1: Fast SimHash pre-filter (get top 100 candidates)
let simhash_results = _sharedindex_find_simhash(ns, _simhash(query), 100, 1000, true)
-- Tier 2: Re-rank with embeddings (slower, more accurate)
in let query_emb = _ollama_embed("embeddinggemma", query)
in let emb_results = _sharedindex_find_by_embedding(ns, query_emb, top_k, 100, true)
in emb_results -- Return embedding-ranked results
}
Troubleshooting
"Connection refused" error
Ollama isn't running:
ollama serve
"Model not found" error
Pull the model first:
ollama pull embeddinggemma
Slow embeddings
- Check that the model is loaded: `ollama list`
- Make sure Ollama is running locally, not over the network
- Use a smaller model: `all-minilm` (46MB)
Out of memory
Use a smaller model:
ollama pull all-minilm # Only 46MB
Or increase Ollama memory:
# macOS
export OLLAMA_MAX_VRAM=4096
# Start with more memory
ollama serve
API Reference
SimHash Builtins
-- Generate 64-bit SimHash
_simhash(text: string) -> int
-- Hamming distance between two hashes
_hamming_distance(a: int, b: int) -> int
-- Store with SimHash
_sharedindex_upsert(ns: string, key: string, simhash: int, ver: int, ts: int) -> unit ! {SharedIndex}
-- Find by SimHash similarity
_sharedindex_find_simhash(ns: string, query_hash: int, top_k: int, max_scan: int, deterministic: bool)
-> list[{key: string, score: float, version: int, timestamp: int}] ! {SharedIndex}
Embedding Builtins
-- Generate embedding vector (requires Ollama)
_ollama_embed(model: string, text: string) -> list[float] ! {IO}
-- Store with SimHash + embedding
_sharedindex_upsert_emb(ns: string, key: string, simhash: int, embedding: list[float], ver: int, ts: int)
-> unit ! {SharedIndex}
-- Find by embedding similarity (cosine)
_sharedindex_find_by_embedding(ns: string, query_emb: list[float], top_k: int, max_scan: int, deterministic: bool)
-> list[{key: string, score: float, version: int, timestamp: int}] ! {SharedIndex}
Utility Builtins
-- Count entries in namespace
_sharedindex_entry_count(ns: string) -> int ! {SharedIndex}
-- List all namespaces
_sharedindex_namespaces(()) -> list[string] ! {SharedIndex}
-- Delete entry
_sharedindex_delete(ns: string, key: string) -> unit ! {SharedIndex}
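As a small illustration (the namespace name is just an example), the utility builtins compose with the string helpers used earlier to report on the state of the index:
func index_stats() -> string ! {SharedIndex} {
-- Count entries in one namespace and see how many namespaces exist overall
let count = _sharedindex_entry_count("beliefs")
in let namespaces = _sharedindex_namespaces(())
in "beliefs: " ++ intToStr(count) ++ " entries, " ++ intToStr(_array_length(namespaces)) ++ " namespaces total"
}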
CLI Semantic Search Commands
Document Search (ailang docs search)
Search design docs and markdown files semantically:
# Fast SimHash search (instant, no Ollama needed)
ailang docs search "parser error handling"
# Neural search with Ollama embeddings
ailang docs search "parser error handling" --neural
# Search specific directory
ailang docs search --path design_docs/planned "type inference"
# Filter by subdirectory
ailang docs search --stream implemented "monomorphization"
# Limit results
ailang docs search --limit 5 "error handling"
# JSON output for scripting
ailang docs search --json "semantic search"
# Cache management
ailang docs search --cache-info # Show cache stats
ailang docs search --cleanup # Remove orphaned entries
ailang docs search --rebuild # Rebuild all embeddings
Neural search flags:
- `--neural` - Enable neural embeddings via Ollama
- `--neural-candidates N` - Number of SimHash candidates to re-rank (default: 10)
Performance with neural search (v0.6.0 benchmarks):
- First run: ~4-6s per document (embedding with chunking for large docs)
- Full corpus (29 design docs): ~128s cold start
- Subsequent runs: <1s (using cached embeddings)
- Cache is per-corpus, per-model, with content hash staleness detection
Messaging System Semantic Search
The AILANG CLI also provides semantic search for the messaging system:
# Search messages by semantic content
ailang messages search "parser error handling"
# Use neural search (requires Ollama)
ailang messages search "type inference bugs" --neural
# Find similar messages
ailang messages list --similar-to MSG_ID
# Deduplicate messages
ailang messages dedupe --apply
Configure neural search in ~/.ailang/config.yaml:
embeddings:
  provider: ollama
  ollama:
    model: nomic-embed-text
    endpoint: http://localhost:11434
    timeout: 30s

search:
  default_mode: simhash
  simhash_threshold: 0.70
  neural_threshold: 0.75
See Agent Messaging - Semantic Search for full documentation.
Next Steps
- Semantic Caching vs Vector DBs - When to use which tool
- AI Effect Guide - Using the AI effect for LLM calls
- Agent Messaging - Building multi-agent systems (includes semantic search)
- Examples - More working examples