Vector-Based Search
🎓 Learn how it works with the Amazon S3 Vectors with Spice engineering blog post.
Spice provides vector-based search, which ranks results by similarity between embeddings rather than by exact keyword matches.
Embedding Models
Spice supports two types of embedding providers:
- Local embedding models, e.g., sentence-transformers/all-MiniLM-L6-v2.
- Remote embedding services, e.g., the OpenAI Embeddings API.
Embedding models are defined in the spicepod.yaml file as top-level components.
embeddings:
  - name: openai_embeddings
    from: openai
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
  - name: local_embedding_model
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Configuring Datasets for Embeddings
To enable vector search, specify embeddings for the dataset columns in spicepod.yaml:
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    params:
      github_token: ${ secrets:GITHUB_TOKEN }
    acceleration:
      enabled: true
    columns:
      - name: body
        embeddings:
          - from: local_embedding_model
This configuration instructs Spice to create embeddings from the body column, enabling similarity searches on body content.
Performing a Vector Search
Execute similarity searches using Spice's HTTP API:
curl -X POST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["spiceai.issues"],
    "text": "cutting edge AI",
    "where": "author=\"jeadie\"",
    "additional_columns": ["title", "state"],
    "limit": 2
  }'
For detailed API documentation, see the Search API Reference.
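The same request can be issued from code. The following is a minimal Python sketch of the curl call above, using only the standard library. The helper names `build_search_payload` and `search` are hypothetical (not part of Spice), and treating `where`, `additional_columns`, and `limit` as optional fields is an assumption; check the Search API Reference for the authoritative request schema.

```python
import json
import urllib.request

# Assumption: the local endpoint used by the curl example above.
SEARCH_URL = "http://localhost:8090/v1/search"


def build_search_payload(text, datasets, where=None, additional_columns=None, limit=None):
    """Build the JSON body for POST /v1/search, leaving out unset optional fields."""
    payload = {"datasets": list(datasets), "text": text}
    if where is not None:
        payload["where"] = where
    if additional_columns:
        payload["additional_columns"] = list(additional_columns)
    if limit is not None:
        payload["limit"] = limit
    return payload


def search(payload, url=SEARCH_URL, timeout=10.0):
    """POST the payload and return the decoded JSON response.

    Requires a running Spice runtime, so it is not called in this sketch.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Same parameters as the curl example above.
payload = build_search_payload(
    text="cutting edge AI",
    datasets=["spiceai.issues"],
    where='author="jeadie"',
    additional_columns=["title", "state"],
    limit=2,
)
print(json.dumps(payload, indent=2))
```

With a runtime listening locally, `search(payload)` would return the parsed response body.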
Retrieving Full Documents
If the dataset uses chunking, Spice returns the relevant chunks rather than whole documents. To retrieve the entire document, include the embedded column (body in this example) in additional_columns:
curl -X POST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["spiceai.issues"],
    "text": "cutting edge AI",
    "where": "array_has(assignees, \"jeadie\")",
    "additional_columns": ["title", "state", "body"],
    "limit": 2
  }'
Response:
{
  "matches": [
    {
      "value": "implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "Improve scalar UDF array_distance",
        "state": "Closed",
        "body": "## Overview\n- Previous PR https://github.com/spiceai/spiceai/pull/1601 implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])\narray_distance(FixedSizeList[Float32], List[Float64])\n```\n\n### Changes\n - Improve using Native arrow function, e.g. `arrow_cast`, [`sub_checked`](https://arrow.apache.org/rust/arrow/array/trait.ArrowNativeTypeOp.html#tymethod.sub_checked)\n - Support a greater range of array types and numeric types\n - Possibly create a sub operator and UDF, e.g.\n\t- `FixedSizeList[Float32] - FixedSizeList[Float32]`\n\t- `Norm(FixedSizeList[Float32])`"
      },
      "score": 0.66
    },
    {
      "value": "est external tools being returned for toolusing models",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "Automatic NSQL retries in /v1/nsql ",
        "state": "Open",
        "body": "To mimic our ability for LLMs to repeatedly retry tools based on errors, the `/v1/nsql`, which does not use this same paradigm, should retry internally.\n\nIf possible, improve the structured output to increase the likelihood of valid SQL in the response. Currently we just inforce JSON like this\n```json\n{\n  \"sql\": \"SELECT ...\"\n}\n```"
      },
      "score": 0.52
    }
  ],
  "duration_ms": 45
}
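As a rough illustration of how a client might consume this response, the Python sketch below pulls the score, dataset, and title out of each match. `summarize_matches` is a hypothetical helper, and the sample is a trimmed copy of the response above with the `body` field left out and one value shortened.

```python
def summarize_matches(response):
    """Return (score, dataset, title) tuples from a search response, best match first."""
    rows = []
    for match in response.get("matches", []):
        metadata = match.get("metadata", {})  # holds the additional_columns values
        rows.append((match["score"], match["dataset"], metadata.get("title")))
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Trimmed sample with the same shape as the response above.
sample = {
    "matches": [
        {
            "value": "est external tools being returned for toolusing models",
            "dataset": "spiceai.issues",
            "metadata": {"title": "Automatic NSQL retries in /v1/nsql ", "state": "Open"},
            "score": 0.52,
        },
        {
            "value": "implements a scalar UDF `array_distance`",
            "dataset": "spiceai.issues",
            "metadata": {"title": "Improve scalar UDF array_distance", "state": "Closed"},
            "score": 0.66,
        },
    ],
    "duration_ms": 45,
}

for score, dataset, title in summarize_matches(sample):
    print(f"{score:.2f}  {dataset}  {title}")
```

Sorting on the client side is a defensive choice here; it does not rely on the order in which the API returns matches.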
SQL UDTF
The embedding index can also be used to perform search in SQL, via a user-defined table function (UDTF).
SELECT id, title, score
FROM vector_search('sales', 'cutting edge AI')
ORDER BY score DESC
LIMIT 5;
SQL Function Signature of vector_search:
vector_search(
  table STRING,          -- Dataset name (required)
  query STRING,          -- Search text (required)
  col STRING,            -- Column name (optional if single embedding column)
  limit INTEGER,         -- Results limit (default: 1000)
  include_score BOOLEAN  -- Include relevance scores (default: TRUE)
)
RETURNS TABLE                -- The original table and:
                             --  - A FLOAT column `score` (if `include_score`).
By default, vector_search retrieves up to 1000 results. To adjust this limit, specify the limit parameter in the function call. When using a specific vector engine, such as s3_vectors, the limit defaults to that of the vector engine.
SELECT id, title, score
FROM vector_search('sales', 'cutting edge AI', 1500)
ORDER BY score DESC;
- The vector_search UDTF does not yet support chunked embedding columns. Chunking support is on the roadmap.
Using Existing Embeddings
Spice supports vector searches on datasets with pre-existing embeddings. Ensure the dataset meets these requirements:
- Column Naming: The embedding column name must be <original_column_name>_embedding.
- Data Types: Embedding columns must use Arrow types:
  - Non-chunked: FixedSizeList[Float32|Float64, N]
  - Chunked: List[FixedSizeList[Float32|Float64, N]]
- Offset Columns: For chunked embeddings, an additional offset column (<column_name>_offsets) is required:
  - Type: List[FixedSizeList[Int32, 2]], indicating chunk boundaries.
Example dataset structure (sales table):
Non-chunked:
sql> describe sales;
+-------------------+-----------------------------------------+-------------+
| column_name       | data_type                               | is_nullable |
+-------------------+-----------------------------------------+-------------+
| order_number      | Int64                                   | YES         |
| quantity_ordered  | Int64                                   | YES         |
| price_each        | Float64                                 | YES         |
| order_line_number | Int64                                   | YES         |
| address           | Utf8                                    | YES         |
| address_embedding | FixedSizeList(                          | NO          |
|                   |   Field {                               |             |
|                   |     name: "item",                       |             |
|                   |     data_type: Float32,                 |             |
|                   |     nullable: false,                    |             |
|                   |     dict_id: 0,                         |             |
|                   |     dict_is_ordered: false,             |             |
|                   |     metadata: {}                        |             |
|                   |   },                                    |             |
|                   |   384                                   |             |
|                   | )                                       |             |
+-------------------+-----------------------------------------+-------------+
Chunked:
sql> describe sales;
+-------------------+-----------------------------------------+-------------+
| column_name       | data_type                               | is_nullable |
+-------------------+-----------------------------------------+-------------+
| order_number      | Int64                                   | YES         |
| quantity_ordered  | Int64                                   | YES         |
| price_each        | Float64                                 | YES         |
| order_line_number | Int64                                   | YES         |
| address           | Utf8                                    | YES         |
| address_embedding | List(Field {                            | NO          |
|                   |   name: "item",                         |             |
|                   |   data_type: FixedSizeList(             |             |
|                   |     Field {                             |             |
|                   |       name: "item",                     |             |
|                   |       data_type: Float32,               |             |
|                   |     },                                  |             |
|                   |     384                                 |             |
|                   |   ),                                    |             |
|                   | })                                      |             |
| address_offsets   | List(Field {                            | NO          |
|                   |   name: "item",                         |             |
|                   |   data_type: FixedSizeList(             |             |
|                   |     Field {                             |             |
|                   |       name: "item",                     |             |
|                   |       data_type: Int32,                 |             |
|                   |     },                                  |             |
|                   |     2                                   |             |
|                   |   ),                                    |             |
|                   | })                                      |             |
+-------------------+-----------------------------------------+-------------+
Constraints
- Underlying Column Presence:
  - The underlying column must exist in the table and be of a string Arrow data type.
- Embeddings Column Naming Convention:
  - For each underlying column, the corresponding embeddings column must be named <column_name>_embedding. For example, a customer_reviews table with a review column must have a review_embedding column.
- Embeddings Column Data Type:
  - The embeddings column must have the following Arrow data type when loaded into Spice:
    - FixedSizeList[Float32 or Float64, N], where N is the dimension (size) of the embedding vector. FixedSizeList is used for efficient storage and processing of fixed-size vectors.
    - If the column is chunked, use List[FixedSizeList[Float32 or Float64, N]].
- Offset Column for Chunked Data:
  - If the underlying column is chunked, there must be an additional offset column named <column_name>_offsets with the following Arrow data type:
    - List[FixedSizeList[Int32, 2]], where each element is a pair of integers [start, end] representing the start and end indices of the chunk in the underlying text column. This offset column maps each chunk in the embeddings back to the corresponding segment in the underlying text column.
    - For instance, [[0, 100], [101, 200]] indicates two chunks covering indices 0–100 and 101–200, respectively.
By following these guidelines, you can ensure that your dataset with pre-existing embeddings is fully compatible with the vector search and other embedding functionalities provided by Spice.
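The constraints above can be checked before loading a dataset. The Python sketch below is one possible way to do that, not a Spice API: `check_embedding_columns` is a hypothetical helper, the schema is a plain dict of column name to type string written in this page's notation (not real Arrow objects), and treating Utf8 and LargeUtf8 as the accepted string types is an assumption.

```python
import re

# Type strings follow this page's notation, e.g. "FixedSizeList[Float32, 384]".
_VECTOR = r"FixedSizeList\[(Float32|Float64), (\d+)\]"
NON_CHUNKED = re.compile(rf"^{_VECTOR}$")
CHUNKED = re.compile(rf"^List\[{_VECTOR}\]$")
OFFSETS_TYPE = "List[FixedSizeList[Int32, 2]]"
STRING_TYPES = {"Utf8", "LargeUtf8"}  # assumption: which Arrow string types qualify


def check_embedding_columns(schema, column):
    """Return a list of constraint violations for `column` in a {name: type} schema."""
    problems = []
    # Underlying column presence and type.
    if schema.get(column) not in STRING_TYPES:
        problems.append(f"{column}: underlying column must exist and be a string type")
    # Naming convention: <column_name>_embedding.
    embedding_name = f"{column}_embedding"
    embedding_type = schema.get(embedding_name)
    if embedding_type is None:
        problems.append(f"{embedding_name}: missing embedding column")
    elif CHUNKED.match(embedding_type):
        # Chunked embeddings also need <column_name>_offsets.
        offsets_name = f"{column}_offsets"
        if schema.get(offsets_name) != OFFSETS_TYPE:
            problems.append(f"{offsets_name}: chunked embeddings need {OFFSETS_TYPE}")
    elif not NON_CHUNKED.match(embedding_type):
        problems.append(f"{embedding_name}: unexpected type {embedding_type}")
    return problems


# A chunked sales table whose offsets column is deliberately misnamed.
sales = {
    "address": "Utf8",
    "address_embedding": "List[FixedSizeList[Float32, 384]]",
    "address_offset": "List[FixedSizeList[Int32, 2]]",
}
print(check_embedding_columns(sales, "address"))
```

An empty list means the column satisfies the naming and type rules as modeled here; the misnamed column in the example is reported as a missing address_offsets column.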
Example
A table sales with an address column and corresponding embedding column(s).
