- The first array for the indices contains the indices of the non-zero dimensions.
- The second array for values contains the floating point values for the non-zero dimensions.
- Information Retrieval and Text Analysis: By representing documents as sparse vectors, where each token/word corresponds to a dimension in a high-dimensional vocabulary, and setting the values from token frequencies in the document or weighting them with inverse document frequencies to favor rare terms, you can build complex search pipelines.
- Recommender Systems: By representing user interactions, preferences, ratings, or purchases as sparse vectors, you can identify relevant recommendations and personalize content delivery.
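The two-array representation above can be sketched in a few lines of Python; the `to_sparse` helper and the toy vocabulary are hypothetical illustrations, not part of any Upstash SDK:

```python
def to_sparse(token_counts: dict[str, int], vocab: dict[str, int]) -> tuple[list[int], list[float]]:
    """Build the two-array sparse representation from token counts.

    Returns one array of non-zero dimension indices and one array of the
    corresponding floating point values, both of equal size.
    """
    pairs = sorted((vocab[token], float(count)) for token, count in token_counts.items())
    indices = [i for i, _ in pairs]  # non-zero dimension indices
    values = [v for _, v in pairs]   # values for those dimensions
    return indices, values

# Hypothetical toy vocabulary mapping tokens to dimensions.
vocab = {"upstash": 0, "vector": 3, "sparse": 7}
indices, values = to_sparse({"upstash": 2, "sparse": 1}, vocab)
print(indices, values)  # → [0, 7] [2.0, 1.0]
```

Here the values are raw term frequencies; a real pipeline would typically apply a weighting scheme such as BM25, described below.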
## Creating Sparse Vectors
There are various ways to create sparse vectors. You can use BM25 for information retrieval tasks, or models like SPLADE that enhance documents and queries with term weighting and expansion. Upstash gives you full control by allowing you to upsert and query sparse vectors directly. To make embedding easier, Upstash also provides hosted models that let you upsert and query text data; behind the scenes, the text is converted to sparse vectors. To use this feature, create your index with a sparse embedding model.

### BGE-M3 Sparse Vectors
BGE-M3 is a multi-functional, multi-lingual, and multi-granular model widely used for dense indexes. We also provide BGE-M3 as a sparse vector embedder, which outputs sparse vectors from a 250,002-dimensional space. In these sparse vectors, each token is weighted according to the input text, which enhances traditional sparse vectors with contextuality.
### BM25 Sparse Vectors
BM25 is a popular algorithm used in full-text search systems to rank documents by their relevance to a query. It relies on key principles of term frequency, inverse document frequency, and document length normalization, making it well suited for text retrieval tasks.

- Rare terms are important: BM25 gives more weight to words that are less common in the collection of documents. For example, in a search for “Upstash Vector”, the word “Upstash” might be considered more important than “Vector” if it appears less frequently across all documents.
- Repeating a word helps, but only up to a point: BM25 considers how often a word appears in a document, but it limits the benefit of repeating the word too many times. This means mentioning “Upstash” a hundred times won’t make a document overly important compared to one that mentions it just a few times.
- Shorter documents often rank higher: Shorter documents that match the query are usually more relevant. BM25 adjusts for document length so longer documents don’t get unfairly ranked just because they contain more words.
The BM25 score of a document D for a query Q = (q₁, …, qₙ) is:

score(D, Q) = Σᵢ IDF(qᵢ) · f(qᵢ, D) · (k₁ + 1) / (f(qᵢ, D) + k₁ · (1 − b + b · |D| / avg(|D|)))

where:

- f(qᵢ, D) is the frequency of term qᵢ in document D.
- |D| is the length of document D.
- avg(|D|) is the average document length in the collection.
- k₁ is the term frequency saturation parameter.
- b is the length normalization parameter.
- IDF(qᵢ) is the inverse document frequency of term qᵢ.

By default, we use:

- k₁ = 1.2, a widely used value in the absence of advanced optimizations.
- b = 0.75, a widely used value in the absence of advanced optimizations.
- avg(|D|) = 32, chosen by tokenizing the MSMARCO dataset, taking the average document length, and rounding to the nearest power of two.
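To make the saturation and length-normalization behavior concrete, here is a small sketch of the term-frequency part of the formula above, using the default parameter values listed; the `bm25_tf` function name is illustrative:

```python
def bm25_tf(tf: float, doc_len: float, k1: float = 1.2, b: float = 0.75, avg_len: float = 32.0) -> float:
    """Term-frequency component of BM25: grows with tf but saturates below
    k1 + 1, and is normalized by document length relative to the average."""
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# Repeating a word helps, but with diminishing returns:
print(bm25_tf(1, 32))    # → 1.0
print(bm25_tf(10, 32))   # ≈ 1.96
print(bm25_tf(100, 32))  # ≈ 2.17, still below the k1 + 1 = 2.2 ceiling

# For the same term frequency, a shorter document scores higher:
print(bm25_tf(2, 16) > bm25_tf(2, 64))  # → True
```

Multiplying this component by IDF(qᵢ) and summing over the query terms gives the full score.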
As for IDF(qᵢ), we maintain that information per token in the vector database itself. You can use it by providing it as the weighting strategy for your queries, so that you don’t have to weight the query terms yourself.
## Using Sparse Indexes
### Upserting Sparse Vectors
You can upsert sparse vectors into Upstash Vector indexes in two different ways.

#### Upserting Sparse Vectors
You can upsert sparse vectors by representing them as two arrays of equal size: one signed 32-bit integer array for the non-zero dimension indices, and one 32-bit float array for the values. Each sparse vector can have at most 1,000 non-zero valued dimensions.
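These constraints can be captured in a small client-side pre-upsert check; the `validate_sparse` helper below is a hypothetical sketch, not part of the Upstash SDK:

```python
def validate_sparse(indices: list[int], values: list[float]) -> None:
    """Sanity-check the two-array sparse representation before upserting."""
    if len(indices) != len(values):
        raise ValueError("indices and values must be arrays of equal size")
    if len(indices) > 1_000:
        raise ValueError("at most 1,000 non-zero valued dimensions are allowed")
    for i in indices:
        if not (-2**31 <= i < 2**31):  # must fit in a signed 32-bit integer
            raise ValueError(f"index {i} does not fit in a signed 32-bit integer")

validate_sparse([0, 7, 42], [0.5, 1.0, 0.25])  # passes silently
```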
#### Upserting Text Data
If you created the sparse index with an Upstash-hosted sparse embedding model, you can upsert text data, and Upstash will embed it behind the scenes.

### Querying Sparse Vectors
Similar to upserts, you can query sparse vectors in two different ways.

#### Querying with Sparse Vectors
You can query sparse vectors by representing the sparse query vector as two arrays of equal size: one signed 32-bit integer array for the non-zero dimension indices, and one 32-bit float array for the values. We use the inner product similarity metric when calculating similarity scores, considering only the matching non-zero valued dimension indices between the query vector and the indexed vectors.

#### Querying with Text Data
If you created the sparse index with an Upstash-hosted sparse embedding model, you can query with text data, and Upstash will embed it behind the scenes before performing the actual query.

### Weighting Query Values
For algorithms like BM25, it is important to take into account the inverse document frequencies that make matching rare terms more important. Maintaining that information yourself can be tricky, so Upstash Vector provides it out of the box. To make use of IDF in your queries, pass it as the weighting strategy. Since this is mainly meant to be used with BM25 models, the IDF is defined as:

IDF(qᵢ) = log((N − n(qᵢ) + 0.5) / (n(qᵢ) + 0.5) + 1)

where:

- N is the total number of documents in the collection.
- n(qᵢ) is the number of documents containing term qᵢ.
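Putting the pieces together, here is a sketch of how IDF weighting interacts with the matching-dimensions inner product described earlier. It assumes the common BM25 IDF variant log((N − n(qᵢ) + 0.5) / (n(qᵢ) + 0.5) + 1); in practice Upstash applies this weighting server-side when you select IDF as the weighting strategy, so the helpers here are purely illustrative:

```python
import math

def idf(total_docs: int, docs_with_term: int) -> float:
    """BM25-style inverse document frequency: rarer terms get larger weights."""
    return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)

def sparse_dot(q_indices, q_values, d_indices, d_values):
    """Inner product over the matching non-zero valued dimensions only."""
    doc = dict(zip(d_indices, d_values))
    return sum(v * doc[i] for i, v in zip(q_indices, q_values) if i in doc)

# In a 1,000-document collection, a rare term (dimension 7) outweighs
# a common term (dimension 0) when both match the indexed vector.
weights = [idf(1_000, 900), idf(1_000, 3)]  # query weights for dimensions [0, 7]
score = sparse_dot([0, 7], weights, [0, 7, 42], [1.0, 1.0, 2.0])
print(score)  # dominated by the rare term's contribution
```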