Hybrid search with QDrant, OpenAI, and re-rank algorithms

As an introduction to this article, please read my previous notes:

  1. IR brief theory

Sparse vectors have a large number of dimensions, where only a small portion of values are non-zero. When used for keyword search, each sparse vector represents a document; the dimensions represent words from a dictionary, and the values represent the importance (e.g. by BM25) of these words in the document.

To convert your text corpus to sparse vectors, you can use BM25 (more common) or SPLADE.
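
If you prefer not to implement the conversion yourself, a library such as fastembed can produce sparse vectors directly. The sketch below is an assumption-laden illustration: it assumes the fastembed package and its SPLADE model "prithivida/Splade_PP_en_v1" are available, and the two-document corpus is made up for the example.

# Minimal sketch with fastembed (assumed installed); any sparse model it supports will do
from fastembed import SparseTextEmbedding

corpus = [
    "Sparse vectors work well for rare, specialized terms.",
    "Dense vectors capture general semantic similarity.",
]

sparse_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
for doc, emb in zip(corpus, sparse_model.embed(corpus)):
    # emb.indices are the non-zero dimensions, emb.values their weights
    print(doc[:40], emb.indices[:5], emb.values[:5])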

Sparse Vectors shine in domains and scenarios where many rare keywords or specialized terms are present. For example, in the medical domain, many rare terms are not present in the general vocabulary, so general-purpose dense vectors cannot capture the nuances of the domain.

I. Configure QDrant to store sparse vectors.

Since version 1.7, QDrant supports sparse vectors natively. It doesn't matter which tool created the vector: in QDrant, a sparse vector is stored as JSON with two array fields (effectively emulating key-value pairs):

{
   "indices":[
      1012,
      1996,
      25309
   ],
   "values":[
      0.06361289,
      1.0990041,
      0.08670003
   ]
}

The corresponding configuration looks like this:

from qdrant_client import QdrantClient, models

COLLECTION_NAME = "hybrid-search"  # any collection name

client = QdrantClient(url="https://e9d74ef0-b9f2-4b44-b5f0-e22ea1d6fc34.europe-west3-0.gcp.cloud.qdrant.io:6334",
                      api_key="...",
                      prefer_grpc=True)
# client = QdrantClient("localhost")
client.recreate_collection(
    collection_name=COLLECTION_NAME,
    # for the hybrid (dense + sparse) search later, a dense vector such as "text-dense"
    # (size 3072 for OpenAI text-embedding-3-large) must also be configured here
    vectors_config={},
    sparse_vectors_config={
        "text-sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(
                on_disk=False,
            )
        )
    }
)
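
The collection itself is only an empty shell; before searching, documents must be upserted together with their sparse vectors. A minimal, hypothetical sketch follows (the id, payload, and indices/values are placeholders; in practice they come from compute_sparse_vector() defined in part II):

# Hypothetical example: store one document with its sparse vector
client.upsert(
    collection_name=COLLECTION_NAME,
    points=[
        models.PointStruct(
            id=1,
            payload={"text": "..."},
            vector={
                "text-sparse": models.SparseVector(
                    indices=[1012, 1996, 25309],
                    values=[0.06361289, 1.0990041, 0.08670003],
                )
            },
        )
    ],
)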

II. How to create a sparse vector.

  1. Using transformers, define the tokenizer and model
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_id = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
  2. Now you need torch (ideally with CUDA support), because the tokens are easily manipulated as tensors (prompt below is the input text)
import torch
tokens = tokenizer(prompt, return_tensors="pt")
output = model(**tokens)
  3. Compute a vector from the logits and attention mask using ReLU, log, and max operations.
logits, attention_mask = output.logits, tokens.attention_mask
# SPLADE takes the per-token probability distributions from the MLM head
# and aggregates them into a single distribution called the "Importance Estimation".
# This distribution is the sparse vector, highlighting relevant tokens that may not exist in the original input sequence.
relu_log = torch.log(1 + torch.relu(logits))
weighted_log = relu_log * attention_mask.unsqueeze(-1)
max_val, _ = torch.max(weighted_log, dim=1)
vec = max_val.squeeze()
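
A quick, optional sanity check is to map the non-zero dimensions of vec back to vocabulary tokens and look at the highest weights (a sketch, not part of the pipeline):

# Inspect which tokens SPLADE considers most important for the prompt
idx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}
top_weights, top_ids = torch.topk(vec, k=10)
for weight, token_id in zip(top_weights.tolist(), top_ids.tolist()):
    print(idx2token[token_id], round(weight, 3))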

4. Summarize steps 1-3 into a function

def compute_sparse_vector(prompt, tokenizer, model):
    """
    Computes a vector from logits and attention mask using ReLU, log, and max operations.

    Args:
    logits (torch.Tensor): The logits output from a model.
    attention_mask (torch.Tensor): The attention mask corresponding to the input tokens.

    Returns:
    torch.Tensor: Computed vector.
    """
    tokens = tokenizer(prompt, return_tensors="pt")
    output = model(**tokens)
    logits, attention_mask = output.logits, tokens.attention_mask
    relu_log = torch.log(1 + torch.relu(logits))
    weighted_log = relu_log * attention_mask.unsqueeze(-1)
    max_val, _ = torch.max(weighted_log, dim=1)
    vec = max_val.squeeze()

    return vec, tokens

III. Search with sparse vector

prompt = "הרעב"
query_vec, query_tokens = compute_sparse_vector(prompt)
query_indices = query_vec.nonzero().numpy().flatten()
query_values = query_vec.detach().numpy()[query_indices]

results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=models.NamedSparseVector(
            name="text-sparse",
            vector=models.SparseVector(
                indices=query_indices,
                values=query_values
            )
        ),
    limit=3
)
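
Each element of results is a ScoredPoint, so the hits can be inspected directly (assuming the points were stored with a payload):

for point in results:
    print(point.id, point.score, point.payload)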

That covers sparse vector search; the client.search() call above is where the previous part ended.

Now let's add semantic (dense vector) search to this approach by executing both searches together. Thanks to the search_batch API, this is fairly easy:

search_queries = [
    models.SearchRequest(
        vector=models.NamedVector(
            name="text-dense",
            vector=compute_dense_vector(prompt)
        ),
        limit=10,
        with_payload=True
    ),
    models.SearchRequest(
        vector=models.NamedSparseVector(
            name="text-sparse",
            vector=models.SparseVector(
                indices=query_indices,
                values=query_values,
            ),
        ),
        limit=10,
        with_payload=True
    )

]

and then send this list to search_batch():

result = client.search_batch(
    collection_name=COLLECTION_NAME,
    requests=search_queries
)

The vector passed into the first SearchRequest is an embedding produced by any suitable embedding engine, in particular OpenAI's 'text-embedding-3-large' model. Symmetrically to the compute_sparse_vector() function, add a compute_dense_vector() function that uses the OpenAI API to create embeddings:

from openai import OpenAI

openAIClient = OpenAI(api_key="...")

def compute_dense_vector(prompt, model="text-embedding-3-large"):
    prompt = prompt.replace("\n", " ")
    return openAIClient.embeddings.create(input=prompt, model=model).data[0].embedding

The search_batch() method returns a list of result lists, one per SearchRequest and in the same order, so result[0] holds the dense hits and result[1] the sparse hits.
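
To turn the two rankings into one (the "re-rank" part of the title), a simple and widely used option is Reciprocal Rank Fusion (RRF). The function below is a hand-rolled sketch, not a Qdrant API; k=60 is the constant conventionally used for RRF.

# A minimal Reciprocal Rank Fusion (RRF) sketch, assuming each hit is a
# ScoredPoint and that point ids identify the same document in both lists
def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    points = {}
    for hits in result_lists:
        for rank, hit in enumerate(hits):
            scores[hit.id] = scores.get(hit.id, 0.0) + 1.0 / (k + rank + 1)
            points[hit.id] = hit
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [(points[pid], scores[pid]) for pid in ranked_ids]

dense_hits, sparse_hits = result
for point, rrf_score in reciprocal_rank_fusion([dense_hits, sparse_hits])[:3]:
    print(point.id, round(rrf_score, 4), point.payload)

RRF only looks at ranks, so the dense and sparse scores do not need to be on the same scale, which is exactly the problem with naively summing them.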