QueryProcessor
Processes queries against generated hybrid indexes.
This class provides a method to retrieve tables from hybrid indexes, helping users find the tables relevant to their tasks.
Attributes
- pipe (OpenAI | TextGenerationPipeline): The LLM pipeline for inference.
- embedding_model (OpenAI | SentenceTransformer): The model used for text embeddings.
- stemmer (Stemmer): A stemming tool used for text normalization.
- index_path (str): Path to the directory where indexes are stored.
- vector_index_path (str): Path for vector-based indexing.
- fulltext_index_path (str): Path for full-text search indexing.
__init__
__init__(
    llm: OpenAI | TextGenerationPipeline,
    embed_model: OpenAI | SentenceTransformer,
    index_path: str,
)
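A minimal construction sketch, assuming the class is importable from a module named query_processor; the import path and model names below are placeholders, not taken from the source:

```python
from sentence_transformers import SentenceTransformer
from transformers import pipeline

from query_processor import QueryProcessor  # placeholder import path

# Local LLM pipeline and embedding model; an OpenAI client would also fit the signature.
llm = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")
embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

processor = QueryProcessor(
    llm=llm,
    embed_model=embed_model,
    index_path="indexes",  # directory where hybrid indexes are stored
)
```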
query
query(
    index_name: str,
    queries: str | list[str],
    k: int = 1,
    n: int = 5,
    alpha: float = 0.5,
) -> str
Retrieves tables for the given queries against the index index_name.
Args
- index_name (str): The name of the index to be retrieved against.
- queries (str | list[str]): The query or list of queries to be executed.
- k (int): The number of documents associated with the tables to be retrieved.
- n (int): The multiplicative factor of k used to pool more relevant documents for the hybrid retrieval process.
- alpha (float): The weighting factor between the vector and full-text retrievers within a hybrid index. A lower alpha gives more weight to the vector retriever.
Returns
str: A JSON string representing the result of the process (Response).
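A hedged usage sketch of query, continuing the constructor example above; the index name and query text are placeholders:

```python
import json

response_json = processor.query(
    index_name="sales_db",  # placeholder index name
    queries=["Which tables contain quarterly revenue figures?"],
    k=3,        # documents associated with tables to retrieve
    n=5,        # pool n * k candidates before hybrid fusion
    alpha=0.5,  # balance full-text and vector scores equally
)
result = json.loads(response_json)  # the method returns a JSON string (Response)
print(result)
```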
__get_retrievers
__get_retrievers(index_name: str) -> tuple[Collection, bm25s.BM25]
Gets both the vector and full-text retrievers of the index index_name.
Args
- index_name (str): The name of the hybrid index.
Returns
tuple[Collection, bm25s.BM25]: The vector and full-text retrievers.
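The docs do not spell out how the two retrievers are stored. A sketch under the assumption that the vector side is a persisted ChromaDB collection and the full-text side is a saved bm25s index, one directory per index name:

```python
import os

import bm25s
import chromadb


def get_retrievers(index_name: str, vector_index_path: str, fulltext_index_path: str):
    """Illustrative stand-in for QueryProcessor.__get_retrievers."""
    # Vector retriever: a ChromaDB collection named after the index.
    client = chromadb.PersistentClient(path=vector_index_path)
    collection = client.get_collection(name=index_name)

    # Full-text retriever: a bm25s index saved under a per-index directory.
    bm25_retriever = bm25s.BM25.load(
        os.path.join(fulltext_index_path, index_name), load_corpus=True
    )
    return collection, bm25_retriever
```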
__hybrid_retriever
__hybrid_retriever(
    bm25_retriever: BM25,
    vec_retriever: Collection,
    bm25_res: tuple[ndarray, ndarray],
    vec_res: QueryResult,
    k: int,
    query: str,
    alpha: float,
    query_tokens: Tokenized,
    query_embedding: list[float],
) -> list[tuple[str, float, str]]
Performs hybrid retrieval for a given query by fusing the results of the full-text and vector retrievers.
Args
- bm25_retriever (BM25): The full-text retriever within the hybrid index.
- vec_retriever (Collection): The vector retriever within the hybrid index.
- bm25_res (tuple[ndarray, ndarray]): Retrieval results from the full-text retriever.
- vec_res (QueryResult): Retrieval results from the vector retriever.
- k (int): The number of documents retrieved from both retrievers.
- query (str): The query.
- alpha (float): The weighting factor between the vector and full-text retrievers within a hybrid index. A lower alpha gives more weight to the vector retriever.
- query_tokens (Tokenized): The tokenized query.
- query_embedding (list[float]): The embedding of the query.
Returns
list[tuple[str, float, str]]: The result of the hybrid search.
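The alpha parameter suggests the two result lists are fused with a convex combination of scores. A minimal fusion sketch, assuming per-retriever min-max normalization (the normalization choice is an assumption, not taken from the source):

```python
def fuse_scores(
    bm25_scores: dict[str, float],
    vec_scores: dict[str, float],
    alpha: float,
    k: int,
) -> list[tuple[str, float]]:
    """Combine per-ID scores from both retrievers into one ranking."""

    def minmax(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

    bm25_n, vec_n = minmax(bm25_scores), minmax(vec_scores)
    # Lower alpha gives more weight to the vector retriever, matching the docs above.
    fused = {
        doc_id: alpha * bm25_n.get(doc_id, 0.0) + (1.0 - alpha) * vec_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(vec_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]
```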
__process_nodes_bm25
__process_nodes_bm25(
    items: tuple[ndarray, ndarray],
    missing_ids: list[str],
    dictionary_id_bm25: dict[str, int],
    bm25_retriever: BM25,
    query_tokens: Tokenized,
)
Processes the retrieval results of the full-text retriever for the purpose of hybrid search, augmenting the results with the missing IDs of documents retrieved from the vector index.
Args
- items (tuple[ndarray, ndarray]): Retrieval results from the full-text retriever.
- missing_ids (list[str]): The IDs present in the retrieval results of the vector retriever but not in those of the full-text retriever.
- dictionary_id_bm25 (dict[str, int]): The table-document associations within the full-text retriever.
- bm25_retriever (BM25): The full-text retriever.
- query_tokens (Tokenized): The tokenized query.
Returns
dict[str, tuple[float, str]]: The processed results representing the score and document of each document ID.
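A structural sketch of the augmentation step. The corpus record layout and the score_missing_id helper are hypothetical; the real method presumably computes genuine BM25 scores for the IDs that only the vector retriever returned:

```python
import numpy as np


def score_missing_id(bm25_retriever, query_tokens, corpus_pos: int) -> float:
    """Hypothetical re-scoring hook; returning 0.0 keeps the sketch self-contained."""
    return 0.0


def process_nodes_bm25(
    items: tuple[np.ndarray, np.ndarray],
    missing_ids: list[str],
    dictionary_id_bm25: dict[str, int],
    bm25_retriever,
    query_tokens,
) -> dict[str, tuple[float, str]]:
    """Illustrative stand-in for QueryProcessor.__process_nodes_bm25."""
    docs, scores = items  # bm25s.retrieve returns (documents, scores) arrays
    processed: dict[str, tuple[float, str]] = {}

    # Documents the full-text retriever already returned (dict records assumed).
    for doc, score in zip(docs[0], scores[0]):
        processed[doc["id"]] = (float(score), doc["text"])

    # Augment with the IDs only the vector retriever returned, located through
    # their position in the BM25 corpus.
    for doc_id in missing_ids:
        pos = dictionary_id_bm25[doc_id]
        doc_text = bm25_retriever.corpus[pos]["text"]
        processed[doc_id] = (
            score_missing_id(bm25_retriever, query_tokens, pos),
            doc_text,
        )

    return processed
```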
__process_nodes_vec
__process_nodes_vec(
    items: QueryResult,
    missing_ids: list[str],
    collection: Collection,
    query_embedding: list[float],
)
Processes the retrieval results of the vector retriever for the purpose of hybrid search, augmenting the results with the missing IDs of documents retrieved from the full-text index.
Args
- items (QueryResult): Retrieval results from the vector retriever.
- missing_ids (list[str]): The IDs present in the retrieval results of the full-text retriever but not in those of the vector retriever.
- collection (Collection): The vector retriever.
- query_embedding (list[float]): The embedding of the query.
Returns
dict[str, tuple[float, str]]: The processed results representing the score and document of each document ID.
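The mirror-image sketch for the vector side, assuming a ChromaDB QueryResult layout and cosine scoring for the fetched embeddings (both assumptions):

```python
import numpy as np


def process_nodes_vec(
    items: dict,
    missing_ids: list[str],
    collection,
    query_embedding: list[float],
) -> dict[str, tuple[float, str]]:
    """Illustrative stand-in for QueryProcessor.__process_nodes_vec."""
    processed: dict[str, tuple[float, str]] = {}

    # Hits the vector retriever already returned (Chroma nests results per query).
    for doc_id, doc, dist in zip(
        items["ids"][0], items["documents"][0], items["distances"][0]
    ):
        processed[doc_id] = (1.0 - dist, doc)  # turn a distance into a similarity-like score

    # Fetch and score the IDs only the full-text retriever returned.
    if missing_ids:
        fetched = collection.get(ids=missing_ids, include=["documents", "embeddings"])
        q = np.asarray(query_embedding, dtype=float)
        for doc_id, doc, emb in zip(
            fetched["ids"], fetched["documents"], fetched["embeddings"]
        ):
            e = np.asarray(emb, dtype=float)
            score = float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))
            processed[doc_id] = (score, doc)

    return processed
```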
__rerank
__rerank(
    nodes: list[tuple[str, float, str]], query: str
) -> list[tuple[str, float, str]]
Performs re-ranking of documents against the query: the LLM judge classifies whether each document is relevant to the query.
Args
- nodes (list[tuple[str, float, str]]): The list of tuples, each consisting of an ID, a relevance score, and a document, resulting from the hybrid search mechanism.
- query (str): The query.
Returns
list[tuple[str, float, str]]: The re-ranked results.
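A sketch of the LLM-judge pass, assuming an OpenAI-style chat client and a yes/no verdict; the model name and prompt wording are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def rerank(nodes: list[tuple[str, float, str]], query: str) -> list[tuple[str, float, str]]:
    """Illustrative stand-in: keep only documents the LLM judge deems relevant."""
    kept: list[tuple[str, float, str]] = []
    for doc_id, score, document in nodes:
        prompt = (
            "Is the following table description relevant to the query?\n"
            f"Query: {query}\nDescription: {document}\n"
            "Answer 'yes' or 'no'."
        )
        verdict = (
            client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": prompt}],
            )
            .choices[0]
            .message.content.strip()
            .lower()
        )
        if verdict.startswith("yes"):
            kept.append((doc_id, score, document))
    # Relevant documents keep their hybrid-search score ordering.
    return sorted(kept, key=lambda node: node[1], reverse=True)
```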
__get_relevance_prompt
__get_relevance_prompt(desc: str, desc_type: str, query: str)
Returns relevance prompts for re-ranking purposes. The prompt format is slightly different between content summaries and context (metadata).
Args
- desc (str): The description of a table, which is either a content summary or context (metadata).
- desc_type (str): The description type: content or context.
- query (str): The query to be compared against.
Returns
str: A relevance prompt.
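A minimal sketch of such a prompt builder; only the content/context distinction comes from the docs above, the wording itself is illustrative:

```python
def get_relevance_prompt(desc: str, desc_type: str, query: str) -> str:
    """Illustrative stand-in for QueryProcessor.__get_relevance_prompt."""
    if desc_type == "content":
        framing = "a summary of a table's contents"
    else:  # "context"
        framing = "contextual metadata about a table"
    return (
        f"You are given {framing} and a user query.\n"
        f"Description: {desc}\n"
        f"Query: {query}\n"
        "Answer 'yes' if the table is relevant to the query, otherwise answer 'no'."
    )
```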