QueryProcessor
Processes queries against generated hybrid indexes.
This class provides a method to retrieve tables from hybrid indexes, helping people find the relevant tables for their tasks.
Attributes
- pipe (OpenAI | TextGenerationPipeline): The LLM pipeline for inference.
- embedding_model (OpenAI | SentenceTransformer): The model used for text embeddings.
- stemmer (Stemmer): A stemming tool used for text normalization.
- index_path (str): Path to the directory where indexes are stored.
- vector_index_path (str): Path for vector-based indexing.
- fulltext_index_path (str): Path for full-text search indexing.
__init__
__init__(
llm: OpenAI | TextGenerationPipeline,
embed_model: OpenAI | SentenceTransformer,
index_path: str,
)
query
query(
index_name: str,
queries: str | list[str],
k: int = 1,
n: int = 5,
alpha: float = 0.5,
) -> str
Retrieves tables for the given queries against the index index_name.
Args
- index_name (str): The name of the index to be retrieved against.
- queries (str | list[str]): The query or list of queries to be executed.
- k (int): The number of documents associated with the tables to be retrieved.
- n (int): The multiplicative factor of k to pool more relevant documents for the hybrid retrieval process.
- alpha (float): The weighting factor of the vector and full-text retrievers within a hybrid index. Lower alpha gives more weight to the vector retriever.
Returns
str: A JSON string representing the result of the process (Response).
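The interplay of queries, k, and n can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names normalize_queries and candidate_pool_size are hypothetical, and the pool size of k * n is an assumption based on the description of n as a multiplicative factor of k.

```python
def normalize_queries(queries):
    """Return a list of queries whether one string or a list was given."""
    if isinstance(queries, str):
        return [queries]
    return list(queries)


def candidate_pool_size(k, n):
    """Assumed number of candidates pooled per retriever before the top k are kept."""
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive integers")
    return k * n


# Hypothetical query text; k and n use the defaults from the signature.
queries = normalize_queries("Which tables list city populations?")
pool = candidate_pool_size(k=1, n=5)
```

Under this assumption, a larger n widens the candidate pool that the hybrid step can draw from without changing how many documents are finally returned.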
__get_retrievers
__get_retrievers(index_name: str) -> tuple[Collection, bm25s.BM25]
Get both vector and full-text retrievers of the index index_name.
Args
- index_name (str): The name of the hybrid index.
Returns
tuple[Collection, bm25s.BM25]: The vector and full-text retrievers.
__hybrid_retriever
__hybrid_retriever(
bm25_retriever: BM25,
vec_retriever: Collection,
bm25_res: tuple[ndarray, ndarray],
vec_res: QueryResult,
k: int,
query: str,
alpha: float,
query_tokens: Tokenized,
query_embedding: list[float],
) -> list[tuple[str, float, str]]
Performs hybrid search by combining the retrieval results of the full-text and vector retrievers, weighted by alpha.
Args
- bm25_retriever (BM25): The full-text retriever within the hybrid index.
- vec_retriever (Collection): The vector retriever within the hybrid index.
- bm25_res (tuple[ndarray, ndarray]): Retrieval results from the full-text retriever.
- vec_res (QueryResult): Retrieval results from the vector retriever.
- k (int): The number of documents retrieved from both retrievers.
- query (str): The query.
- alpha (float): The weighting factor of the vector and full-text retrievers within a hybrid index. Lower alpha gives more weight to the vector retriever.
- query_tokens (Tokenized): The tokenized query.
- query_embedding (list[float]): The embedding of the query.
Returns
list[tuple[str, float, str]]: The result of the hybrid search.
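One way to picture the role of alpha is a weighted sum of normalized scores. The sketch below is an assumption, not the method's actual code: the names min_max_normalize and fuse_scores are hypothetical, min-max normalization is a swapped-in choice, and both score dictionaries are assumed to use "higher is better" scores over the same document IDs.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def fuse_scores(bm25_scores, vec_scores, alpha, k):
    """Blend two score dicts over the same IDs and return the top-k (id, score) pairs.

    alpha weights the full-text score and (1 - alpha) the vector score,
    so a lower alpha favors the vector retriever.
    """
    if set(bm25_scores) != set(vec_scores):
        raise ValueError("both retrievers must score the same document IDs")
    bm25_norm = min_max_normalize(bm25_scores)
    vec_norm = min_max_normalize(vec_scores)
    fused = {
        doc_id: alpha * bm25_norm[doc_id] + (1 - alpha) * vec_norm[doc_id]
        for doc_id in bm25_norm
    }
    # Highest fused score first, then keep only the top k.
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

With alpha = 1.0 only the full-text scores matter, and with alpha = 0.0 only the vector scores do, which matches the documented direction of the parameter.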
__process_nodes_bm25
__process_nodes_bm25(
items: tuple[ndarray, ndarray],
missing_ids: list[str],
dictionary_id_bm25: dict[str, int],
bm25_retriever: BM25,
query_tokens: Tokenized,
)
Processes the full-text retriever's results for hybrid search by augmenting them with the IDs of documents that only the vector retriever returned.
Args
- items (tuple[ndarray, ndarray]): Retrieval results from the full-text retriever.
- missing_ids (list[str]): The IDs present in the vector retriever's results but absent from the full-text retriever's results.
- dictionary_id_bm25 (dict[str, int]): The table-document associations within the full-text retriever.
- bm25_retriever (BM25): The full-text retriever.
- query_tokens (Tokenized): The tokenized query.
Returns
dict[str, tuple[float, str]]: The processed results representing the score and document of each document ID.
__process_nodes_vec
__process_nodes_vec(
items: QueryResult,
missing_ids: list[str],
collection: Collection,
query_embedding: list[float],
)
Processes the vector retriever's results for hybrid search by augmenting them with the IDs of documents that only the full-text retriever returned.
Args
- items (QueryResult): Retrieval results from the vector retriever.
- missing_ids (list[str]): The IDs present in the full-text retriever's results but absent from the vector retriever's results.
- collection (Collection): The vector retriever.
- query_embedding (list[float]): The embedding of the query.
Returns
dict[str, tuple[float, str]]: The processed results representing the score and document of each document ID.
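The augmentation performed by the two processing methods above can be sketched in plain Python. This is an illustration under stated assumptions rather than the actual implementation: find_missing_ids and augment are hypothetical helpers, and score_fn stands in for whatever the real retriever uses to score a document it did not originally return.

```python
def find_missing_ids(bm25_ids, vec_ids):
    """Return (IDs missing from full-text results, IDs missing from vector results)."""
    bm25_set, vec_set = set(bm25_ids), set(vec_ids)
    missing_in_bm25 = [doc_id for doc_id in vec_ids if doc_id not in bm25_set]
    missing_in_vec = [doc_id for doc_id in bm25_ids if doc_id not in vec_set]
    return missing_in_bm25, missing_in_vec


def augment(results, missing_ids, score_fn, documents):
    """Add a (score, document) entry for every missing ID.

    results:   dict mapping ID -> (score, document) from one retriever
    score_fn:  hypothetical callable that scores a document ID for the query
    documents: dict mapping ID -> document text
    """
    augmented = dict(results)  # copy so the caller's dict is left untouched
    for doc_id in missing_ids:
        augmented[doc_id] = (score_fn(doc_id), documents[doc_id])
    return augmented
```

After both result sets are augmented this way, they cover the same document IDs, so every document has one score from each retriever.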
__rerank
__rerank(
nodes: list[tuple[str, float, str]], query: str
) -> list[tuple[str, float, str]]
Re-ranks documents against the query. An LLM judge classifies each document as relevant or not relevant to the query.
Args
- nodes (list[tuple[str, float, str]]): The list of tuples produced by the hybrid search mechanism, each consisting of an ID, a relevance score, and a document.
- query (str): The query.
Returns
list[tuple[str, float, str]]: The re-ranked results.
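A judge-based re-ranking step might look like the sketch below. How the real method uses the judge's verdicts is not stated here, so this assumes the simplest policy: documents judged relevant are moved ahead of the rest, with the original order preserved inside each group. The is_relevant callable and keyword_judge are hypothetical stand-ins for the LLM judge.

```python
def rerank(nodes, query, is_relevant):
    """Stable partition of (id, score, document) tuples: relevant ones first."""
    relevant, not_relevant = [], []
    for node in nodes:
        _doc_id, _score, document = node
        if is_relevant(document, query):
            relevant.append(node)
        else:
            not_relevant.append(node)
    return relevant + not_relevant


def keyword_judge(document, query):
    """Toy stand-in for the LLM judge: relevant if any query word occurs in the document."""
    text = document.lower()
    return any(word in text for word in query.lower().split())
```

Passing the judge in as a callable keeps the ordering logic testable without calling a model.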
__get_relevance_prompt
__get_relevance_prompt(desc: str, desc_type: str, query: str)
Returns relevance prompts for re-ranking purposes. The prompt format is slightly different between content summaries and context (metadata).
Args
- desc (str): The description of a table, which is either a content summary or context (metadata).
- desc_type (str): The description type: content or context.
- query (str): The query to be compared against.
Returns
str: A relevance prompt.
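The branching on desc_type can be sketched as below. The prompt wording is invented for illustration and is not the text the library sends to the LLM; build_relevance_prompt is a hypothetical name, and the accepted values "content" and "context" are taken from the argument description above.

```python
def build_relevance_prompt(desc, desc_type, query):
    """Build an illustrative yes/no relevance prompt for one table description."""
    if desc_type == "content":
        intro = "The following is a summary of a table's content:"
    elif desc_type == "context":
        intro = "The following is the context (metadata) of a table:"
    else:
        raise ValueError("desc_type must be 'content' or 'context'")
    # Hypothetical template: description first, then the query and the question.
    return (
        f"{intro}\n{desc}\n\n"
        f"Query: {query}\n"
        "Is this table relevant to the query? Answer 'yes' or 'no'."
    )
```

Constraining the answer to yes or no keeps the judge's output easy to parse into a relevance label.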