QueryProcessor
Processes queries against generated hybrid indexes.
This class provides a method to retrieve tables from hybrid indexes, helping users find the tables relevant to their tasks.
Attributes
- pipe (OpenAI | TextGenerationPipeline): The LLM pipeline for inference.
- embedding_model (OpenAI | SentenceTransformer): The model used for text embeddings.
- stemmer (Stemmer): A stemming tool used for text normalization.
- index_path (str): Path to the directory where indexes are stored.
- vector_index_path (str): Path for vector-based indexing.
- fulltext_index_path (str): Path for full-text search indexing.
__init__
__init__(
    llm: OpenAI | TextGenerationPipeline,
    embed_model: OpenAI | SentenceTransformer,
    index_path: str,
)
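A minimal construction sketch, assuming the class is importable from a module named query_processor; the import path and model names below are placeholders, not taken from the source:

```python
from sentence_transformers import SentenceTransformer
from transformers import pipeline

from query_processor import QueryProcessor  # placeholder import path

# Local LLM pipeline and embedding model; an OpenAI client would also fit the signature.
llm = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")
embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

processor = QueryProcessor(
    llm=llm,
    embed_model=embed_model,
    index_path="indexes",  # directory where hybrid indexes are stored
)
```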
query
query(
    index_name: str,
    queries: str | list[str],
    k: int = 1,
    n: int = 5,
    alpha: float = 0.5,
) -> str
Retrieves tables for the given queries against the index index_name.
Args
- index_name (str): The name of the index to be retrieved against.
- queries (str | list[str]): The query or list of queries to be executed.
- k (int): The number of documents associated with the tables to be retrieved.
- n (int): The multiplicative factor of k used to pool more relevant documents for the hybrid retrieval process.
- alpha (float): The weighting factor between the vector and full-text retrievers within a hybrid index. A lower alpha gives more weight to the vector retriever.
Returns
str: A JSON string representing the result of the process (Response).
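A hedged usage sketch of query, continuing the constructor example above; the index name and query text are placeholders:

```python
import json

response_json = processor.query(
    index_name="sales_db",  # placeholder index name
    queries=["Which tables contain quarterly revenue figures?"],
    k=3,        # documents associated with tables to retrieve
    n=5,        # pool n * k candidates before hybrid fusion
    alpha=0.5,  # balance full-text and vector scores equally
)
result = json.loads(response_json)  # the method returns a JSON string (Response)
print(result)
```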
__get_retrievers
__get_retrievers(index_name: str) -> tuple[Collection, bm25s.BM25]
Gets both the vector and full-text retrievers of the index index_name.
Args
- index_name (str): The name of the hybrid index.
Returns
tuple[Collection, bm25s.BM25]: The vector and full-text retrievers.
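The docs do not spell out how the two retrievers are stored. A sketch under the assumption that the vector side is a persisted ChromaDB collection and the full-text side is a saved bm25s index, one directory per index name:

```python
import os

import bm25s
import chromadb


def get_retrievers(index_name: str, vector_index_path: str, fulltext_index_path: str):
    """Illustrative stand-in for QueryProcessor.__get_retrievers."""
    # Vector retriever: a ChromaDB collection named after the index.
    client = chromadb.PersistentClient(path=vector_index_path)
    collection = client.get_collection(name=index_name)

    # Full-text retriever: a bm25s index saved under a per-index directory.
    bm25_retriever = bm25s.BM25.load(
        os.path.join(fulltext_index_path, index_name), load_corpus=True
    )
    return collection, bm25_retriever
```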
__hybrid_retriever
__hybrid_retriever(
    bm25_retriever: BM25,
    vec_retriever: Collection,
    bm25_res: tuple[ndarray, ndarray],
    vec_res: QueryResult,
    k: int,
    query: str,
    alpha: float,
    query_tokens: Tokenized,
    query_embedding: list[float],
) -> list[tuple[str, float, str]]
Performs hybrid retrieval for a given query by fusing the results of the full-text and vector retrievers.
Args
- bm25_retriever (BM25): The full-text retriever within the hybrid index.
- vec_retriever (Collection): The vector retriever within the hybrid index.
- bm25_res (tuple[ndarray, ndarray]): Retrieval results from the full-text retriever.
- vec_res (QueryResult): Retrieval results from the vector retriever.
- k (int): The number of documents retrieved from both retrievers.
- query (str): The query.
- alpha (float): The weighting factor between the vector and full-text retrievers within a hybrid index. A lower alpha gives more weight to the vector retriever.
- query_tokens (Tokenized): The tokenized query.
- query_embedding (list[float]): The embedding of the query.
Returns
list[tuple[str, float, str]]: The result of the hybrid search.
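The alpha parameter suggests the two result lists are fused with a convex combination of scores. A minimal fusion sketch, assuming per-retriever min-max normalization (the normalization choice is an assumption, not taken from the source):

```python
def fuse_scores(
    bm25_scores: dict[str, float],
    vec_scores: dict[str, float],
    alpha: float,
    k: int,
) -> list[tuple[str, float]]:
    """Combine per-ID scores from both retrievers into one ranking."""

    def minmax(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

    bm25_n, vec_n = minmax(bm25_scores), minmax(vec_scores)
    # Lower alpha gives more weight to the vector retriever, matching the docs above.
    fused = {
        doc_id: alpha * bm25_n.get(doc_id, 0.0) + (1.0 - alpha) * vec_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(vec_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]
```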
__process_nodes_bm25
__process_nodes_bm25(
    items: tuple[ndarray, ndarray],
    missing_ids: list[str],
    dictionary_id_bm25: dict[str, int],
    bm25_retriever: BM25,
    query_tokens: Tokenized,
)
Processes the retrieval results of the full-text retriever for the purpose of hybrid search, augmenting the results with the missing IDs of documents retrieved from the vector index.
Args
- items (tuple[ndarray, ndarray]): Retrieval results from the full-text retriever.
- missing_ids (list[str]): The IDs present in the retrieval results of the vector retriever but not in those of the full-text retriever.
- dictionary_id_bm25 (dict[str, int]): The table-document associations within the full-text retriever.
- bm25_retriever (BM25): The full-text retriever.
- query_tokens (Tokenized): The tokenized query.
Returns
dict[str, tuple[float, str]]: The processed results representing the score and document of each document ID.
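A structural sketch of the augmentation step. The corpus record layout and the score_missing_id helper are hypothetical; the real method presumably computes genuine BM25 scores for the IDs that only the vector retriever returned:

```python
import numpy as np


def score_missing_id(bm25_retriever, query_tokens, corpus_pos: int) -> float:
    """Hypothetical re-scoring hook; returning 0.0 keeps the sketch self-contained."""
    return 0.0


def process_nodes_bm25(
    items: tuple[np.ndarray, np.ndarray],
    missing_ids: list[str],
    dictionary_id_bm25: dict[str, int],
    bm25_retriever,
    query_tokens,
) -> dict[str, tuple[float, str]]:
    """Illustrative stand-in for QueryProcessor.__process_nodes_bm25."""
    docs, scores = items  # bm25s.retrieve returns (documents, scores) arrays
    processed: dict[str, tuple[float, str]] = {}

    # Documents the full-text retriever already returned (dict records assumed).
    for doc, score in zip(docs[0], scores[0]):
        processed[doc["id"]] = (float(score), doc["text"])

    # Augment with the IDs only the vector retriever returned, located through
    # their position in the BM25 corpus.
    for doc_id in missing_ids:
        pos = dictionary_id_bm25[doc_id]
        doc_text = bm25_retriever.corpus[pos]["text"]
        processed[doc_id] = (
            score_missing_id(bm25_retriever, query_tokens, pos),
            doc_text,
        )

    return processed
```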
__process_nodes_vec
__process_nodes_vec(
    items: QueryResult,
    missing_ids: list[str],
    collection: Collection,
    query_embedding: list[float],
)
Processes the retrieval results of the vector retriever for the purpose of hybrid search, augmenting the results with the missing IDs of documents retrieved from the full-text index.
Args
- items (QueryResult): Retrieval results from the vector retriever.
- missing_ids (list[str]): The IDs present in the retrieval results of the full-text retriever but not in those of the vector retriever.
- collection (Collection): The vector retriever.
- query_embedding (list[float]): The embedding of the query.
Returns
dict[str, tuple[float, str]]: The processed results representing the score and document of each document ID.
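The mirror-image sketch for the vector side, assuming a ChromaDB QueryResult layout and cosine scoring for the fetched embeddings (both assumptions):

```python
import numpy as np


def process_nodes_vec(
    items: dict,
    missing_ids: list[str],
    collection,
    query_embedding: list[float],
) -> dict[str, tuple[float, str]]:
    """Illustrative stand-in for QueryProcessor.__process_nodes_vec."""
    processed: dict[str, tuple[float, str]] = {}

    # Hits the vector retriever already returned (Chroma nests results per query).
    for doc_id, doc, dist in zip(
        items["ids"][0], items["documents"][0], items["distances"][0]
    ):
        processed[doc_id] = (1.0 - dist, doc)  # turn a distance into a similarity-like score

    # Fetch and score the IDs only the full-text retriever returned.
    if missing_ids:
        fetched = collection.get(ids=missing_ids, include=["documents", "embeddings"])
        q = np.asarray(query_embedding, dtype=float)
        for doc_id, doc, emb in zip(
            fetched["ids"], fetched["documents"], fetched["embeddings"]
        ):
            e = np.asarray(emb, dtype=float)
            score = float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))
            processed[doc_id] = (score, doc)

    return processed
```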
__rerank
__rerank(
    nodes: list[tuple[str, float, str]], query: str
) -> list[tuple[str, float, str]]
Performs re-ranking of documents against the query: the LLM judge classifies whether each document is relevant to the query.
Args
- nodes (list[tuple[str, float, str]]): The list of tuples, each consisting of an ID, a relevance score, and a document, resulting from the hybrid search mechanism.
- query (str): The query.
Returns
list[tuple[str, float, str]]: The re-ranked results.
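A sketch of the LLM-judge pass, assuming an OpenAI-style chat client and a yes/no verdict; the model name and prompt wording are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def rerank(nodes: list[tuple[str, float, str]], query: str) -> list[tuple[str, float, str]]:
    """Illustrative stand-in: keep only documents the LLM judge deems relevant."""
    kept: list[tuple[str, float, str]] = []
    for doc_id, score, document in nodes:
        prompt = (
            "Is the following table description relevant to the query?\n"
            f"Query: {query}\nDescription: {document}\n"
            "Answer 'yes' or 'no'."
        )
        verdict = (
            client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": prompt}],
            )
            .choices[0]
            .message.content.strip()
            .lower()
        )
        if verdict.startswith("yes"):
            kept.append((doc_id, score, document))
    # Relevant documents keep their hybrid-search score ordering.
    return sorted(kept, key=lambda node: node[1], reverse=True)
```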
__get_relevance_prompt
__get_relevance_prompt(desc: str, desc_type: str, query: str)
Returns relevance prompts for re-ranking purposes. The prompt format is slightly different between content summaries and context (metadata).
Args
- desc (str): The description of a table, which is either a content summary or context (metadata).
- desc_type (str): The description type: content or context.
- query (str): The query to be compared against.
Returns
str: A relevance prompt.
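A minimal sketch of such a prompt builder; only the content/context distinction comes from the docs above, the wording itself is illustrative:

```python
def get_relevance_prompt(desc: str, desc_type: str, query: str) -> str:
    """Illustrative stand-in for QueryProcessor.__get_relevance_prompt."""
    if desc_type == "content":
        framing = "a summary of a table's contents"
    else:  # "context"
        framing = "contextual metadata about a table"
    return (
        f"You are given {framing} and a user query.\n"
        f"Description: {desc}\n"
        f"Query: {query}\n"
        "Answer 'yes' if the table is relevant to the query, otherwise answer 'no'."
    )
```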