IndexGenerator
Generates indexes over the content summaries and contexts (metadata) associated with tables.
This class provides a method for creating hybrid indexes (vector and full-text) that organize this information so it can be queried efficiently later.
Attributes
- embedding_model (OpenAI | SentenceTransformer): The model used for text embeddings.
- db_path (str): Path to the database file for retrieving content summaries & context.
- index_path (str): Path to the directory where indexes are stored.
- stemmer (Stemmer): A stemming tool used for text normalization.
- vector_index_path (str): Path for vector-based indexing.
- fulltext_index_path (str): Path for full-text search indexing.
- EMBEDDING_MAX_TOKENS (int): The maximum number of tokens the embedding model supports (hard-coded to 512 for local models and 8191 for OpenAI models).
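The two hard-coded limits can be pictured with a minimal sketch. The helper names (embedding_max_tokens, truncate_tokens) and the whitespace tokenizer are assumptions made for illustration; this reference does not show how the class itself selects or applies the limit.

```python
# Limits quoted in the EMBEDDING_MAX_TOKENS description.
LOCAL_MAX_TOKENS = 512
OPENAI_MAX_TOKENS = 8191


def embedding_max_tokens(uses_openai: bool) -> int:
    """Return the token limit for the kind of embedding model in use (hypothetical helper)."""
    return OPENAI_MAX_TOKENS if uses_openai else LOCAL_MAX_TOKENS


def truncate_tokens(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated tokens.

    Assumption: whitespace splitting stands in for the model's real tokenizer.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


if __name__ == "__main__":
    limit = embedding_max_tokens(uses_openai=False)
    long_text = "word " * 1000
    # Number of tokens that survive truncation for a local model.
    print(len(truncate_tokens(long_text, limit).split()))
```

A real tokenizer will usually count more tokens than whitespace splitting does, so a production check would likely use the model's own tokenizer.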
__init__
__init__(
embed_model: OpenAI | SentenceTransformer, db_path: str, index_path: str
)
generate_index
generate_index(
index_name: str, table_ids: list[str] | tuple[str] = None
) -> str
Generates a hybrid index named index_name for the given table_ids.
Args
- index_name (str): The name of the index to be generated.
- table_ids (list[str] | tuple[str]): The IDs of tables to be indexed (to be exact, their content summaries & context/metadata).
Returns
str: A JSON string representing the result of the process (Response).
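Because generate_index returns a JSON string rather than a Response object, callers need to decode it before inspecting the result. The sketch below uses a stub in place of the real method; the "status" and "message" fields are assumptions, since the Response schema is not described in this reference.

```python
import json


def generate_index_stub(index_name: str, table_ids=None) -> str:
    """Hypothetical stand-in for IndexGenerator.generate_index.

    The "status" and "message" fields are assumed, not the documented Response schema.
    """
    return json.dumps({"status": "success", "message": f"Index '{index_name}' created."})


raw = generate_index_stub("my_index", table_ids=["table_1", "table_2"])
result = json.loads(raw)  # the return value is a JSON string, so decode it before use
print(result["status"])
```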
__generate_vector_index
__generate_vector_index(index_name: str, chroma_client: ClientAPI) -> Response
Generates a vector index with name index_name using ChromaDB-Deterministic.
Args
- index_name (str): The name of the index to be generated.
- chroma_client (ClientAPI): A Client API for ChromaDB-Deterministic.
Returns
Response: A Response object describing the result of the process.
__insert_documents_to_vector_index
__insert_documents_to_vector_index(
index_id: int,
table_ids: list[str] | tuple[str],
chroma_collection: Collection,
) -> Response
Inserts documents (related to the tables associated with table_ids) into
a vector index.
Args
- index_id (int): The database ID of the vector index.
- table_ids (list[str] | tuple[str]): The IDs of tables to be indexed (to be exact, their content summaries & context/metadata).
- chroma_collection (Collection): The vector index.
Returns
Response: A Response object describing the result of the process.
__generate_fulltext_index
__generate_fulltext_index(index_name: str)
Generates a full-text index with name index_name using BM25s.
Args
- index_name (str): The name of the index to be generated.
Returns
Response: A Response object describing the result of the process.
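To make the ranking behind a BM25 full-text index concrete, here is a small pure-Python sketch of Okapi BM25 scoring. It is not the BM25s library and not this class's implementation: the function names, the whitespace tokenizer (no stemming), the parameters k1 and b, and the smoothed IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (no stemming in this sketch)."""
    return text.lower().split()


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in corpus against query with Okapi BM25."""
    docs = [tokenize(doc) for doc in corpus]
    if not docs:
        return []
    n_docs = len(docs)
    # Fall back to 1.0 so an all-empty corpus does not divide by zero.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = tokenize(query)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer documents need more matches to score as high.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "sales by region and year",
    "employee salaries per department",
    "region codes",
]
scores = bm25_scores("region sales", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])
```

In this toy corpus the first document matches both query terms, the third matches one, and the second matches none, so they rank in that order.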
__insert_documents_to_fulltext_index
__insert_documents_to_fulltext_index(
index_id: int, table_ids: list | tuple, retriever: BM25
)
Inserts documents (related to the tables associated with table_ids) into
a full-text index.
Args
- index_id (int): The database ID of the full-text index.
- table_ids (list[str] | tuple[str]): The IDs of tables to be indexed (to be exact, their content summaries & context/metadata).
- retriever (BM25): The full-text index.
Returns
Response: A Response object describing the result of the process.
__get_table_contexts
__get_table_contexts(table_id: str) -> list[tuple[str, str]]
Retrieves all contexts (metadata) associated with table_id from the
database.
Args
- table_id (str): The ID of the table in the database.
Returns
list[tuple[str, str]]: The contexts and their associated IDs.
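A lookup of this shape might resemble the sketch below. Everything specific in it is an assumption: the use of SQLite, the hypothetical table_contexts table and its columns, and the (ID, context) ordering of each returned pair. None of these details are confirmed by this reference.

```python
import sqlite3


def get_table_contexts(conn: sqlite3.Connection, table_id: str) -> list[tuple[str, str]]:
    """Return (context_id, context) pairs for one table.

    Assumptions: a SQLite database and a hypothetical table_contexts schema.
    """
    rows = conn.execute(
        "SELECT id, context FROM table_contexts WHERE table_id = ? ORDER BY id",
        (table_id,),  # parameterized query avoids SQL injection via table_id
    ).fetchall()
    return [(str(context_id), context) for context_id, context in rows]


# In-memory database populated with made-up rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_contexts (id INTEGER PRIMARY KEY, table_id TEXT, context TEXT)"
)
conn.executemany(
    "INSERT INTO table_contexts (table_id, context) VALUES (?, ?)",
    [
        ("t1", "Quarterly sales report"),
        ("t1", "Source: finance team"),
        ("t2", "HR export"),
    ],
)
contexts = get_table_contexts(conn, "t1")
print(contexts)
```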
__merge_contexts
__merge_contexts(contexts: list[tuple[str, str]]) -> list[str]
__get_content_summaries
__get_content_summaries(
table_id: str, summary_type: SummaryType
) -> list[tuple[str, str]]
Retrieves all content summaries associated with table_id from the
database.
Args
- table_id (str): The ID of the table in the database.
- summary_type (SummaryType): The type of summaries to be retrieved (either column narration or row sample).
Returns
list[tuple[str, str]]: The content summaries and their associated IDs.