IndexGenerator

Generates indexes for content summaries and context (metadata) associated with tables.

This class provides a method to create hybrid (vector and full-text) indexes that help organize information efficiently for later querying.

embedding_model (OpenAI | SentenceTransformer): The model used for text embeddings.
db_path (str): Path to the database file for retrieving content summaries & context.
index_path (str): Path to the directory where indexes are stored.
stemmer (Stemmer): A stemming tool used for text normalization.
vector_index_path (str): Path for vector-based indexing.
fulltext_index_path (str): Path for full-text search indexing.
EMBEDDING_MAX_TOKENS (int): The maximum number of tokens the embedding model supports (hard-coded to 512 for local models and 8191 for OpenAI models).
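
The token limit described for EMBEDDING_MAX_TOKENS could be selected along the following lines. This is only a minimal sketch, not the class's actual code: LocalModel and HostedModel are hypothetical stand-ins for SentenceTransformer and OpenAI, and pick_max_tokens is an illustrative helper that does not exist in the library.

```python
class LocalModel:
    """Hypothetical stand-in for a local SentenceTransformer model."""


class HostedModel:
    """Hypothetical stand-in for an OpenAI embedding client."""


def pick_max_tokens(model: object) -> int:
    """Return the token limit for the given embedding model.

    The two limits come from the attribute description:
    8191 for OpenAI models and 512 for local models.
    """
    if isinstance(model, HostedModel):
        return 8191
    return 512
```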

__init__(
    embed_model: OpenAI | SentenceTransformer, db_path: str, index_path: str
)

generate_index(
    index_name: str, table_ids: list[str] | tuple[str] = None
) -> str

Generates a hybrid index named index_name for the tables identified by table_ids.

index_name (str): The name of the index to be generated.
table_ids (list[str] | tuple[str]): The IDs of the tables to be indexed. More precisely, the content summaries and context (metadata) of these tables are indexed.
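
The overall flow of generate_index might be pictured with the sketch below. It is an illustration under stated assumptions, not the real implementation: HybridIndexSketch is a hypothetical stand-in class, the two dictionaries replace the actual vector and full-text stores, and the JSON shape of the returned string is invented for the example. How the real class treats the None default for table_ids is not documented here, so the sketch simply treats it as an empty list.

```python
import json


class HybridIndexSketch:
    """Illustrative stand-in, not the real IndexGenerator."""

    def __init__(self) -> None:
        # In-memory placeholders for the vector and full-text stores.
        self.vector_docs: dict[str, list[str]] = {}
        self.fulltext_docs: dict[str, list[str]] = {}

    def generate_index(self, index_name: str, table_ids=None) -> str:
        # Assumption: a missing table_ids is treated as "no tables".
        ids = list(table_ids or [])
        # Assumption: both halves of the hybrid index cover the same tables.
        self.vector_docs[index_name] = list(ids)
        self.fulltext_docs[index_name] = list(ids)
        # Hypothetical response shape, serialized to a string.
        return json.dumps({"index_name": index_name, "tables_indexed": len(ids)})
```

A call such as HybridIndexSketch().generate_index("demo", ["t1", "t2"]) registers both table IDs under the same index name in each store.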

__generate_vector_index(index_name: str, chroma_client: ClientAPI) -> Response

Generates a vector index with name index_name using ChromaDB-Deterministic.

__insert_documents_to_vector_index(
    index_id: int,
    table_ids: list[str] | tuple[str],
    chroma_collection: Collection,
) -> Response

Inserts documents (related to the tables associated with table_ids) into a vector index.

index_id (int): The database ID of the vector index.
table_ids (list[str] | tuple[str]): The IDs of the tables to be indexed. More precisely, the content summaries and context (metadata) of these tables are indexed.
chroma_collection (Collection): The vector index.

__generate_fulltext_index(index_name: str)

Generates a full-text index with name index_name using BM25s.
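
To give a sense of what a BM25-based full-text index computes, here is a small pure-Python sketch of BM25 scoring. It does not use the BM25s library or the class's stemmer. The function names build_bm25_index and bm25_scores are hypothetical, tokenization is plain whitespace splitting, and the IDF formula is the common "+1 inside the logarithm" variant, which may differ from what the library uses.

```python
import math
from collections import Counter


def build_bm25_index(docs: list[str]) -> dict:
    """Tokenize documents and collect the statistics BM25 needs."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    avg_len = sum(len(t) for t in tokenized) / len(tokenized) if tokenized else 0.0
    return {"docs": tokenized, "doc_freq": doc_freq, "avg_len": avg_len}


def bm25_scores(index: dict, query: str, k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every indexed document against the query."""
    n_docs = len(index["docs"])
    scores = []
    for tokens in index["docs"]:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            df = index["doc_freq"][term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Length normalization: longer documents are penalized via b.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / index["avg_len"])
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that contain none of the query terms receive a score of 0.0, so ranking by score surfaces only the documents that match at least one term.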

__insert_documents_to_fulltext_index(
    index_id: int, table_ids: list | tuple, retriever: BM25
)

Inserts documents (related to the tables associated with table_ids) into a full-text index.

index_id (int): The database ID of the full-text index.
table_ids (list[str] | tuple[str]): The IDs of the tables to be indexed. More precisely, the content summaries and context (metadata) of these tables are indexed.
retriever (BM25): The full-text index.

__get_table_contexts(table_id: str) -> list[tuple[str, str]]

Retrieves all contexts (metadata) associated with table_id from the database.

__merge_contexts(contexts: list[tuple[str, str]]) -> list[str]

__get_content_summaries(
    table_id: str, summary_type: SummaryType
) -> list[tuple[str, str]]

Retrieves all content summaries associated with table_id from the database.

table_id (str): The ID of the table in the database.
summary_type (SummaryType): The type of summaries to be retrieved (either column narration or row sample).
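
A lookup of this kind could be sketched as follows. Everything about the storage layer here is an assumption made for illustration: the example uses an in-memory SQLite database, a hypothetical table_summaries table, a stand-in SummaryType enum with invented values, and it assumes each returned tuple holds a summary ID and the summary text. The real class may use a different database engine and schema.

```python
import sqlite3
from enum import Enum


class SummaryType(Enum):
    """Stand-in enum; the member values are assumptions."""

    COLUMN_NARRATION = "column_narration"
    ROW_SAMPLE = "row_sample"


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_summaries ("
        "id INTEGER PRIMARY KEY, table_id TEXT, summary_type TEXT, summary TEXT)"
    )
    conn.executemany(
        "INSERT INTO table_summaries (table_id, summary_type, summary) VALUES (?, ?, ?)",
        [
            ("t1", "column_narration", "The price column holds unit prices."),
            ("t1", "row_sample", "price: 9.99 | qty: 3"),
            ("t2", "row_sample", "city: Oslo | temp: 4"),
        ],
    )
    return conn


def get_content_summaries(
    conn: sqlite3.Connection, table_id: str, summary_type: SummaryType
) -> list[tuple[str, str]]:
    """Return (summary_id, summary_text) pairs for one table and summary type."""
    rows = conn.execute(
        "SELECT id, summary FROM table_summaries "
        "WHERE table_id = ? AND summary_type = ? ORDER BY id",
        (table_id, summary_type.value),  # parameterized to avoid SQL injection
    ).fetchall()
    return [(str(row[0]), row[1]) for row in rows]
```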