IndexGenerator

Generates indexes for content summaries and context (metadata) associated with tables.

This class provides a method to create hybrid indexes (vector and full-text) that help organize information efficiently for later querying.

Attributes
  • embedding_model (OpenAI | SentenceTransformer): The model used for text embeddings.
  • db_path (str): Path to the database file for retrieving content summaries & context.
  • index_path (str): Path to the directory where indexes are stored.
  • stemmer (Stemmer): A stemming tool used for text normalization.
  • vector_index_path (str): Path for vector-based indexing.
  • fulltext_index_path (str): Path for full-text search indexing.
  • EMBEDDING_MAX_TOKENS (int): The maximum number of tokens the embedding model supports (hard-coded to 512 for local models and 8191 for OpenAI models).
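
The two hard-coded limits for EMBEDDING_MAX_TOKENS can be pictured with the minimal sketch below. The stand-in classes and the helper name embedding_max_tokens are hypothetical and exist only for illustration; the class itself may choose the limit differently.

```python
# Stand-in classes for illustration only (assumption): the real models come
# from the openai and sentence-transformers packages.
class FakeOpenAI:
    pass


class FakeSentenceTransformer:
    pass


OPENAI_MAX_TOKENS = 8191  # limit stated above for OpenAI models
LOCAL_MAX_TOKENS = 512    # limit stated above for local models


def embedding_max_tokens(model: object) -> int:
    """Return the token limit for a model (hypothetical helper)."""
    if isinstance(model, FakeOpenAI):
        return OPENAI_MAX_TOKENS
    # Anything else is treated as a local model in this sketch.
    return LOCAL_MAX_TOKENS


print(embedding_max_tokens(FakeOpenAI()))
print(embedding_max_tokens(FakeSentenceTransformer()))
```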

__init__

__init__(
    embed_model: OpenAI | SentenceTransformer, db_path: str, index_path: str
)

generate_index

generate_index(
    index_name: str, table_ids: list[str] | tuple[str] = None
) -> str

Generates a hybrid index named index_name for the tables identified by table_ids.

Args
  • index_name (str): The name of the index to be generated.
  • table_ids (list[str] | tuple[str]): The IDs of the tables to index (specifically, their content summaries and context/metadata).
Returns
  • str: A JSON string representing the result of the process (a serialized Response).
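
Because generate_index returns a JSON string, a caller would typically parse it before checking the outcome. The sketch below is only an illustration: the field names "status", "message", and "data", the value "SUCCESS", and the helper index_succeeded are assumptions about how Response serializes and are not confirmed by this page.

```python
import json

# In real use the string would come from a call such as (shown as a comment
# because the class is not importable in this sketch):
#   result_json = generator.generate_index("demo_index", table_ids=["t1", "t2"])
# Hypothetical payload; the field names are assumptions, not a documented schema.
sample_result = json.dumps(
    {
        "status": "SUCCESS",
        "message": "Index 'demo_index' generated.",
        "data": {"table_ids": ["t1", "t2"]},
    }
)


def index_succeeded(result_json: str) -> bool:
    """Return True if the (assumed) status field reports success."""
    result = json.loads(result_json)
    return result.get("status") == "SUCCESS"


print(index_succeeded(sample_result))
```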

__generate_vector_index

__generate_vector_index(index_name: str, chroma_client: ClientAPI) -> Response

Generates a vector index with name index_name using ChromaDB-Deterministic.

Args
  • index_name (str): The name of the index to be generated.
  • chroma_client (ClientAPI): A Client API for ChromaDB-Deterministic.
Returns
  • Response: A Response object of the process.

__insert_documents_to_vector_index

__insert_documents_to_vector_index(
    index_id: int,
    table_ids: list[str] | tuple[str],
    chroma_collection: Collection,
) -> Response

Inserts documents (related to the tables associated with table_ids) into a vector index.

Args
  • index_id (int): The database ID of the vector index.
  • table_ids (list[str] | tuple[str]): The IDs of the tables to index (specifically, their content summaries and context/metadata).
  • chroma_collection (Collection): The vector index.
Returns
  • Response: A Response object of the process.

__generate_fulltext_index

__generate_fulltext_index(index_name: str)

Generates a full-text index with name index_name using BM25s.

Args
  • index_name (str): The name of the index to be generated.
Returns
  • Response: A Response object of the process.
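
To give a feel for what a BM25 full-text index computes, here is a from-scratch sketch of BM25 scoring in plain Python. It does not use the BM25s library or the class's stemmer. The function names, the sample documents, and the parameter defaults k1=1.5 and b=0.75 are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase whitespace split; a real pipeline would also stem tokens.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query (expects a non-empty docs list)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes longer-than-average documents.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "column city holds city names",
    "row sample with price and quantity",
    "context about sales by city",
]
scores = bm25_scores("city", docs)
print(scores)
```

Documents that mention the query term more often, relative to their length, score higher; documents without the term score zero.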

__insert_documents_to_fulltext_index

__insert_documents_to_fulltext_index(
    index_id: int, table_ids: list | tuple, retriever: BM25
)

Inserts documents (related to the tables associated with table_ids) into a full-text index.

Args
  • index_id (int): The database ID of the full-text index.
  • table_ids (list[str] | tuple[str]): The IDs of the tables to index (specifically, their content summaries and context/metadata).
  • retriever (BM25): The full-text index.
Returns
  • Response: A Response object of the process.

__get_table_contexts

__get_table_contexts(table_id: str) -> list[tuple[str, str]]

Retrieves all contexts (metadata) associated with table_id from the database.

Args
  • table_id (str): The ID of the table in the database.
Returns
  • list[tuple[str, str]]: The contexts and their associated IDs.

__merge_contexts

__merge_contexts(contexts: list[tuple[str, str]]) -> list[str]

__get_content_summaries

__get_content_summaries(
    table_id: str, summary_type: SummaryType
) -> list[tuple[str, str]]

Retrieves all content summaries associated with table_id from the database.

Args
  • table_id (str): The ID of the table in the database.
  • summary_type (SummaryType): The type of summaries to be retrieved (either column narration or row sample).
Returns
  • list[tuple[str, str]]: The content summaries and their associated IDs.