IndexGenerator
Generates indexes for content summaries and context (metadata) associated with tables.
This class provides a method to create hybrid indexes (vector and full-text) that help organize information efficiently for later querying.
Attributes
- embedding_model (OpenAI | SentenceTransformer): The model used for text embeddings.
- db_path (str): Path to the database file for retrieving content summaries & context.
- index_path (str): Path to the directory where indexes are stored.
- stemmer (Stemmer): A stemming tool used for text normalization.
- vector_index_path (str): Path for vector-based indexing.
- fulltext_index_path (str): Path for full-text search indexing.
- EMBEDDING_MAX_TOKENS (int): The maximum number of tokens the embedding model supports (hard-coded to 512 for local models and 8191 for OpenAI models).
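As a minimal sketch of how the hard-coded token limit could be selected from the model type, consider the following. The stand-in classes, the constant names, and the helper embedding_max_tokens are hypothetical; only the two limits (512 and 8191) come from the description above, and the class's actual selection logic may differ.

```python
# Hypothetical stand-ins for the two supported model types. The real classes
# come from the openai and sentence_transformers packages.
class OpenAI:
    pass


class SentenceTransformer:
    pass


LOCAL_MAX_TOKENS = 512    # limit stated above for local models
OPENAI_MAX_TOKENS = 8191  # limit stated above for OpenAI models


def embedding_max_tokens(embed_model) -> int:
    """Return the token limit for the given embedding model (assumed logic)."""
    if isinstance(embed_model, OpenAI):
        return OPENAI_MAX_TOKENS
    return LOCAL_MAX_TOKENS


openai_limit = embedding_max_tokens(OpenAI())
local_limit = embedding_max_tokens(SentenceTransformer())
```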
__init__
__init__(
embed_model: OpenAI | SentenceTransformer, db_path: str, index_path: str
)
generate_index
generate_index(
index_name: str, table_ids: list[str] | tuple[str] = None
) -> str
Generates a hybrid index named index_name for the given table_ids.
Args
- index_name (str): The name of the index to be generated.
- table_ids (list[str] | tuple[str]): The IDs of tables to be indexed (to be exact, their content summaries & context/metadata).
Returns
str: A JSON string representing the result of the process (Response).
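Because generate_index returns a JSON string rather than a Response object, callers will usually decode it before inspecting the outcome. The sketch below shows one way to do that. The helper parse_index_result and the field names status and message are assumptions for illustration; the real Response fields are not documented here and may differ.

```python
import json


def parse_index_result(result_json: str) -> dict:
    """Decode the JSON string returned by generate_index into a dict."""
    payload = json.loads(result_json)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


# In real use the string would come from a call such as:
#   result_json = generator.generate_index("demo_index", table_ids=["t1"])
# Hypothetical payload; the actual Response fields may differ.
example = '{"status": "SUCCESS", "message": "Index demo_index created"}'
result = parse_index_result(example)
```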
__generate_vector_index
__generate_vector_index(index_name: str, chroma_client: ClientAPI) -> Response
Generates a vector index named index_name using ChromaDB-Deterministic.
Args
- index_name (str): The name of the index to be generated.
- chroma_client (ClientAPI): A client API for ChromaDB-Deterministic.
Returns
Response: A Response object describing the result of the process.
__insert_documents_to_vector_index
__insert_documents_to_vector_index(
index_id: int,
table_ids: list[str] | tuple[str],
chroma_collection: Collection,
) -> Response
Inserts documents (related to the tables associated with table_ids) into a vector index.
Args
- index_id (int): The database ID of the vector index.
- table_ids (list[str] | tuple[str]): The IDs of tables to be indexed (to be exact, their content summaries & context/metadata).
- chroma_collection (Collection): The vector index.
Returns
Response: A Response object describing the result of the process.
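One plausible way to prepare documents for insertion is to split them into batches of IDs and texts, as sketched below. The helper to_chroma_batches, the (doc_id, text) tuple order, and the batch size are assumptions, not the method's actual implementation. Each resulting batch matches the ids and documents keyword arguments of ChromaDB's Collection.add.

```python
def to_chroma_batches(
    rows: list[tuple[str, str]], batch_size: int = 100
) -> list[dict[str, list[str]]]:
    """Group (doc_id, text) pairs into batches of ids and documents."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batches.append(
            {
                "ids": [doc_id for doc_id, _ in chunk],
                "documents": [text for _, text in chunk],
            }
        )
    return batches


# Hypothetical documents: content summaries and contexts keyed by ID.
rows = [
    ("t1_summary_0", "Column city holds the names of cities."),
    ("t1_summary_1", "Column population holds resident counts."),
    ("t1_context_0", "Published by the city council."),
]
batches = to_chroma_batches(rows, batch_size=2)
# Each batch could then be inserted with: chroma_collection.add(**batch)
```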
__generate_fulltext_index
__generate_fulltext_index(index_name: str)
Generates a full-text index named index_name using BM25s.
Args
- index_name (str): The name of the index to be generated.
Returns
Response: A Response object describing the result of the process.
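BM25s handles tokenization, indexing, and scoring itself. The pure-Python sketch below only illustrates the kind of ranking a BM25 index computes. It is not the BM25s API: the function bm25_scores, its parameters k1 and b, and the Lucene-style IDF variant are assumptions chosen for illustration, and it expects a non-empty, pre-tokenized corpus.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Length normalization penalizes documents longer than average.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["table", "of", "city", "population"],
    ["sales", "by", "region"],
    ["city", "budget"],
]
scores = bm25_scores(["city"], docs)
```

Here the second document scores zero because it does not contain the query term, and the shorter of the two matching documents ranks higher because of length normalization.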
__insert_documents_to_fulltext_index
__insert_documents_to_fulltext_index(
index_id: int, table_ids: list | tuple, retriever: BM25
)
Inserts documents (related to the tables associated with table_ids) into a full-text index.
Args
- index_id (int): The database ID of the full-text index.
- table_ids (list[str] | tuple[str]): The IDs of tables to be indexed (to be exact, their content summaries & context/metadata).
- retriever (BM25): The full-text index.
Returns
Response: A Response object describing the result of the process.
__get_table_contexts
__get_table_contexts(table_id: str) -> list[tuple[str, str]]
Retrieves all contexts (metadata) associated with table_id from the database.
Args
- table_id (str): The ID of the table in the database.
Returns
list[tuple[str, str]]: The contexts and their associated IDs.
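The lookup described above can be sketched as a parameterized query that returns string pairs. This sketch uses sqlite3 from the standard library purely as a stand-in, since the database engine is not named here. The table name table_contexts, its columns, and the (context_id, context) tuple order are all hypothetical.

```python
import sqlite3


def get_table_contexts(
    conn: sqlite3.Connection, table_id: str
) -> list[tuple[str, str]]:
    """Return (context_id, context) pairs for one table (assumed schema)."""
    rows = conn.execute(
        "SELECT id, context FROM table_contexts WHERE table_id = ? ORDER BY id",
        (table_id,),
    ).fetchall()
    return [(str(row[0]), str(row[1])) for row in rows]


# Demo against an in-memory database that uses the assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_contexts (id TEXT, table_id TEXT, context TEXT)")
conn.executemany(
    "INSERT INTO table_contexts VALUES (?, ?, ?)",
    [
        ("c1", "t1", "Published by the city council"),
        ("c2", "t1", "Covers 2010 to 2020"),
        ("c3", "t2", "Quarterly sales figures"),
    ],
)
contexts = get_table_contexts(conn, "t1")
```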
__merge_contexts
__merge_contexts(contexts: list[tuple[str, str]]) -> list[str]
__get_content_summaries
__get_content_summaries(
table_id: str, summary_type: SummaryType
) -> list[tuple[str, str]]
Retrieves all content summaries associated with table_id from the database.
Args
- table_id (str): The ID of the table in the database.
- summary_type (SummaryType): The type of summaries to be retrieved (either column narration or row sample).
Returns
list[tuple[str, str]]: The content summaries and their associated IDs.
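To illustrate how summary_type narrows the result, the sketch below models the two summary kinds as an enum and filters in-memory rows by table and type. The member names and values of this SummaryType, the row layout, and the helper filter_summaries are assumptions; the real enum and storage format may differ.

```python
from enum import Enum


class SummaryType(Enum):
    """Hypothetical stand-in for the library's SummaryType."""

    COLUMN_NARRATION = "column_narration"
    ROW_SAMPLE = "row_sample"


def filter_summaries(
    rows: list[tuple[str, str, str, str]],
    table_id: str,
    summary_type: SummaryType,
) -> list[tuple[str, str]]:
    """Keep (summary_id, summary) pairs matching the table and summary type.

    Each row is assumed to be (summary_id, table_id, type_value, summary).
    """
    return [
        (summary_id, summary)
        for summary_id, row_table_id, type_value, summary in rows
        if row_table_id == table_id and type_value == summary_type.value
    ]


rows = [
    ("s1", "t1", "column_narration", "Column city holds the names of cities."),
    ("s2", "t1", "row_sample", "city: Chicago | population: 2700000"),
    ("s3", "t2", "column_narration", "Column region holds sales regions."),
]
narrations = filter_summaries(rows, "t1", SummaryType.COLUMN_NARRATION)
```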