Summarizer

Summarizes indexed tables in the database so they can be represented for retrieval purposes.
Attributes

- pipe (OpenAI | TextGenerationPipeline): The LLM pipeline for inference.
- embedding_model (OpenAI | SentenceTransformer): The model used for text embeddings.
- db_path (str): Path to the database file for retrieving content summaries and context.
- MAX_LLM_BATCH_SIZE (int): The upper bound on the batch size explored dynamically for LLM inference.
- EMBEDDING_MAX_TOKENS (int): The maximum number of tokens the embedding model supports (hard-coded to 512 for local models and 8191 for OpenAI models).
__init__

__init__(
    llm: OpenAI | TextGenerationPipeline,
    embed_model: OpenAI | SentenceTransformer,
    db_path: str,
    max_llm_batch_size: int = 50,
)
summarize

summarize(table_id: str | None = None) -> str

Summarizes the contents of all unsummarized tables, or of a specific table if table_id is provided.

Args

- table_id (str | None): The specific table ID to be summarized. If omitted, all unsummarized tables are summarized.

Returns

str: A JSON string representing the result of the process (Response).
__summarize_table_by_id

__summarize_table_by_id(table_id: str) -> list[str]

Summarizes the contents of a single table, table_id.

Args

- table_id (str): The specific table ID to be summarized.

Returns

list[str]: The database IDs of the resulting summaries for table table_id.
__batch_summarize_tables

__batch_summarize_tables(table_ids: list[str]) -> list[str]

Summarizes the contents of the tables table_ids.

Args

- table_ids (list[str]): The specific table IDs to be summarized.

Returns

list[str]: The database IDs of the resulting summaries for the tables.
__generate_column_narrations

__generate_column_narrations(df: DataFrame) -> list[str]

Generates column narrations for a single dataframe (for quick local testing). This method may be removed in the future.
__batch_generate_column_narrations

__batch_generate_column_narrations(
    table_ids: list[str],
) -> dict[str, list[str]]

Generates column narrations for the tables table_ids.

Args

- table_ids (list[str]): The specific table IDs to be narrated.

Returns

dict[str, list[str]]: The column narrations of the tables.
__get_col_narration_prompt

__get_col_narration_prompt(columns: str, column: str) -> str

Returns the prompt to narrate a column of a table given the other columns in the table.

Args

- columns (str): A concatenation of the table's columns.
- column (str): A specific column (part of columns) to be narrated.

Returns

str: The prompt to narrate column.
__get_optimal_batch_size

__get_optimal_batch_size(conversations: list[dict[str, str]]) -> int

Uses binary search to find the optimal batch size (bounded between 1 and MAX_LLM_BATCH_SIZE) for running conversations through the LLM pipeline.

Args

- conversations (list[dict[str, str]]): The list of prompts to narrate columns of tables.

Returns

int: The optimal batch size.
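Such a bounded binary search can be sketched as follows, assuming a monotone fit check (if a batch size fits in memory, every smaller one does too). The `fits` callable is a hypothetical stand-in for `__is_fit_in_memory`; the actual implementation is not shown in this documentation.

```python
def optimal_batch_size(fits, max_batch_size):
    """Binary-search the largest batch size in [1, max_batch_size] for which
    fits(batch_size) is True. Assumes fits() is monotone: once a size fails,
    every larger size fails too."""
    lo, hi, best = 1, max_batch_size, 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best = mid      # mid fits; try a larger batch
            lo = mid + 1
        else:
            hi = mid - 1    # mid is too large; shrink the range
    return best
```

The search needs only O(log max_batch_size) fit probes instead of trying every batch size, which matters when each probe is a real inference pass.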
__is_fit_in_memory

__is_fit_in_memory(
    conversations: list[dict[str, str]], batch_size: int
) -> bool

Checks whether conversations fit in memory with the given batch_size when running inference through the LLM pipeline.

Args

- conversations (list[dict[str, str]]): The list of prompts to narrate columns of tables.
- batch_size (int): The specific batch size value to test.

Returns

bool: Whether conversations fit in memory.
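A common way to implement such a check is to attempt one batch and treat an out-of-memory failure as "does not fit". This is a hypothetical sketch: `run_batch` stands in for a single LLM pipeline call, and a real implementation would catch the framework's specific OOM exception (e.g. `torch.cuda.OutOfMemoryError`) rather than the generic `MemoryError` used here.

```python
def is_fit_in_memory(run_batch, conversations, batch_size):
    """Probe whether one batch of the given size can be processed without
    exhausting memory. run_batch is a hypothetical single-inference callable;
    an OOM-style exception is interpreted as 'does not fit'."""
    try:
        run_batch(conversations[:batch_size])
        return True
    except MemoryError:
        return False
```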
__get_special_indices

__get_special_indices(prompts: list[str], batch_size: int) -> list[int]

Sorts prompts in a specific order to balance the memory load across batches of LLM inferences.

Args

- prompts (list[str]): The list of prompts to narrate columns of tables.
- batch_size (int): The optimal batch size value to be used.

Returns

list[int]: The "special indices" for prompts given the batch size.
__block_column_narrations

__block_column_narrations(column_narrations: list[str]) -> list[str]

Groups column narrations into blocks, packing as many narrations per block as possible to reduce the number of embeddings that need to be produced.

Args

- column_narrations (list[str]): The list of column narrations for a set of tables.

Returns

list[str]: The blocked version of column_narrations.
__generate_row_samples

__generate_row_samples(df: DataFrame) -> list[str]

Generates row samples for the table df. The process is deterministic because the sampling seed is fixed to 0.

Args

- df (pd.DataFrame): The specific table to sample rows from.

Returns

list[str]: The sampled rows.
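The deterministic sampling described above can be reproduced with pandas by pinning `random_state=0`. This sketch is an assumption about the mechanism; the sample count `n` and the `col=value` rendering are hypothetical choices for illustration.

```python
import pandas as pd

def row_samples(df, n=3):
    """Sample up to n rows deterministically (seed fixed to 0) and render
    each row as a 'col=value' string."""
    sampled = df.sample(n=min(n, len(df)), random_state=0)
    return [
        ", ".join(f"{col}={val}" for col, val in row.items())
        for _, row in sampled.iterrows()
    ]
```

Fixing the seed means repeated indexing runs produce identical samples, so re-summarizing a table never churns its stored embeddings.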
__block_row_samples

__block_row_samples(row_samples: list[str]) -> list[str]

Groups row samples into blocks, packing as many samples per block as possible to reduce the number of embeddings that need to be produced.

Args

- row_samples (list[str]): The list of row samples for a set of tables.

Returns

list[str]: The blocked version of row_samples.