Summarizer

Summarizes indexed tables in the database so they can be represented for retrieval purposes.
Attributes

- pipe (OpenAI | TextGenerationPipeline): The LLM pipeline for inference.
- embedding_model (OpenAI | SentenceTransformer): The model used for text embeddings.
- db_path (str): Path to the database file for retrieving content summaries and context.
- MAX_LLM_BATCH_SIZE (int): The upper bound on the batch size explored dynamically for LLM inference.
- EMBEDDING_MAX_TOKENS (int): The maximum number of tokens the embedding model supports (hard-coded to 512 for local models and 8191 for OpenAI models).
__init__

__init__(
    llm: OpenAI | TextGenerationPipeline,
    embed_model: OpenAI | SentenceTransformer,
    db_path: str,
    max_llm_batch_size: int = 50,
)
summarize

summarize(table_id: str | None = None) -> str

Summarizes the contents of all unsummarized tables, or of a specific table if table_id is provided.

Args

- table_id (str | None): The specific table ID to be summarized. If omitted, all unsummarized tables are summarized.

Returns

str: A JSON string representing the result of the process (Response).
__summarize_table_by_id

__summarize_table_by_id(table_id: str) -> list[str]

Summarizes the contents of a single table, table_id.

Args

- table_id (str): The specific table ID to be summarized.

Returns

list[str]: The database IDs of the resulting summaries for table table_id.
__batch_summarize_tables

__batch_summarize_tables(table_ids: list[str]) -> list[str]

Summarizes the contents of the tables table_ids.

Args

- table_ids (list[str]): The specific table IDs to be summarized.

Returns

list[str]: The database IDs of the resulting summaries for the tables.
__generate_column_narrations

__generate_column_narrations(df: DataFrame) -> list[str]

Generates column narrations for a single dataframe (for quick local testing). This method may be removed in the future.
__batch_generate_column_narrations

__batch_generate_column_narrations(
    table_ids: list[str],
) -> dict[str, list[str]]

Generates column narrations for the tables table_ids.

Args

- table_ids (list[str]): The specific table IDs to be narrated.

Returns

dict[str, list[str]]: The column narrations of the tables.
__get_col_narration_prompt

__get_col_narration_prompt(columns: str, column: str) -> str

Returns the prompt to narrate a column of a table given the other columns in the table.

Args

- columns (str): A concatenation of the table's columns.
- column (str): A specific column (part of columns) to be narrated.

Returns

str: The prompt to narrate column.
__get_optimal_batch_size

__get_optimal_batch_size(conversations: list[dict[str, str]]) -> int

Uses binary search to find the optimal batch size (bounded between 1 and MAX_LLM_BATCH_SIZE) for running conversations through the LLM pipeline.

Args

- conversations (list[dict[str, str]]): The list of prompts to narrate columns of tables.

Returns

int: The optimal batch size.
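Such a bounded binary search can be sketched as follows, assuming a monotone fit check (if a batch size fits in memory, every smaller one does too). The `fits` callable is a hypothetical stand-in for `__is_fit_in_memory`; the actual implementation is not shown in this documentation.

```python
def optimal_batch_size(fits, max_batch_size):
    """Binary-search the largest batch size in [1, max_batch_size] for which
    fits(batch_size) is True. Assumes fits() is monotone: once a size fails,
    every larger size fails too."""
    lo, hi, best = 1, max_batch_size, 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best = mid      # mid fits; try a larger batch
            lo = mid + 1
        else:
            hi = mid - 1    # mid is too large; shrink the range
    return best
```

The search needs only O(log max_batch_size) fit probes instead of trying every batch size, which matters when each probe is a real inference pass.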
__is_fit_in_memory

__is_fit_in_memory(
    conversations: list[dict[str, str]], batch_size: int
) -> bool

Checks whether conversations fit in memory with the given batch_size when running inference through the LLM pipeline.

Args

- conversations (list[dict[str, str]]): The list of prompts to narrate columns of tables.
- batch_size (int): The specific batch size value to test.

Returns

bool: Whether conversations fit in memory.
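A common way to implement such a check is to attempt one batch and treat an out-of-memory failure as "does not fit". This is a hypothetical sketch: `run_batch` stands in for a single LLM pipeline call, and a real implementation would catch the framework's specific OOM exception (e.g. `torch.cuda.OutOfMemoryError`) rather than the generic `MemoryError` used here.

```python
def is_fit_in_memory(run_batch, conversations, batch_size):
    """Probe whether one batch of the given size can be processed without
    exhausting memory. run_batch is a hypothetical single-inference callable;
    an OOM-style exception is interpreted as 'does not fit'."""
    try:
        run_batch(conversations[:batch_size])
        return True
    except MemoryError:
        return False
```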
__get_special_indices

__get_special_indices(prompts: list[str], batch_size: int) -> list[int]

Sorts prompts in a specific order to balance the memory load across batches of LLM inferences.

Args

- prompts (list[str]): The list of prompts to narrate columns of tables.
- batch_size (int): The optimal batch size value to be used.

Returns

list[int]: The "special indices" for prompts given the batch size.
__block_column_narrations

__block_column_narrations(column_narrations: list[str]) -> list[str]

Groups column narrations into blocks, packing as many narrations per block as possible to reduce the number of embeddings that need to be produced.

Args

- column_narrations (list[str]): The list of column narrations for a set of tables.

Returns

list[str]: The blocked version of column_narrations.
__generate_row_samples

__generate_row_samples(df: DataFrame) -> list[str]

Generates row samples for the table df. The process is deterministic because the sampling seed is fixed to 0.

Args

- df (pd.DataFrame): The specific table to sample rows from.

Returns

list[str]: The sampled rows.
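The deterministic sampling described above can be reproduced with pandas by pinning `random_state=0`. This sketch is an assumption about the mechanism; the sample count `n` and the `col=value` rendering are hypothetical choices for illustration.

```python
import pandas as pd

def row_samples(df, n=3):
    """Sample up to n rows deterministically (seed fixed to 0) and render
    each row as a 'col=value' string."""
    sampled = df.sample(n=min(n, len(df)), random_state=0)
    return [
        ", ".join(f"{col}={val}" for col, val in row.items())
        for _, row in sampled.iterrows()
    ]
```

Fixing the seed means repeated indexing runs produce identical samples, so re-summarizing a table never churns its stored embeddings.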
__block_row_samples

__block_row_samples(row_samples: list[str]) -> list[str]

Groups row samples into blocks, packing as many samples per block as possible to reduce the number of embeddings that need to be produced.

Args

- row_samples (list[str]): The list of row samples for a set of tables.

Returns

list[str]: The blocked version of row_samples.