Overview

Pneuma

The entry point of Pneuma, which combines all modules for LLM-based table discovery.

This class provides end-to-end methods, from indexing tables (and their metadata, if any) to retrieving tables for user queries.
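The end-to-end flow can be sketched as a small helper that calls the methods documented on this page in order. The helper name discover_tables and its parameters are hypothetical and not part of Pneuma. It accepts any object that exposes these methods, and it assumes that leaving table_id and table_ids at their defaults processes every registered table.

```python
import json


def discover_tables(pneuma, data_path, creator, index_name, question, k=1):
    """Hypothetical helper: register, summarize, index, then query.

    `pneuma` is any object exposing the methods documented on this page.
    """
    # Initialize Pneuma through its Registrar module.
    pneuma.setup()
    # Register the tables found at data_path.
    pneuma.add_tables(path=data_path, creator=creator)
    # Summarize tables (table_id left at its default).
    pneuma.summarize()
    # Build the hybrid index (table_ids left at its default).
    pneuma.generate_index(index_name=index_name)
    # Query the index; the result is documented as a JSON string.
    response = pneuma.query_index(index_name=index_name, queries=question, k=k)
    return json.loads(response)
```

With a constructed Pneuma instance passed as pneuma, the return value is the decoded response of the final query_index call.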

Attributes

out_path (str): The output folder of Pneuma.
db_path (str): The database path within the output folder of Pneuma.
index_location (str): The index path within the output folder of Pneuma.
hf_token (str): A Hugging Face User Access Token.
openai_api_key (str): An OpenAI API key.
use_local_model (bool): Whether to use local models or third-party models (currently, only OpenAI models are supported as the third-party LLM and embedding model).
llm_path (str): The path or name of a local LLM from Hugging Face.
embed_path (str): The path or name of a local embedding model from Hugging Face.
max_llm_batch_size (int): Maximum batch size for the dynamic batch size selector to explore.
registrar (Registrar): The dataset registration module.
summarizer (Summarizer): The dataset summarizer module.
index_generator (IndexGenerator): The index generator module.
query_processor (QueryProcessor): The query processor module.
llm (OpenAI | TextGenerationPipeline): The actual LLM (lazily initialized).
embed_model (OpenAI | SentenceTransformer): The actual embedding model (lazily initialized).
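The "lazily initialized" note on llm and embed_model means the model is loaded on first use, not when the object is constructed. A minimal sketch of that pattern follows. The LazyModel class and its loader argument are hypothetical illustrations, not Pneuma's actual implementation.

```python
class LazyModel:
    """Defers an expensive load until the model is first requested."""

    def __init__(self, loader):
        self._loader = loader  # zero-argument callable that builds the model
        self._model = None     # nothing is loaded at construction time

    def get(self):
        if self._model is None:
            # First access: run the loader once and cache the result.
            self._model = self._loader()
        # Later accesses reuse the cached object.
        return self._model
```

The benefit of this design is that constructing the object stays cheap, and steps that never touch a model (for example, registering tables) do not pay the loading cost.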

__init__

__init__(
    out_path: Optional[str] = None,
    hf_token: Optional[str] = None,
    openai_api_key: Optional[str] = None,
    use_local_model: bool = True,
    llm_path: str = "Qwen/Qwen2.5-7B-Instruct",
    embed_path: str = "BAAI/bge-base-en-v1.5",
    max_llm_batch_size: int = 50,
)

setup

setup() -> str

Sets up Pneuma through its Registrar module.

add_tables

add_tables(
    path: str,
    creator: str,
    source: str = "file",
    s3_region: str = None,
    s3_access_key: str = None,
    s3_secret_access_key: str = None,
    accept_duplicates: bool = False,
) -> str

Registers tables into the database by utilizing the Registrar module.

Returns

str: A JSON string representing the result of the process (Response).
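Because the result is returned as a JSON string, callers usually decode it before inspecting it. The sketch below assumes the Response serializes to a JSON object. The helper decode_response and the field names in the example payload are invented for illustration and are not confirmed by this page.

```python
import json


def decode_response(raw: str) -> dict:
    """Decode a JSON string into a dict, rejecting non-object payloads."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        # Assumption: a Response is serialized as a JSON object.
        raise ValueError("expected a JSON object")
    return data


# Hypothetical payload; these field names are assumptions for illustration.
example = '{"status": "SUCCESS", "message": "2 tables registered"}'
print(decode_response(example)["status"])
```

The same decoding step applies to the other methods on this page that return a JSON string.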

add_metadata

add_metadata(metadata_path: str, table_id: str = '') -> str

Registers metadata into the database by utilizing the Registrar module.

Returns

str: A JSON string representing the result of the process (Response).

summarize

summarize(table_id: str = None) -> str

Summarizes the contents of all unsummarized tables, or of a specific table if table_id is provided, using the Summarizer module.

Args

table_id (str): The specific table ID to be summarized.

Returns

str: A JSON string representing the result of the process (Response).

generate_index

generate_index(
    index_name: str, table_ids: list[str] | tuple[str] = None
) -> str

Generates a hybrid index named index_name for the given table_ids by utilizing the IndexGenerator module.

Args

index_name (str): The name of the index to be generated.
table_ids (list[str] | tuple[str]): The IDs of tables to be indexed (to be exact, their content summaries & context/metadata).

Returns

str: A JSON string representing the result of the process (Response).

query_index

query_index(
    index_name: str,
    queries: list[str] | str,
    k: int = 1,
    n: int = 5,
    alpha: float = 0.5,
) -> str

Retrieves tables for the given queries against the index index_name by utilizing the QueryProcessor module.

Args

index_name (str): The name of the index to be retrieved against.
queries (str | list[str]): The query or list of queries to be executed.
k (int): The number of documents associated with the tables to be retrieved.
n (int): The multiplicative factor of k to pool more relevant documents for the hybrid retrieval process.
alpha (float): The weighting factor of the vector and full-text retrievers within a hybrid index. Lower alpha gives more weight to the vector retriever.

Returns

str: A JSON string representing the result of the process (Response).
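To make the roles of k, n, and alpha concrete, here is a minimal sketch of one common way to fuse two retrievers. It is an illustration under stated assumptions, not the QueryProcessor's actual algorithm: each retriever contributes its top k * n candidates, scores are min-max normalized, and the combined score is alpha * full_text + (1 - alpha) * vector, which matches the statement that a lower alpha favors the vector retriever. The function name and the score dictionaries are hypothetical.

```python
def hybrid_top_k(vector_scores, fulltext_scores, k=1, n=5, alpha=0.5):
    """Fuse two {doc_id: score} dicts and return the k best doc IDs."""
    pool = k * n  # each retriever contributes up to k * n candidates

    def top(scores):
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return dict(ranked[:pool])

    def normalize(scores):
        # Min-max normalization so both retrievers share a [0, 1] scale.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {d: (s - lo) / span if span else 1.0 for d, s in scores.items()}

    vec = normalize(top(vector_scores))
    txt = normalize(top(fulltext_scores))

    # Assumed weighting: lower alpha gives more weight to the vector score.
    combined = {
        d: alpha * txt.get(d, 0.0) + (1 - alpha) * vec.get(d, 0.0)
        for d in set(vec) | set(txt)
    }
    # Sort by descending score; break ties by doc ID for determinism.
    return sorted(combined, key=lambda d: (-combined[d], d))[:k]
```

In this sketch, alpha=0 ranks purely by the vector scores and alpha=1 purely by the full-text scores, while a larger n widens the candidate pool before the final top k is chosen.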

__hf_login

__hf_login()

Logs into Hugging Face if a token is provided.

__init_registrar

__init_registrar()

Initializes the Registrar module.

__init_summarizer

__init_summarizer()

Initializes the Summarizer module.

__init_index_generator

__init_index_generator()

Initializes the IndexGenerator module.

__init_query_processor

__init_query_processor()

Initializes the QueryProcessor module.

__init_llm

__init_llm()

Initializes the LLM.

__init_embed_model

__init_embed_model()

Initializes the embedding model.