Overview

Pneuma

The entry point of Pneuma, combining all modules for the purpose of LLM-based table discovery.

This class provides end-to-end methods from indexing tables (and their metadata, if any) to retrieving tables given users' queries.
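The end-to-end flow described above can be sketched as a single helper. This is a minimal illustration, not part of Pneuma: `build_and_query` is a hypothetical name, and the function assumes it receives an already constructed Pneuma instance. It calls only methods documented on this page, leaves every optional parameter at its default, and assumes each returned JSON string decodes to an object.

```python
import json
from typing import Any


def build_and_query(pneuma: Any, data_path: str, creator: str,
                    index_name: str, query: str) -> dict:
    """Hypothetical helper: run the documented methods in order.

    `pneuma` is assumed to be an initialized Pneuma instance. Only methods
    and parameter names listed in this reference are used.
    """
    pneuma.setup()                                       # set up Pneuma through its Registrar
    pneuma.add_tables(path=data_path, creator=creator)   # register tables (source defaults to "file")
    pneuma.summarize()                                   # no table_id: summarize all unsummarized tables
    pneuma.generate_index(index_name=index_name)         # table_ids left at its default
    raw = pneuma.query_index(index_name=index_name, queries=query, k=1)
    return json.loads(raw)                               # query_index returns a JSON string
```

Calling `add_metadata` between `add_tables` and `summarize` would fit the same flow when metadata files are available.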

Attributes
  • out_path (str): The output folder of Pneuma.
  • db_path (str): The database path within the output folder of Pneuma.
  • index_location (str): The index path within the output folder of Pneuma.
  • hf_token (str): A Hugging Face User Access Token.
  • openai_api_key (str): An OpenAI API key.
  • use_local_model (bool): Whether to use local models or third-party models (for now, only OpenAI models, serving as both the LLM and the embedding model).
  • llm_path (str): The path or name of a local LLM from HuggingFace.
  • embed_path (str): The path or name of a local embedding model from HuggingFace.
  • max_llm_batch_size (int): Maximum batch size for the dynamic batch size selector to explore.
  • registrar (Registrar): The dataset registration module.
  • summarizer (Summarizer): The dataset summarizer module.
  • index_generator (IndexGenerator): The index generator module.
  • query_processor (QueryProcessor): The query processor module.
  • llm (OpenAI | TextGenerationPipeline): The actual LLM (lazily initialized).
  • embed_model (OpenAI | SentenceTransformer): The actual embedding model (lazily initialized).
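The "lazily initialized" note on llm and embed_model means the model is only loaded when it is first needed. The pattern can be sketched generically as follows. This is an illustration of lazy initialization in general, not Pneuma's actual code; `LazyResource` and `factory` are hypothetical names.

```python
from typing import Callable, Optional


class LazyResource:
    """Builds an expensive object on first access and reuses it afterwards."""

    def __init__(self, factory: Callable[[], object]):
        self._factory = factory              # called at most once
        self._value: Optional[object] = None
        self._loaded = False

    def get(self) -> object:
        if not self._loaded:                 # first access: build and cache
            self._value = self._factory()
            self._loaded = True
        return self._value                   # later accesses: return cached object
```

The benefit of this design is that constructing the entry-point object stays cheap, and a workflow that never needs the LLM never pays for loading it.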

__init__

__init__(
    out_path: Optional[str] = None,
    hf_token: Optional[str] = None,
    openai_api_key: Optional[str] = None,
    use_local_model: bool = True,
    llm_path: str = "Qwen/Qwen2.5-7B-Instruct",
    embed_path: str = "BAAI/bge-base-en-v1.5",
    max_llm_batch_size: int = 50,
)

setup

setup() -> str

Sets up Pneuma through its Registrar module.

add_tables

add_tables(
    path: str,
    creator: str,
    source: str = "file",
    s3_region: str = None,
    s3_access_key: str = None,
    s3_secret_access_key: str = None,
    accept_duplicates: bool = False,
) -> str

Registers tables into the database by utilizing the Registrar module.

Returns
  • str: A JSON string representing the result of the process (Response).
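Because this method, like the others on this page, returns a JSON string rather than a Python object, callers will usually decode it first. The sketch below shows one way to do that with the standard library. `decode_response` is a hypothetical helper, and the field names in the example payload are invented for illustration; the real Response fields are not specified here and may differ.

```python
import json


def decode_response(raw: str) -> dict:
    """Decode a JSON string returned by a Pneuma method into a dict.

    Assumes the Response serializes to a JSON object; raises ValueError otherwise.
    """
    decoded = json.loads(raw)
    if not isinstance(decoded, dict):
        raise ValueError("expected a JSON object")
    return decoded


# Hypothetical payload for illustration only; not the documented Response schema.
example = '{"status": "SUCCESS", "message": "tables registered"}'
result = decode_response(example)
print(result["status"])
```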

add_metadata

add_metadata(metadata_path: str, table_id: str = '') -> str

Registers metadata into the database by utilizing the Registrar module.

Returns
  • str: A JSON string representing the result of the process (Response).

summarize

summarize(table_id: str = None) -> str

Summarizes the contents of a specific table if table_id is provided, or of all unsummarized tables otherwise, using the Summarizer module.

Args
  • table_id (str): The specific table ID to be summarized.
Returns
  • str: A JSON string representing the result of the process (Response).

generate_index

generate_index(
    index_name: str, table_ids: list[str] | tuple[str] = None
) -> str

Generates a hybrid index named index_name for the given table_ids by utilizing the IndexGenerator module.

Args
  • index_name (str): The name of the index to be generated.
  • table_ids (list[str] | tuple[str]): The IDs of tables to be indexed (to be exact, their content summaries & context/metadata).
Returns
  • str: A JSON string representing the result of the process (Response).

query_index

query_index(
    index_name: str,
    queries: list[str] | str,
    k: int = 1,
    n: int = 5,
    alpha: float = 0.5,
) -> str

Retrieves tables for the given queries against the index index_name by utilizing the QueryProcessor module.

Args
  • index_name (str): The name of the index to query.
  • queries (str | list[str]): The query or list of queries to be executed.
  • k (int): The number of documents associated with the tables to be retrieved.
  • n (int): The multiplicative factor of k to pool more relevant documents for the hybrid retrieval process.
  • alpha (float): The weighting factor of the vector and full-text retrievers within a hybrid index. Lower alpha gives more weight to the vector retriever.
Returns
  • str: A JSON string representing the result of the process (Response).
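How k, n, and alpha interact can be illustrated with a small standalone sketch. It is not the QueryProcessor implementation. It assumes that each retriever contributes its top k * n documents to a candidate pool, that scores are already normalized to a comparable range, and that the final score is the linear blend alpha * full-text + (1 - alpha) * vector, which matches the statement that a lower alpha favors the vector retriever. `hybrid_top_k` and the score dictionaries are hypothetical.

```python
from typing import Dict, List


def hybrid_top_k(vector_scores: Dict[str, float],
                 fulltext_scores: Dict[str, float],
                 k: int = 1, n: int = 5, alpha: float = 0.5) -> List[str]:
    """Return the k best document IDs under an assumed linear score blend."""
    pool_size = k * n  # n widens the candidate pool beyond the final k

    def top(scores: Dict[str, float]) -> List[str]:
        return sorted(scores, key=scores.get, reverse=True)[:pool_size]

    # Pool candidates from both retrievers.
    candidates = set(top(vector_scores)) | set(top(fulltext_scores))

    def combined(doc: str) -> float:
        # Assumed blend: lower alpha puts more weight on the vector score.
        return (alpha * fulltext_scores.get(doc, 0.0)
                + (1 - alpha) * vector_scores.get(doc, 0.0))

    # Sort by blended score, breaking ties by ID for determinism.
    return sorted(candidates, key=lambda d: (-combined(d), d))[:k]


vector = {"a": 0.9, "b": 0.1}
fulltext = {"a": 0.1, "b": 0.9}
print(hybrid_top_k(vector, fulltext, k=1, alpha=0.2))  # vector-leaning
print(hybrid_top_k(vector, fulltext, k=1, alpha=0.8))  # full-text-leaning
```

With these toy scores, the vector-leaning call favors document "a" and the full-text-leaning call favors document "b".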

__hf_login

__hf_login()

Logs into Hugging Face if a token is provided.

__init_registrar

__init_registrar()

Initializes the Registrar module.

__init_summarizer

__init_summarizer()

Initializes the Summarizer module.

__init_index_generator

__init_index_generator()

Initializes the IndexGenerator module.

__init_query_processor

__init_query_processor()

Initializes the QueryProcessor module.

__init_llm

__init_llm()

Initializes the LLM.

__init_embed_model

__init_embed_model()

Initializes the embedding model.