Overview
Pneuma
The entry point of Pneuma, combining all modules for the purpose of LLM-based
table discovery.
This class provides end-to-end methods from indexing tables (and their metadata, if any) to retrieving tables given users' queries.
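The end-to-end flow described above can be sketched as a small helper. This is a minimal sketch, not part of Pneuma: `run_pipeline` is a hypothetical name, and it assumes the object passed in exposes the methods documented on this page with the signatures shown here. The order of the calls (set up, register, summarize, index, query) follows the description above.

```python
from typing import Any


def run_pipeline(pneuma: Any, tables_path: str, creator: str,
                 index_name: str, query: str, k: int = 1) -> dict[str, str]:
    """Call the documented methods in order and return each raw JSON response.

    `pneuma` is assumed to be an initialized Pneuma instance (or any object
    with the same methods); this helper itself is hypothetical.
    """
    responses: dict[str, str] = {}
    # 1. Set up Pneuma (handled by the Registrar module).
    responses["setup"] = pneuma.setup()
    # 2. Register the tables found at `tables_path`.
    responses["add_tables"] = pneuma.add_tables(path=tables_path, creator=creator)
    # 3. Summarize every table that has not been summarized yet.
    responses["summarize"] = pneuma.summarize()
    # 4. Build a hybrid index (table_ids is left at its default).
    responses["generate_index"] = pneuma.generate_index(index_name=index_name)
    # 5. Retrieve tables for the query from that index.
    responses["query_index"] = pneuma.query_index(
        index_name=index_name, queries=query, k=k
    )
    return responses
```

Each value in the returned dictionary is the JSON string produced by the corresponding method, so a failure in any step can be inspected afterwards.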
Attributes
- out_path (str): The output folder of Pneuma.
- db_path (str): The database path within the output folder of Pneuma.
- index_location (str): The index path within the output folder of Pneuma.
- hf_token (str): A Hugging Face User Access Token.
- openai_api_key (str): An OpenAI API key.
- use_local_model (bool): Whether to use local models or third-party models (for now, the only third-party option is OpenAI, for both the LLM and the embedding model).
- llm_path (str): The path or name of a local LLM from Hugging Face.
- embed_path (str): The path or name of a local embedding model from Hugging Face.
- max_llm_batch_size (int): The maximum batch size that the dynamic batch size selector explores.
- registrar (Registrar): The dataset registration module.
- summarizer (Summarizer): The dataset summarizer module.
- index_generator (IndexGenerator): The index generator module.
- query_processor (QueryProcessor): The query processor module.
- llm (OpenAI | TextGenerationPipeline): The actual LLM (lazily initialized).
- embed_model (OpenAI | SentenceTransformer): The actual embedding model (lazily initialized).
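The llm and embed_model attributes are described as lazily initialized, meaning the model is only built when it is first needed. The following is a generic sketch of that pattern and not Pneuma's actual code: the `LazyModel` class and its `loader` argument are hypothetical names used only for illustration.

```python
from typing import Callable


class LazyModel:
    """Builds a model the first time it is requested and then reuses it."""

    def __init__(self, loader: Callable[[], object]):
        # Hypothetical factory; stands in for an expensive model load.
        self._loader = loader
        self._model: object = None
        self.load_count = 0  # number of times the loader has actually run

    def get(self) -> object:
        # Run the loader only on the first call; later calls reuse the result.
        if self.load_count == 0:
            self._model = self._loader()
            self.load_count += 1
        return self._model


lazy_llm = LazyModel(lambda: "stand-in for a loaded model")
first = lazy_llm.get()   # triggers the load
second = lazy_llm.get()  # reuses the already loaded object
```

The benefit of this design is that creating the entry-point object stays cheap, and a caller who only registers tables never pays the cost of loading an LLM.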
__init__
__init__(
out_path: Optional[str] = None,
hf_token: Optional[str] = None,
openai_api_key: Optional[str] = None,
use_local_model: bool = True,
llm_path: str = "Qwen/Qwen2.5-7B-Instruct",
embed_path: str = "BAAI/bge-base-en-v1.5",
max_llm_batch_size: int = 50,
)
setup
setup() -> str
Sets up Pneuma through its Registrar module.
add_tables
add_tables(
path: str,
creator: str,
source: str = "file",
s3_region: str = None,
s3_access_key: str = None,
s3_secret_access_key: str = None,
accept_duplicates: bool = False,
) -> str
Registers tables into the database by utilizing the Registrar module.
Returns
str: A JSON string representing the result of the process (Response).
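Because the result comes back as a JSON string rather than an object, callers usually decode it before inspecting it. The helper below is a minimal sketch using only the standard library; `parse_response` is a hypothetical name, and the `status` and `message` keys in the sample payload are assumptions made for illustration, since this page does not list the fields of Response.

```python
import json


def parse_response(raw: str) -> dict:
    """Decode a JSON string and make sure it holds a JSON object."""
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed


# Hypothetical payload for illustration only; the real Response fields may differ.
sample = '{"status": "SUCCESS", "message": "2 tables registered"}'
result = parse_response(sample)
print(result["message"])
```

The same helper applies to every method on this page that returns a Response as a JSON string.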
add_metadata
add_metadata(metadata_path: str, table_id: str = '') -> str
Registers metadata into the database by utilizing the Registrar module.
Returns
str: A JSON string representing the result of the process (Response).
summarize
summarize(table_id: str = None) -> str
Summarizes the contents of all unsummarized tables, or of a specific table
if table_id is provided, by utilizing the Summarizer module.
Args
- table_id (str): The specific table ID to be summarized.
Returns
str: A JSON string representing the result of the process (Response).
generate_index
generate_index(
index_name: str, table_ids: list[str] | tuple[str] = None
) -> str
Generates a hybrid index named index_name for the given table_ids
by utilizing the IndexGenerator module.
Args
- index_name (str): The name of the index to be generated.
- table_ids (list[str] | tuple[str]): The IDs of tables to be indexed (to be exact, their content summaries & context/metadata).
Returns
str: A JSON string representing the result of the process (Response).
query_index
query_index(
index_name: str,
queries: list[str] | str,
k: int = 1,
n: int = 5,
alpha: int = 0.5,
) -> str
Retrieves tables for the given queries against the index index_name
by utilizing the QueryProcessor module.
Args
- index_name (str): The name of the index to be retrieved against.
- queries (str | list[str]): The query or list of queries to be executed.
- k (int): The number of documents associated with the tables to be retrieved.
- n (int): The multiplicative factor of k to pool more relevant documents for the hybrid retrieval process.
- alpha (float): The weighting factor of the vector and full-text retrievers within a hybrid index. Lower alpha gives more weight to the vector retriever.
Returns
str: A JSON string representing the result of the process (Response).
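To make the roles of k, n, and alpha concrete, here is a minimal sketch of one common way to combine two retrievers. It is an illustration under stated assumptions, not the QueryProcessor's actual algorithm: it assumes each retriever yields scores already normalized to the range 0 to 1, that each retriever contributes its top k * n documents to a shared pool, and that the final score is a linear blend in which a lower alpha favors the vector retriever. The function name `hybrid_top_k` is hypothetical.

```python
def hybrid_top_k(vector_scores: dict[str, float],
                 fulltext_scores: dict[str, float],
                 k: int = 1, n: int = 5,
                 alpha: float = 0.5) -> list[tuple[str, float]]:
    """Blend two score maps and return the k best (doc_id, score) pairs."""
    pool_size = k * n  # each retriever contributes up to k * n candidates

    def top(scores: dict[str, float]) -> list[str]:
        return sorted(scores, key=scores.get, reverse=True)[:pool_size]

    candidates = set(top(vector_scores)) | set(top(fulltext_scores))
    # Assumed blend: alpha weights full-text, (1 - alpha) weights vector,
    # so a lower alpha gives more weight to the vector retriever.
    combined = {
        doc: alpha * fulltext_scores.get(doc, 0.0)
        + (1 - alpha) * vector_scores.get(doc, 0.0)
        for doc in candidates
    }
    # Sort by descending score; break ties by document ID for determinism.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


vector = {"doc_a": 0.9, "doc_b": 0.1}
fulltext = {"doc_a": 0.1, "doc_b": 0.9}
favor_vector = hybrid_top_k(vector, fulltext, k=1, alpha=0.1)
favor_fulltext = hybrid_top_k(vector, fulltext, k=1, alpha=0.9)
```

With these toy scores, the low-alpha call ranks doc_a first because the vector retriever prefers it, while the high-alpha call ranks doc_b first.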
__hf_login
__hf_login()
Logs into Hugging Face if a token is provided.
__init_registrar
__init_registrar()
Initializes the Registrar module.
__init_summarizer
__init_summarizer()
Initializes the Summarizer module.
__init_index_generator
__init_index_generator()
Initializes the IndexGenerator module.
__init_query_processor
__init_query_processor()
Initializes the QueryProcessor module.
__init_llm
__init_llm()
Initializes the LLM.
__init_embed_model
__init_embed_model()
Initializes the embedding model.