Overview
Pneuma
The entry point of Pneuma, combining all modules for the purpose of LLM-based table discovery.
This class provides end-to-end methods from indexing tables (and their metadata, if any) to retrieving tables given users' queries.
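The end-to-end flow described above can be sketched as a single helper that calls the documented methods in order. This is a minimal sketch, not the project's own example: the helper name `discover_tables`, the stub class, the `"demo_user"` creator value, and the `"demo_index"` name are all hypothetical. Because this page does not show the import path for the class, the helper accepts any object that exposes the documented methods.

```python
from typing import Any


def discover_tables(
    pneuma: Any,
    tables_path: str,
    query: str,
    index_name: str = "demo_index",  # hypothetical index name
    k: int = 1,
) -> str:
    """Call the documented methods in order and return the raw query result."""
    pneuma.setup()
    # "demo_user" is a placeholder value for the required creator argument.
    pneuma.add_tables(path=tables_path, creator="demo_user")
    pneuma.summarize()
    pneuma.generate_index(index_name=index_name)
    return pneuma.query_index(index_name=index_name, queries=query, k=k)


class StubPneuma:
    """Hypothetical stand-in that records calls; swap in a real Pneuma instance."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def setup(self) -> str:
        self.calls.append("setup")
        return "{}"

    def add_tables(self, path: str, creator: str) -> str:
        self.calls.append("add_tables")
        return "{}"

    def summarize(self) -> str:
        self.calls.append("summarize")
        return "{}"

    def generate_index(self, index_name: str) -> str:
        self.calls.append("generate_index")
        return "{}"

    def query_index(self, index_name: str, queries: str, k: int = 1) -> str:
        self.calls.append("query_index")
        return "{}"


stub = StubPneuma()
result = discover_tables(stub, "data/tables", "Which table lists city populations?")
```

With a real instance, each call returns a JSON string, so a caller would normally inspect every result before moving on to the next step.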
Attributes
- out_path (str): The output folder of Pneuma.
- db_path (str): The database path within the output folder of Pneuma.
- index_location (str): The index path within the output folder of Pneuma.
- hf_token (str): A Hugging Face User Access Token.
- openai_api_key (str): An OpenAI API key.
- use_local_model (bool): Whether to use local models or third-party models (for now, only OpenAI models are supported as the third-party LLM and embedding model).
- llm_path (str): The path or name of a local LLM from Hugging Face.
- embed_path (str): The path or name of a local embedding model from Hugging Face.
- max_llm_batch_size (int): The maximum batch size for the dynamic batch size selector to explore.
- registrar (Registrar): The dataset registration module.
- summarizer (Summarizer): The dataset summarizer module.
- index_generator (IndexGenerator): The index generator module.
- query_processor (QueryProcessor): The query processor module.
- llm (OpenAI | TextGenerationPipeline): The actual LLM (lazily initialized).
- embed_model (OpenAI | SentenceTransformer): The actual embedding model (lazily initialized).
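The llm and embed_model attributes are described as lazily initialized. The general pattern can be sketched as follows; this is an illustration of lazy initialization in general, not Pneuma's actual code, and the names `LazyModel`, `fake_loader`, and `load_calls` are hypothetical.

```python
from typing import Callable, Optional


class LazyModel:
    """Defers an expensive load until the model is first requested."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader
        self._model: Optional[object] = None

    def get(self) -> object:
        # Load on first access only; later calls reuse the cached object.
        if self._model is None:
            self._model = self._loader()
        return self._model


load_calls: list[str] = []


def fake_loader() -> str:
    """Hypothetical loader standing in for an expensive model download."""
    load_calls.append("loaded")
    return "model"


lazy_llm = LazyModel(fake_loader)
calls_before_first_use = len(load_calls)
first = lazy_llm.get()
second = lazy_llm.get()
```

The benefit is that constructing the object stays cheap: a model is only loaded when a method that needs it is called.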
__init__
__init__(
out_path: Optional[str] = None,
hf_token: Optional[str] = None,
openai_api_key: Optional[str] = None,
use_local_model: bool = True,
llm_path: str = "Qwen/Qwen2.5-7B-Instruct",
embed_path: str = "BAAI/bge-base-en-v1.5",
max_llm_batch_size: int = 50,
)
setup
setup() -> str
Sets up Pneuma through its Registrar module.
add_tables
add_tables(
path: str,
creator: str,
source: str = "file",
s3_region: str = None,
s3_access_key: str = None,
s3_secret_access_key: str = None,
accept_duplicates: bool = False,
) -> str
Registers tables into the database by utilizing the Registrar module.
Returns
str: A JSON string representing the result of the process (Response).
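Since the methods on this page return a JSON string rather than an object, callers typically decode it before use. The sketch below shows one way to do that with the standard library. The payload and its field names ("status", "message", "data") are assumptions made for illustration; this page does not list the fields of Response.

```python
import json

# Hypothetical payload: the field names and values are assumed, not documented here.
raw_response = (
    '{"status": "SUCCESS", "message": "2 tables registered.", '
    '"data": {"table_ids": ["t1", "t2"]}}'
)


def parse_response(raw: str) -> dict:
    """Decode a JSON response string and check that it is a JSON object."""
    result = json.loads(raw)
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object")
    return result


parsed = parse_response(raw_response)
# .get() avoids a KeyError if the real Response uses different field names.
print(parsed.get("status"), parsed.get("message"))
```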
add_metadata
add_metadata(metadata_path: str, table_id: str = '') -> str
Registers metadata into the database by utilizing the Registrar module.
Returns
str: A JSON string representing the result of the process (Response).
summarize
summarize(table_id: str = None) -> str
Summarizes the contents of all unsummarized tables, or of a specific table if table_id is provided, using the Summarizer module.
Args
- table_id (str): The specific table ID to be summarized.
Returns
str: A JSON string representing the result of the process (Response).
generate_index
generate_index(
index_name: str, table_ids: list[str] | tuple[str] = None
) -> str
Generates a hybrid index named index_name for the given table_ids by utilizing the IndexGenerator module.
Args
- index_name (str): The name of the index to be generated.
- table_ids (list[str] | tuple[str]): The IDs of the tables to be indexed (to be exact, their content summaries and context/metadata).
Returns
str: A JSON string representing the result of the process (Response).
query_index
query_index(
index_name: str,
queries: list[str] | str,
k: int = 1,
n: int = 5,
alpha: float = 0.5,
) -> str
Retrieves tables for the given queries against the index index_name by utilizing the QueryProcessor module.
Args
- index_name (str): The name of the index to query.
- queries (str | list[str]): The query or list of queries to be executed.
- k (int): The number of documents associated with the tables to be retrieved.
- n (int): The multiplicative factor of k to pool more relevant documents for the hybrid retrieval process.
- alpha (float): The weighting factor of the vector and full-text retrievers within a hybrid index. Lower alpha gives more weight to the vector retriever.
Returns
str: A JSON string representing the result of the process (Response).
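To make the roles of k, n, and alpha concrete, here is a minimal sketch of one common way to combine two retrievers. It is an assumption, not the QueryProcessor's documented formula: each retriever contributes its top n * k candidates, and the combined score is alpha * full_text + (1 - alpha) * vector, which matches the statement that a lower alpha favors the vector retriever. The function name `hybrid_top_k` and the sample scores are hypothetical, and scores are assumed to be normalized to a comparable range.

```python
def hybrid_top_k(
    vector_scores: dict[str, float],
    fulltext_scores: dict[str, float],
    k: int = 1,
    n: int = 5,
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Pool n * k candidates per retriever, then rank by a weighted score."""
    pool_size = n * k

    def top(scores: dict[str, float]) -> list[str]:
        # Highest-scoring document IDs first, truncated to the pool size.
        return sorted(scores, key=scores.get, reverse=True)[:pool_size]

    candidates = set(top(vector_scores)) | set(top(fulltext_scores))
    combined = {
        # Assumed formula: lower alpha shifts weight toward the vector score.
        doc: alpha * fulltext_scores.get(doc, 0.0)
        + (1 - alpha) * vector_scores.get(doc, 0.0)
        for doc in candidates
    }
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical scores: "a" wins on vector similarity, "b" on full-text match.
vector = {"a": 0.9, "b": 0.2}
fulltext = {"a": 0.1, "b": 0.8}

vector_leaning = hybrid_top_k(vector, fulltext, k=1, alpha=0.2)
fulltext_leaning = hybrid_top_k(vector, fulltext, k=1, alpha=0.8)
```

In this toy data, the low alpha setting ranks "a" first and the high alpha setting ranks "b" first, which shows how the weighting changes the outcome for the same candidates.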
__hf_login
__hf_login()
Logs into Hugging Face if a token is provided.
__init_registrar
__init_registrar()
Initializes the Registrar module.
__init_summarizer
__init_summarizer()
Initializes the Summarizer module.
__init_index_generator
__init_index_generator()
Initializes the IndexGenerator module.
__init_query_processor
__init_query_processor()
Initializes the QueryProcessor module.
__init_llm
__init_llm()
Initializes the LLM.
__init_embed_model
__init_embed_model()
Initializes the embedding model.