Earn 70,000 ($700.00)
Embedding Workflow Creation
Bounty Description
Title: Embedding Workflow Creation
Overview
This Bounty has three main pieces. The first part is the creation of embeddings themselves, which involves generating embeddings from various file formats such as .docx, .doc, .txt, .csv, .xls, .xlsx, and .pdf. The embeddings will be stored in a SQLite3 database table along with an outline that explains the text the embeddings represent. The second part is to load the created embeddings into a vector database. While Pinecone is the primary choice due to its reputation and focus on AI embeddings and long-term memory, alternative options like Milvus can be explored based on developer input. The third part involves creating a simple user interface that utilizes OpenAI API Chat Completions or a similar chatbot framework. The interface will provide semantic similarity search capabilities by leveraging the stored embeddings. The chatbot will analyze user prompts, perform similarity searches on the embeddings, and generate high-quality, data-based responses tailored to the user's needs.
Task Breakout to Complete Project
Stage 1: Embedding Creation
Create an embedding ingestion workflow that can handle various file formats. Users can place items in a designated folder, and the script will process the unstructured data, generate embeddings, and store them in a SQLite3 database table. The database will include separate rows for the vectorized embedding data and an outline explaining the text the embeddings represent.
Stage 2: Load Embeddings into Vector Database
Load the embeddings created in Stage 1 into a vector database, primarily Pinecone. However, other options like Milvus can be considered based on developer input and their advantages for the project's requirements.
Stage 3: Semantic Similarity Search
Develop a user interface that utilizes OpenAI API Chat Completions or a similar chatbot framework. The interface will allow users to engage in conversation and seek assistance based on semantic similarity search. The chatbot will analyze user prompts, leverage the stored embeddings, and process the returned similarity search data. By combining the user's prompt, context, and the data from the embeddings, the chatbot will generate highly accurate and useful responses. This stage can be applied to various use cases, such as field service report assistance, where technicians can communicate their observations, receive accurate diagnoses, and efficient remedies.
I have working python chat completion chatbots in streamlit and flask that can serve as the jumping off point for the this UI (if helpful, not required - feel free to use your own etc.)