[Question]: What's the recommendation in general for SQL data, should sql data be embedded or relying on SQL Query Engine #8422
Replies: 9 comments 1 reply
-
🤖 Hello, Thank you for your detailed question. The choice between keeping SQL data in the database and querying it with NLSQLTableQueryEngine, or loading the data and embedding it into a vector database for similarity search, depends on the specific requirements of your application.

The NLSQLTableQueryEngine is designed to handle complex multi-table queries efficiently. It takes a natural language query, converts it into SQL, and then executes that SQL query over a structured database. If your application involves complex multi-table queries, it might be beneficial to keep the data in the SQL database and use the NLSQLTableQueryEngine for querying.

On the other hand, if your application requires similarity search, it might be more efficient to load the data and embed it into a vector database. Vector databases are designed to handle similarity search efficiently, which can be beneficial for improving precision.

However, it's important to note that the choice between these two options also depends on other factors, such as the size of your data, the frequency of your queries, and the specific requirements of your application. It is therefore recommended to test both options under your specific use case to determine which one provides better query accuracy and precision. For more details, you can refer to the following sources in the LlamaIndex codebase:
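In code, the NL → SQL → execute flow described above looks roughly like this. This is a conceptual sketch, not real LlamaIndex usage: the "LLM" is a stub returning a canned SQL string, and the table and column names are hypothetical.

```python
import sqlite3

# Conceptual sketch of the NL -> SQL -> execute pipeline that
# NLSQLTableQueryEngine automates. The "LLM" here is a stub that
# returns a canned SQL string; the real engine prompts a model
# with the table schema to produce the SQL.
def fake_llm_to_sql(question: str) -> str:
    # Stand-in for the text-to-SQL model call.
    return "SELECT title FROM articles WHERE topic = 'billing'"

def answer(question: str, conn: sqlite3.Connection) -> list:
    sql = fake_llm_to_sql(question)               # 1. natural language -> SQL
    return [row[0] for row in conn.execute(sql)]  # 2. run SQL over the DB

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, topic TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("Refund policy", "billing"), ("VPN setup", "network")],
)
print(answer("Which articles cover billing?", conn))  # ['Refund policy']
```

The key property of this path is that the structured data never leaves the database; only the query formulation changes.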
I hope this helps! If you have any more questions, feel free to ask. Best, Sources
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
-
Thanks Dosu. The answer still mostly assumes the generative-SQL approach, i.e. using the model to generate SQL queries against the SQL database. It touched on my other option, embedding SQL data into a vector DB, but still hasn't provided concrete instructions on how to achieve it. By the way, my test setup is Postgres + Qdrant, with my own code to query and then package documents, doing the from_documents embedding with an offline Hugging Face embedding model ("thenlper/gte-large"). Back on the documented approach, i.e. generating SQL queries against the existing database, I also have further questions on
-
🤖 Hello, Thank you for your detailed questions. Let's address them one by one.
To use the LlamaIndex framework to generate SQL queries from an offline LLM model, handle complex queries, and embed SQL data into a vector database, you would need to create an instance of the relevant query engine class. Here are some relevant sources for your reference:
I hope this helps! If you have any further questions, feel free to ask. Sources
-
I might need you to elaborate on 'the other query engine', and whether that essentially means we need to embed our data into a vector DB, for the case where we need to fuzzy-match a term the user mentioned against the matching knowledge articles or paragraphs stored in the SQL database.
-
@playerjamesbattleground embedding SQL data really only works for data that isn't highly numerical -- e.g. a DB of Q&A pairs. You can use the DB loader here and load into a vector DB. Then you could use the SQL join engine to combine them: https://gpt-index.readthedocs.io/en/stable/examples/query_engine/SQLJoinQueryEngine.html
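The combination described here — a similarity search over embedded Q&A-style text joined with a structured SQL lookup — can be sketched in miniature. A bag-of-words similarity stands in for a real vector store, and all table names and data are illustrative.

```python
import math
import sqlite3
from collections import Counter

# Toy sketch of what the SQL join engine combines: an unstructured
# similarity search over embedded Q&A pairs, joined with a
# structured SQL lookup. The bag-of-words "embedding" below stands
# in for a real vector store.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

qa_pairs = ["how do I reset my password", "how do refunds work"]

def similarity_search(query: str) -> str:
    q = embed(query)
    return max(qa_pairs, key=lambda doc: cosine(q, embed(doc)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (topic TEXT, count INTEGER)")
conn.execute("INSERT INTO tickets VALUES ('refunds', 42)")

best_doc = similarity_search("refunds")        # unstructured side
(count,) = conn.execute(
    "SELECT count FROM tickets WHERE topic = 'refunds'"
).fetchone()                                   # structured side
print(best_doc, count)  # how do refunds work 42
```

This also illustrates why highly numerical or relational data gains little from embedding: the numbers stay in SQL, and only free-text content goes through the similarity side.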
-
Thanks @logan-markewich. Should I understand it this way: since SQL data is normally highly relational, similarity search can't take advantage of it the way native SQL queries can, which easily join things together? I will dig more into things like the SQL join engine, and maybe maintain a 'terminologies' or 'topics' collection in the vector DB for a model to understand what the user is asking, then join with the SQL engine for more SQL-data-based knowledge text, and see how that works. I did my own research but didn't really find good writings that summarise the pros and cons of embedding SQL data, or when to use it and when not to, hence this whole question post. Not sure if you are aware of any.
-
A follow-up comment to the above: I found that the high-level design of the SQL query engines is to first decide whether to query the SQL server, then join in the unstructured query results from, say, the vector DB. I wonder if this order can be changed, i.e. first query the vector DB and extract information to query the SQL server with? That is much closer to our use case, where the user asks any question, the Gen AI part extracts key terms, and then the DB is queried with them for the proper knowledge text.
-
Having tested the sql_join_query_engine, I found the above to be true: it either queries only the secondary query engine (e.g. the vector DB), or queries SQL first and then the second choice. Could letting the user choose the order be an improvement idea? In my case, I want to query the vector DB first and, with the info it finds, query the SQL server afterwards.
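Even without engine support, the reversed order asked about here can be composed by hand: resolve the user's wording to a canonical term first, then query SQL with it. In this toy sketch a dict lookup stands in for a similarity search over a 'terminologies' vector collection, and all names and data are illustrative.

```python
import sqlite3

# Sketch of the reversed vector-first, SQL-second order: step 1
# resolves the user's phrasing to a canonical topic (stand-in for
# retrieval from a terminology collection in e.g. Qdrant); step 2
# uses that topic in a structured SQL query.
terminology = {
    "money back": "refunds",
    "log in problem": "authentication",
}

def resolve_term(user_text: str) -> str:
    # Stand-in for vector retrieval: match a known alias in the text.
    lowered = user_text.lower()
    for alias, canonical in terminology.items():
        if alias in lowered:
            return canonical
    return lowered

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kb (topic TEXT, article TEXT)")
conn.execute("INSERT INTO kb VALUES ('refunds', 'Refund policy details')")

topic = resolve_term("Can I get my money back?")  # step 1: vector side
(article,) = conn.execute(
    "SELECT article FROM kb WHERE topic = ?", (topic,)
).fetchone()                                      # step 2: SQL side
print(article)  # Refund policy details
```

The point of the sketch is only the ordering: the retrieval output becomes a parameter of the SQL query, rather than the SQL result being augmented afterwards.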
-
Hello, I do have the above follow-up questions that are expecting some comments; please don't close this.
-
Question
In our use case, our data is stored in PostgreSQL as the foundation of our private knowledge base. We want to implement the RAG concept with llama_index and give our users a chat-with-knowledge experience.
Having gone through the documentation, I found that examples such as https://github.com/jerryjliu/llama_index/blob/main/docs/examples/index_structs/struct_indices/SQLIndexDemo.ipynb
mostly assume the data stays on the SQL database side, and change the query approach from native SQL into a Gen-AI-enabled generative experience via, for example, NLSQLTableQueryEngine. I understand the solution; however, I would like more instruction on how to
On the other side, I also wonder, and this might be the core of this question, whether it is also a recommended approach to load and embed the SQL data into a vector database and query it with similarity search instead. Because the data is already in good tabular format, with well-templated prompts and vector schema configuration I would expect it could provide answers with good accuracy too. However, I haven't seen many examples or instructions on this approach in the documentation, so I would like to hear from the community here about it.
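The row-to-document step this approach requires can be sketched as follows. This is a minimal illustration: the template and column names are hypothetical, and in a real setup the rendered text would be embedded and stored in a vector database such as Qdrant.

```python
# Sketch of turning a SQL row into an embeddable document: the row is
# rendered through a text template, and selected columns are kept as
# metadata so the vector store can filter on them. Template and
# column names below are hypothetical.
ROW_TEMPLATE = "Q: {question}\nA: {answer}\nCategory: {category}"

def render_row(row: dict) -> dict:
    return {
        "text": ROW_TEMPLATE.format(**row),
        "metadata": {"category": row["category"]},
    }

chunk = render_row({
    "question": "How do refunds work?",
    "answer": "Refunds are issued within 14 days.",
    "category": "billing",
})
print(chunk["metadata"]["category"])  # billing
```

The template is what makes "good tabular format" pay off: every row becomes a uniformly phrased chunk, which tends to embed and retrieve more consistently than raw column values.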