[Question]: What's the recommendation in general for SQL data, should sql data be embedded or relying on SQL Query Engine #8422
Replies: 9 comments 1 reply
-
🤖 Hello, Thank you for your detailed question. The choice between keeping SQL data in the database and querying it with NLSQLTableQueryEngine, or loading the data and embedding it into a vector database for similarity search, depends on the specific requirements of your application.

The NLSQLTableQueryEngine is designed to handle complex multi-table queries efficiently. It takes a natural language query, converts it into SQL, and then executes that SQL query over a structured database. If your application involves complex multi-table queries, it might be beneficial to keep the data in the SQL database and use the NLSQLTableQueryEngine for querying.

On the other hand, if your application requires similarity search, it might be more efficient to load the data and embed it into a vector database. Vector databases are designed to handle similarity search efficiently, which can be beneficial for improving precision.

However, it's important to note that the choice between these two options also depends on other factors, such as the size of your data, the frequency of your queries, and the specific requirements of your application. It is therefore recommended to test both options under your specific use case to determine which one provides better query accuracy and precision. For more details, you can refer to the following sources in the LlamaIndex codebase:
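In code, the NL → SQL → execute flow described above looks roughly like this. This is a conceptual sketch, not real LlamaIndex usage: the "LLM" is a stub returning a canned SQL string, and the table and column names are hypothetical.

```python
import sqlite3

# Conceptual sketch of the NL -> SQL -> execute pipeline that
# NLSQLTableQueryEngine automates. The "LLM" here is a stub that
# returns a canned SQL string; the real engine prompts a model
# with the table schema to produce the SQL.
def fake_llm_to_sql(question: str) -> str:
    # Stand-in for the text-to-SQL model call.
    return "SELECT title FROM articles WHERE topic = 'billing'"

def answer(question: str, conn: sqlite3.Connection) -> list:
    sql = fake_llm_to_sql(question)               # 1. natural language -> SQL
    return [row[0] for row in conn.execute(sql)]  # 2. run SQL over the DB

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, topic TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("Refund policy", "billing"), ("VPN setup", "network")],
)
print(answer("Which articles cover billing?", conn))  # ['Refund policy']
```

The key property of this path is that the structured data never leaves the database; only the query formulation changes.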
I hope this helps! If you have any more questions, feel free to ask. Best, Sources
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
-
Thanks Dosu. The answer still mostly assumes the generative-SQL approach, i.e. using the model to generate SQL queries against the SQL database. It touched on my other option, embedding SQL data into a vector DB, but still hasn't provided concrete instructions on how to achieve it. By the way, my test setup is Postgres + Qdrant, with my own code to query and then package documents, doing the from_documents embedding with an offline Hugging Face embedding model ("thenlper/gte-large"). Back on the documented approach, i.e. generating SQL queries against the existing database, I also have further questions on
-
🤖 Hello, Thank you for your detailed questions. Let's address them one by one.
To use the LlamaIndex framework to generate SQL queries from an offline LLM model, handle complex queries, and embed SQL data into a vector database, you would need to create an instance of the relevant query engine class. Here are some relevant sources for your reference:
I hope this helps! If you have any further questions, feel free to ask. Sources
-
I might need you to elaborate on 'the other query engine', and whether that essentially means we need to embed our data into a vector DB, for the case where we need to fuzzy-match a term the user mentioned against the matching knowledge articles or paragraphs stored in the SQL database.
-
@playerjamesbattleground embedding SQL data really only works for data that isn't highly numerical -- e.g. a DB of Q&A pairs. You can use the DB loader here and load into a vector DB. Then you could use the SQL join engine to combine them: https://gpt-index.readthedocs.io/en/stable/examples/query_engine/SQLJoinQueryEngine.html
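The combination described here — a similarity search over embedded Q&A-style text joined with a structured SQL lookup — can be sketched in miniature. A bag-of-words similarity stands in for a real vector store, and all table names and data are illustrative.

```python
import math
import sqlite3
from collections import Counter

# Toy sketch of what the SQL join engine combines: an unstructured
# similarity search over embedded Q&A pairs, joined with a
# structured SQL lookup. The bag-of-words "embedding" below stands
# in for a real vector store.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

qa_pairs = ["how do I reset my password", "how do refunds work"]

def similarity_search(query: str) -> str:
    q = embed(query)
    return max(qa_pairs, key=lambda doc: cosine(q, embed(doc)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (topic TEXT, count INTEGER)")
conn.execute("INSERT INTO tickets VALUES ('refunds', 42)")

best_doc = similarity_search("refunds")        # unstructured side
(count,) = conn.execute(
    "SELECT count FROM tickets WHERE topic = 'refunds'"
).fetchone()                                   # structured side
print(best_doc, count)  # how do refunds work 42
```

This also illustrates why highly numerical or relational data gains little from embedding: the numbers stay in SQL, and only free-text content goes through the similarity side.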
-
Thanks @logan-markewich. Should I understand it this way: since SQL data is normally highly relational, similarity search can't take advantage of it the way native SQL queries can, which easily join things together? I will dig more into things like the SQL join engine, and maybe maintain a 'terminologies' or 'topics' collection in the vector DB for a model to understand what the user is asking, then join with the SQL engine for more SQL-data-based knowledge text, and see how that works. I did my own research but didn't really find good writings that summarise the pros and cons of embedding SQL data, or when to use it and when not to, hence this whole question post. Not sure if you are aware of any.
-
A follow-up comment to the above: I found that the high-level design of the SQL query engines is to first decide whether to query the SQL server, then join in the unstructured query results from, say, the vector DB. I wonder if this order can be changed, i.e. first query the vector DB and extract information to query the SQL server with? That is much closer to our use case, where the user asks any question, the Gen AI part extracts key terms, and then the DB is queried with them for the proper knowledge text.
-
Having tested the sql_join_query_engine, I found the above to be true: it either queries only the secondary query engine (e.g. the vector DB), or queries SQL first and then the second choice. Could letting the user choose the order be an improvement idea? In my case, I want to query the vector DB first and, with the info it finds, query the SQL server afterwards.
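Even without engine support, the reversed order asked about here can be composed by hand: resolve the user's wording to a canonical term first, then query SQL with it. In this toy sketch a dict lookup stands in for a similarity search over a 'terminologies' vector collection, and all names and data are illustrative.

```python
import sqlite3

# Sketch of the reversed vector-first, SQL-second order: step 1
# resolves the user's phrasing to a canonical topic (stand-in for
# retrieval from a terminology collection in e.g. Qdrant); step 2
# uses that topic in a structured SQL query.
terminology = {
    "money back": "refunds",
    "log in problem": "authentication",
}

def resolve_term(user_text: str) -> str:
    # Stand-in for vector retrieval: match a known alias in the text.
    lowered = user_text.lower()
    for alias, canonical in terminology.items():
        if alias in lowered:
            return canonical
    return lowered

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kb (topic TEXT, article TEXT)")
conn.execute("INSERT INTO kb VALUES ('refunds', 'Refund policy details')")

topic = resolve_term("Can I get my money back?")  # step 1: vector side
(article,) = conn.execute(
    "SELECT article FROM kb WHERE topic = ?", (topic,)
).fetchone()                                      # step 2: SQL side
print(article)  # Refund policy details
```

The point of the sketch is only the ordering: the retrieval output becomes a parameter of the SQL query, rather than the SQL result being augmented afterwards.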
-
Hello, I do have the above follow-up questions that are expecting some comments; please don't close this.
-
Question
In our use case, our data is stored in PostgreSQL as the foundation of our private knowledge base. We want to implement the RAG concept with llama_index and give our users a chat-with-knowledge experience.
Having gone through the documentation, I found that examples such as https://github.com/jerryjliu/llama_index/blob/main/docs/examples/index_structs/struct_indices/SQLIndexDemo.ipynb
mostly assume the data stays on the SQL database side, and change the query approach from native SQL into a Gen-AI-enabled generative experience via, for example, NLSQLTableQueryEngine. I understand the solution; however, I would like more instruction on how to
On the other side, I also wonder, and this might be the core of this question, whether it is also a recommended approach to load and embed the SQL data into a vector database and query it with similarity search instead. Because the data is already in good tabular format, with well-templated prompts and vector schema configuration I would expect it could provide answers with good accuracy too. However, I haven't seen many examples or instructions on this approach in the documentation, so I would like to hear from the community here about it.
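The row-to-document step this approach requires can be sketched as follows. This is a minimal illustration: the template and column names are hypothetical, and in a real setup the rendered text would be embedded and stored in a vector database such as Qdrant.

```python
# Sketch of turning a SQL row into an embeddable document: the row is
# rendered through a text template, and selected columns are kept as
# metadata so the vector store can filter on them. Template and
# column names below are hypothetical.
ROW_TEMPLATE = "Q: {question}\nA: {answer}\nCategory: {category}"

def render_row(row: dict) -> dict:
    return {
        "text": ROW_TEMPLATE.format(**row),
        "metadata": {"category": row["category"]},
    }

chunk = render_row({
    "question": "How do refunds work?",
    "answer": "Refunds are issued within 14 days.",
    "category": "billing",
})
print(chunk["metadata"]["category"])  # billing
```

The template is what makes "good tabular format" pay off: every row becomes a uniformly phrased chunk, which tends to embed and retrieve more consistently than raw column values.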