MachineIntelligence-TextAnalytics-TPLDataFlows

Machine Intelligence Text Analytics Enrichment implemented using Task Parallel Library Data Flow Pipelines:

Document Enrichment Pipeline - Builds the entire Vector Database using OpenAI embeddings in SQL using 50 selected books
Q&A Over Vector Database Pipeline - Searches the SQL Vector Database with provided question phrase using Semantic Kernel
Total Text (OpenAI) Tokens Processed:...............8,267,408
Total Text (Characters) Length Processed:..........33,702,085
Total cost for processing and building Vector Database using OpenAI Embeddings (Feb 2024 prices):
- text-embedding-ada-002 with 1536 dimensions: ~$0.84 (~84 cents; this depends on how the chunking of text is configured)
- text-embedding-3-small with 512 dimensions: ~$0.17 (~17 cents; this depends on how the chunking of text is configured)

Features:

The console app uses 50 selected books from the Project Gutenberg site from various authors: Oscar Wilde, Bram Stoker, Edgar Allen Poe, Alexandre Dumas and performs enrichment using multiple AI enrichment steps
Downloads book text, processes text analytics & embeddings, creates a vector database in SQL, demonstrates vector search and answers a sample question using semantic meaning from OpenAI embeddings
Stores all enrichment output for each book in a seperate JSON file
Rather than processing text analytics enrichment in single synchronous steps, it uses an data flow model to create efficient pipelines that can saturate multiple logical CPU cores
Illustrates that SQL Server or Azure SQL can be used as a valid Vector Store, can perform vector search and provide Q&A over the database
Demonstrates how to create a Machine Intelligence & Text Analytics Pipeline can be combbined using TPL DataFlows
The console application is cross-platform .NET 8.x. It will run on macOS, Linux, Windows 10/11 x64, Windows 11 ARM

Requirements:

Visual Studio 2022, .NET 9.x: https://dotnet.microsoft.com/en-us/download/dotnet/9.0
SQL Server Connection to either a local SQL Server 2022 (free Devolpment SKU or higher) or Azure SQL Database
******Note: SQL Server 2022 / Azure SQL Database features are used for JSON processing and ordered Columnstore Indexes
Azure OpenAI for both embeddings and Chat Completions

Getting Started - Step 1) Configuration of SQL Connection and OpenAI API Keys (example of secrets.json shown below)

Ensure to add .NET Secrets or JSON configuration (you will need to add the JSON code if using a file)
Right-click on the C# Project and select "Manage User Secrets"
Add the SQL Connection (SQLConnection) and OpenAI (APIKey) (if using Azure OpenAPI, use AzureOpenAPI section)

{
  "SQL": {
    "SqlConnection": "Server=[NAME OF SERVER],1433;Initial Catalog=MachineIntelligenceDb;Persist Security Info=False;User ID=[USERID];Password=[PASSWORD];MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=5000;"
  },
  "OpenAI": {
    "APIKey": "[YOUR OPENAPI KEY]"
  },
  "AzureOpenAI": {
    "APIKey": "[YOUR AZURE OPENAPI KEY]"
  }
}

Getting Started - Step 2) Processing (after adding proper SQL and OpenAI/Azure OpenAI connections):

Select option 1 to process the entire Data Enrichment Pipeline (build the embeddings Vector Database in SQL)
Select option 2 to only process the Q&A pipeline using Semantic Kernel over the Vector Database (Note: Option #1 must have been run beforehand)
Select option 3 to only process the Q&A pipeline with reasoning using Semantic Kernel over the Vector Database (Note: Option #1 must have been run beforehand). This option is similar to option #2 except it provides details on how the AI agent achieved the results.

Learn more about the concepts used in this repository:

Semenantic Kernel: https://aka.ms/semantic-kernel
OpenAI Embeddings: https://platform.openai.com/docs/guides/embeddings
TPL Dataflows .NET: https://docs.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library
SQL Server Columnstore Indexes: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview
Project Gutenberg (over 70,000 free eBooks): https://www.gutenberg.org/
SharpToken (C# for encoding/decoding LLM tokens): https://github.com/dmitry-brazhenko/SharpToken
Use .NET secrets in a Console Application: https://www.programmingwithwolfgang.com/use-net-secrets-in-console-application/

Name		Name	Last commit message	Last commit date
Latest commit History 200 Commits
.devcontainer		.devcontainer
.vs		.vs
MachineIntelligenceTPLDataFlows		MachineIntelligenceTPLDataFlows
Presentations		Presentations
.gitignore		.gitignore
ConsoleScreenshot.png		ConsoleScreenshot.png
LICENSE.md		LICENSE.md
MachineIntelligence-TPLDataFlows.sln		MachineIntelligence-TPLDataFlows.sln
README.md		README.md
TPLDataFlows-ConsoleApp.png		TPLDataFlows-ConsoleApp.png
TPLDataFlows-Pipeline.png		TPLDataFlows-Pipeline.png
TPLDataFlows-Pipeline.pptx		TPLDataFlows-Pipeline.pptx
TPLDataFlows-Secrets.png		TPLDataFlows-Secrets.png
TPLVectorEmbeddingsProcessingConsole.gif		TPLVectorEmbeddingsProcessingConsole.gif

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MachineIntelligence-TextAnalytics-TPLDataFlows

Machine Intelligence Text Analytics Enrichment implemented using Task Parallel Library Data Flow Pipelines:

Features:

Requirements:

Getting Started - Step 1) Configuration of SQL Connection and OpenAI API Keys (example of secrets.json shown below)

Getting Started - Step 2) Processing (after adding proper SQL and OpenAI/Azure OpenAI connections):

Learn more about the concepts used in this repository:

About

Releases

Packages

Contributors 2

Languages

License

bartczernicki/MachineIntelligence-TextAnalytics-TPLDataFlows

Folders and files

Latest commit

History

Repository files navigation

MachineIntelligence-TextAnalytics-TPLDataFlows

Machine Intelligence Text Analytics Enrichment implemented using Task Parallel Library Data Flow Pipelines:

Features:

Requirements:

Getting Started - Step 1) Configuration of SQL Connection and OpenAI API Keys (example of secrets.json shown below)

Getting Started - Step 2) Processing (after adding proper SQL and OpenAI/Azure OpenAI connections):

Learn more about the concepts used in this repository:

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages