From 9b88070be422f3db2677ff42f55820b1fa1271bf Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Wed, 20 Nov 2024 14:06:19 -0500 Subject: [PATCH 01/25] First pass at changes. --- .../postgres_for_kubernetes/1/iron-bank.mdx | 52 ++++++++++++++++--- 1 file changed, 44 insertions(+), 8 deletions(-) diff --git a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx index 13f7e7b85a9..b5b2108e452 100644 --- a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx +++ b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx @@ -23,7 +23,7 @@ Iron Bank is a part of DoD's [Platform One](https://p1.dso.mil/). You will need your Iron Bank credentials to access the Iron Bank page for [EDB Postgres for Kubernetes](https://repo1.dso.mil/dsop/enterprisedb/edb-pg4k-operator). -## Pulling the EDB PG4K image from Iron Bank +## Pulling the EDB PG4K and operand images from Iron Bank The images are pulled from the separate [Iron Bank container registry](https://registry1.dso.mil/). To be able to pull images from the Iron Bank registry, please follow the @@ -33,19 +33,25 @@ Specifically, you will need to use your [registry1](https://registry1.dso.mil/harbor/projects) credentials to pull images. -To find the desired operator image, we recommend to use the search tool to look +To find the desired operator or operand images, we recommend to use the search tool to look with the string `edb`, and filter by `Tags`, looking for `stable`, as shown in -the image. From there, you can get the instruction to pull the image, for -example using Docker: +the image. From there, you can get the instruction to pull the image: ![pulling-ironbank-images](./images/ironbank/pulling-the-image.png) +For example, to pull the latest EPAS16 operand from Ironbank, you can run: + +```bash +docker pull registry1.dso.mil/ironbank/enterprisedb/edb-postgres-advanced-16:16 + +``` + +If you want to pick a more specific tag or use a specific SHA, you need to find it from the [Harbor page](https://registry1.dso.mil/harbor/projects/3/repositories/enterprisedb%2Fedb-postgres-adbanced-server%2Fedb-postgres-advanced-16/artifacts-tab). + ## Installing the PG4K operator using the Iron Bank image -For installation, you will need a deployment manifest that points to your Iron -Bank image. -You can take the deployment manifest from the -[installation instructions for EDB PG4K](/postgres_for_kubernetes/latest/installation_upgrade/). +For installation, you will need a deployment manifest that points to your Iron Bank image. +You can take the deployment manifest from the [installation instructions for EDB PG4K](/postgres_for_kubernetes/latest/installation_upgrade/). For example, for the 1.22.0 release, the manifest is available at `https://get.enterprisedb.io/cnp/postgresql-operator-1.22.0.yaml`. \\ There are a couple of places where you will need to set the image path for the @@ -90,3 +96,33 @@ directly to your Kubernetes nodes. Once you have this in place, you can apply your manifest normally with `kubectl apply -f`, as described in the [installation instructions](/postgres_for_kubernetes/latest/installation_upgrade/). + +## Deploy clusters with EPAS operands using their Iron Bank image + +To deploy a cluster using an [operand](/postgres_for_kubernetes/latest/private_edb_registries/#operand-images) from a pulled image, you must reference the Ironbank operand image appropriately in a Custom Resource (CR) file. +For example, for an EPAS 16 operand: + +1. 
Create or edit a Custom Resource (CR) YAML file with the following content: + + ```yaml + apiVersion: postgresql.k8s.enterprisedb.io/v1 + kind: Cluster + metadata: + name: cluster-example-full + spec: + imageName: registry1.dso.mil/ironbank/enterprisedb/edb-postgres-advanced-16@16.4 + imagePullSecrets: + - name: my_ironbank_secret + ``` + +2. Apply the YAML: + + ``` + kubectl apply -f + ``` + +3. Verify the status of the resource: + + ``` + kubectl get clusters + ``` From 18b867ca8dbaf2c282d9814143a0a748a5ba4f44 Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Wed, 20 Nov 2024 14:12:25 -0500 Subject: [PATCH 02/25] Small changes --- product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx index b5b2108e452..486c02a0e13 100644 --- a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx +++ b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx @@ -100,7 +100,7 @@ Once you have this in place, you can apply your manifest normally with ## Deploy clusters with EPAS operands using their Iron Bank image To deploy a cluster using an [operand](/postgres_for_kubernetes/latest/private_edb_registries/#operand-images) from a pulled image, you must reference the Ironbank operand image appropriately in a Custom Resource (CR) file. -For example, for an EPAS 16 operand: +For example, for an EPAS 16.4 operand: 1. Create or edit a Custom Resource (CR) YAML file with the following content: From 6df8840b46390140ec323f303fef03c6ca95d256 Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Wed, 20 Nov 2024 14:28:24 -0500 Subject: [PATCH 03/25] Changed title to gerund --- product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx index 486c02a0e13..2821f60b469 100644 --- a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx +++ b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx @@ -97,7 +97,7 @@ directly to your Kubernetes nodes. Once you have this in place, you can apply your manifest normally with `kubectl apply -f`, as described in the [installation instructions](/postgres_for_kubernetes/latest/installation_upgrade/). -## Deploy clusters with EPAS operands using their Iron Bank image +## Deploying clusters with EPAS operands using their Iron Bank image To deploy a cluster using an [operand](/postgres_for_kubernetes/latest/private_edb_registries/#operand-images) from a pulled image, you must reference the Ironbank operand image appropriately in a Custom Resource (CR) file. For example, for an EPAS 16.4 operand: From febc4a669fc7b26016814b80c3706b3c44a92888 Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Wed, 20 Nov 2024 14:32:35 -0500 Subject: [PATCH 04/25] Small changes. 
--- product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx index 2821f60b469..9440657e112 100644 --- a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx +++ b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx @@ -100,7 +100,7 @@ Once you have this in place, you can apply your manifest normally with ## Deploying clusters with EPAS operands using their Iron Bank image To deploy a cluster using an [operand](/postgres_for_kubernetes/latest/private_edb_registries/#operand-images) from a pulled image, you must reference the Ironbank operand image appropriately in a Custom Resource (CR) file. -For example, for an EPAS 16.4 operand: +For example, to deploy a PG4K cluster using a EPAS 16.4 operand: 1. Create or edit a Custom Resource (CR) YAML file with the following content: From 17922af65b03cd8623e29565fc5ff554c7562e4e Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Mon, 2 Dec 2024 14:13:29 -0500 Subject: [PATCH 05/25] Small change in wording. --- product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx index 9440657e112..f6424a48478 100644 --- a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx +++ b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx @@ -99,7 +99,7 @@ Once you have this in place, you can apply your manifest normally with ## Deploying clusters with EPAS operands using their Iron Bank image -To deploy a cluster using an [operand](/postgres_for_kubernetes/latest/private_edb_registries/#operand-images) from a pulled image, you must reference the Ironbank operand image appropriately in a Custom Resource (CR) file. +To deploy a cluster using the EPAS [operand](/postgres_for_kubernetes/latest/private_edb_registries/#operand-images) from a pulled image, you must reference the Ironbank operand image appropriately in a Custom Resource (CR) file. For example, to deploy a PG4K cluster using a EPAS 16.4 operand: 1. Create or edit a Custom Resource (CR) YAML file with the following content: From 0f566995462b46c8fcdf0969fc6b61362181afcc Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Mon, 2 Dec 2024 14:16:10 -0500 Subject: [PATCH 06/25] Another small word change. --- product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx index f6424a48478..31a815592ec 100644 --- a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx +++ b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx @@ -39,7 +39,7 @@ the image. 
From there, you can get the instruction to pull the image: ![pulling-ironbank-images](./images/ironbank/pulling-the-image.png) -For example, to pull the latest EPAS16 operand from Ironbank, you can run: +For example, to pull the EPAS16 operand from Ironbank, you can run: ```bash docker pull registry1.dso.mil/ironbank/enterprisedb/edb-postgres-advanced-16:16 From b79b52793d44a60d9a4b3a0529fb999f3a393c52 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Niccol=C3=B2=20Fei?= Date: Wed, 4 Dec 2024 16:29:52 +0100 Subject: [PATCH 07/25] docs: review --- .../postgres_for_kubernetes/1/iron-bank.mdx | 21 +++++++++---------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx index 31a815592ec..470286ea671 100644 --- a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx +++ b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx @@ -34,7 +34,7 @@ Specifically, you will need to use your credentials to pull images. To find the desired operator or operand images, we recommend to use the search tool to look -with the string `edb`, and filter by `Tags`, looking for `stable`, as shown in +with the string `enterprisedb`, and filter by `Tags`, looking for `stable`, as shown in the image. From there, you can get the instruction to pull the image: ![pulling-ironbank-images](./images/ironbank/pulling-the-image.png) @@ -46,16 +46,15 @@ docker pull registry1.dso.mil/ironbank/enterprisedb/edb-postgres-advanced-16:16 ``` -If you want to pick a more specific tag or use a specific SHA, you need to find it from the [Harbor page](https://registry1.dso.mil/harbor/projects/3/repositories/enterprisedb%2Fedb-postgres-adbanced-server%2Fedb-postgres-advanced-16/artifacts-tab). +If you want to pick a more specific tag or use a specific SHA, you need to find it from the [Harbor page](https://registry1.dso.mil/harbor/projects/3/repositories/enterprisedb%2Fedb-postgres-advanced-16/artifacts-tab). ## Installing the PG4K operator using the Iron Bank image For installation, you will need a deployment manifest that points to your Iron Bank image. You can take the deployment manifest from the [installation instructions for EDB PG4K](/postgres_for_kubernetes/latest/installation_upgrade/). For example, for the 1.22.0 release, the manifest is available at -`https://get.enterprisedb.io/cnp/postgresql-operator-1.22.0.yaml`. \\ -There are a couple of places where you will need to set the image path for the -IronBank image. +`https://get.enterprisedb.io/cnp/postgresql-operator-1.22.0.yaml`. +There are a couple of places where you will need to set the image path for the IronBank image. ```yaml apiVersion: apps/v1 @@ -97,12 +96,12 @@ directly to your Kubernetes nodes. Once you have this in place, you can apply your manifest normally with `kubectl apply -f`, as described in the [installation instructions](/postgres_for_kubernetes/latest/installation_upgrade/). -## Deploying clusters with EPAS operands using their Iron Bank image +## Deploying clusters with EPAS operands using IronBank images -To deploy a cluster using the EPAS [operand](/postgres_for_kubernetes/latest/private_edb_registries/#operand-images) from a pulled image, you must reference the Ironbank operand image appropriately in a Custom Resource (CR) file. 
-For example, to deploy a PG4K cluster using a EPAS 16.4 operand: +To deploy a cluster using the EPAS [operand](/postgres_for_kubernetes/latest/private_edb_registries/#operand-images) you must reference the Ironbank operand image appropriately in the `Cluster` resource YAML. +For example, to deploy a PG4K Cluster using the EPAS 16 operand: -1. Create or edit a Custom Resource (CR) YAML file with the following content: +1. Create or edit a `Cluster` resource YAML file with the following content: ```yaml apiVersion: postgresql.k8s.enterprisedb.io/v1 @@ -110,9 +109,9 @@ For example, to deploy a PG4K cluster using a EPAS 16.4 operand: metadata: name: cluster-example-full spec: - imageName: registry1.dso.mil/ironbank/enterprisedb/edb-postgres-advanced-16@16.4 + imageName: registry1.dso.mil/ironbank/enterprisedb/edb-postgres-advanced-16@16 imagePullSecrets: - - name: my_ironbank_secret + - name: my_ironbank_secret ``` 2. Apply the YAML: From 226e7a3d205d1c2b63d368fbb9aeaaca7fcc8425 Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Thu, 12 Dec 2024 13:11:30 -0500 Subject: [PATCH 08/25] First draft added sub-json. --- .../5/06_features_of_mongo_fdw.mdx | 55 +++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx index 7bd716a8264..48b4443b0d9 100644 --- a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx +++ b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx @@ -96,3 +96,58 @@ Steps for retrieving the document: { "_id" : { "$oid" : "58a1ebbaf543ec0b9054585a" }, "warehouse_id" : 2, "warehouse_name" : "Laptop", "warehouse_created" : { "$date" : 1447229590000 } } (2 rows) ``` + +## Accessing sub-json + +MongoDB Foreign Data Wrapper supports retrieving sub-json, by allowing you to retrieve sub-fields from a document. +You can retrieve sub-fields in a collection residing in MongoDB Foreign Data Wrapper as explained in the following example: + +```text +db1> db.test_sub_json.find() +[ + { + _id: ObjectId('658040214890799d6e0173d0'), + key1: 'hello', + key2: { + subkey21: 'hello-sub1', + subkey22: 'hello-sub2', + subtstmp: ISODate('2022-12-16T19:16:17.801Z') + } + } +] +``` + +Steps for retrieving sub-fields from the document: + +1. Create a forign table. To access a sub-field use the dot (".") in the columnn name as shown below: + +```sql +CREATE FOREIGN TABLE ft_nested_json_test( + _id NAME, + key1 varchar, + "key2.subkey21" varchar, + "key2.subkey22" varchar, + "key2.subtstmp" timestamp +)SERVER mongo_server +OPTIONS (database 'db1', collection 'test-sub_json'); +``` + +2. Retrieve the document with sub-fields: + +```sql +SELECT * FROM ft_nested_json_test; +__OUTPUT__ + _id | key1 | key2.subkey21 | key2.subkey22 | key2.subtstmp +--------------------------+-------+---------------+---------------+------------------------ + 658040214890799d6e0173d0 | hello | hello-sub1 | hello-sub2 | 16-DEC-22 19:16:17.801 +``` + +3. Retrieve an individual field: + +```sql +SELECT "key2.subkey21" FROM ft_nested_json_test; +__OUTPUT__ + key2.subkey21 +--------------- + hello-sub1 +``` \ No newline at end of file From 8a036bdef5f6caaab180b17802c92b4e29ea7e86 Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Mon, 16 Dec 2024 12:57:04 -0500 Subject: [PATCH 09/25] Improved language. 
--- .../mongo_data_adapter/5/06_features_of_mongo_fdw.mdx | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx index 48b4443b0d9..5f019a2267d 100644 --- a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx +++ b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx @@ -97,10 +97,13 @@ Steps for retrieving the document: (2 rows) ``` -## Accessing sub-json +## Accessing nested fields -MongoDB Foreign Data Wrapper supports retrieving sub-json, by allowing you to retrieve sub-fields from a document. -You can retrieve sub-fields in a collection residing in MongoDB Foreign Data Wrapper as explained in the following example: +MongoDB Foreign Data Wrapper allows you to access individual fields within nested JSON documents by mapping the nested structure to columns in a foreign table. +This works by mapping the nested structure of the MongoDB document to relational columns in the foreign table definition, using dot notation (key2.subkey21) to reference nested fields. +You can retrieve these fields from a collection as shown in the following example: + +### Example ```text db1> db.test_sub_json.find() @@ -119,7 +122,7 @@ db1> db.test_sub_json.find() Steps for retrieving sub-fields from the document: -1. Create a forign table. To access a sub-field use the dot (".") in the columnn name as shown below: +1. Create a foreign table. To access a sub-field use the dot (".") in the columnn name as shown below: ```sql CREATE FOREIGN TABLE ft_nested_json_test( From aef520271880e6068019b3cc711b754b463c242a Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Mon, 16 Dec 2024 13:01:55 -0500 Subject: [PATCH 10/25] Nested sub-field topic under full document retrieval --- .../docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx index 5f019a2267d..54b4d6bf487 100644 --- a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx +++ b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx @@ -97,13 +97,13 @@ Steps for retrieving the document: (2 rows) ``` -## Accessing nested fields +### Accessing nested fields MongoDB Foreign Data Wrapper allows you to access individual fields within nested JSON documents by mapping the nested structure to columns in a foreign table. This works by mapping the nested structure of the MongoDB document to relational columns in the foreign table definition, using dot notation (key2.subkey21) to reference nested fields. 
You can retrieve these fields from a collection as shown in the following example: -### Example +#### Example ```text db1> db.test_sub_json.find() From e98e2b5bb03ea4abfbecd2d46242f32a4f54d568 Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Mon, 16 Dec 2024 15:31:28 -0500 Subject: [PATCH 11/25] Fixed indentations of code blocks for using easier numerics --- .../5/06_features_of_mongo_fdw.mdx | 52 +++++++++---------- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx index 54b4d6bf487..a75170e6b76 100644 --- a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx +++ b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx @@ -124,33 +124,33 @@ Steps for retrieving sub-fields from the document: 1. Create a foreign table. To access a sub-field use the dot (".") in the columnn name as shown below: -```sql -CREATE FOREIGN TABLE ft_nested_json_test( - _id NAME, - key1 varchar, - "key2.subkey21" varchar, - "key2.subkey22" varchar, - "key2.subtstmp" timestamp -)SERVER mongo_server -OPTIONS (database 'db1', collection 'test-sub_json'); -``` + ```sql + CREATE FOREIGN TABLE ft_nested_json_test( + _id NAME, + key1 varchar, + "key2.subkey21" varchar, + "key2.subkey22" varchar, + "key2.subtstmp" timestamp + )SERVER mongo_server + OPTIONS (database 'db1', collection 'test-sub_json'); + ``` -2. Retrieve the document with sub-fields: +1. Retrieve the document with sub-fields: -```sql -SELECT * FROM ft_nested_json_test; -__OUTPUT__ - _id | key1 | key2.subkey21 | key2.subkey22 | key2.subtstmp ---------------------------+-------+---------------+---------------+------------------------ - 658040214890799d6e0173d0 | hello | hello-sub1 | hello-sub2 | 16-DEC-22 19:16:17.801 -``` + ```sql + SELECT * FROM ft_nested_json_test; + __OUTPUT__ + _id | key1 | key2.subkey21 | key2.subkey22 | key2.subtstmp + --------------------------+-------+---------------+---------------+------------------------ + 658040214890799d6e0173d0 | hello | hello-sub1 | hello-sub2 | 16-DEC-22 19:16:17.801 + ``` -3. Retrieve an individual field: +1. Retrieve an individual field: -```sql -SELECT "key2.subkey21" FROM ft_nested_json_test; -__OUTPUT__ - key2.subkey21 ---------------- - hello-sub1 -``` \ No newline at end of file + ```sql + SELECT "key2.subkey21" FROM ft_nested_json_test; + __OUTPUT__ + key2.subkey21 + --------------- + hello-sub1 + ``` \ No newline at end of file From a41ba4a82757ba33f9ae82868da2f4c18bd5d397 Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Mon, 16 Dec 2024 15:32:06 -0500 Subject: [PATCH 12/25] Update product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx Co-authored-by: gvasquezvargas --- .../docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx index a75170e6b76..65107a884d4 100644 --- a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx +++ b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx @@ -122,7 +122,7 @@ db1> db.test_sub_json.find() Steps for retrieving sub-fields from the document: -1. Create a foreign table. To access a sub-field use the dot (".") in the columnn name as shown below: +1. Create a foreign table. 
To access a sub-field use the dot (".") in the column name as shown below: ```sql CREATE FOREIGN TABLE ft_nested_json_test( From 303d5ff4b9cba9bf1d020b251155ac6dfc78c743 Mon Sep 17 00:00:00 2001 From: Dj Walker-Morgan Date: Tue, 17 Dec 2024 11:34:04 +0000 Subject: [PATCH 13/25] Updated front page Signed-off-by: Dj Walker-Morgan --- product_docs/docs/postgres_for_kubernetes/1/index.mdx | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/product_docs/docs/postgres_for_kubernetes/1/index.mdx b/product_docs/docs/postgres_for_kubernetes/1/index.mdx index 50e7c62ce2d..aeb11322713 100644 --- a/product_docs/docs/postgres_for_kubernetes/1/index.mdx +++ b/product_docs/docs/postgres_for_kubernetes/1/index.mdx @@ -114,8 +114,8 @@ EDB Postgres for Kubernetes was made generally available on February 4, 2021. Ea - Cold backup support with Kasten and Velero/OADP - Generic adapter for third-party Kubernetes backup tools -You can [evaluate EDB Postgres for Kubernetes for free](evaluation.md). -You need a valid license key to use EDB Postgres for Kubernetes in production. +You can [evaluate EDB Postgres for Kubernetes for free](evaluation.md) as part of a trial subscription. +You need a valid EDB subscription to use EDB Postgres for Kubernetes in production. !!! Note @@ -149,8 +149,8 @@ EDB Postgres for Kubernetes works with both PostgreSQL and EDB Postgres Advanced server, and is available under the [EDB Limited Use License](https://www.enterprisedb.com/limited-use-license). -You can [evaluate EDB Postgres for Kubernetes for free](evaluation.md). -You need a valid license key to use EDB Postgres for Kubernetes in production. +You can [evaluate EDB Postgres for Kubernetes for free](evaluation.md) as part of a trial subscription. +You need a valid EDB subscription to use EDB Postgres for Kubernetes in production. ## Supported releases and Kubernetes distributions From a2866074d8661fdf090cf81a3543973ffeff4515 Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Tue, 17 Dec 2024 11:47:51 -0500 Subject: [PATCH 14/25] Small change to formatting. --- .../docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx index 65107a884d4..90986ee5dca 100644 --- a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx +++ b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx @@ -97,13 +97,13 @@ Steps for retrieving the document: (2 rows) ``` -### Accessing nested fields +## Accessing nested fields MongoDB Foreign Data Wrapper allows you to access individual fields within nested JSON documents by mapping the nested structure to columns in a foreign table. This works by mapping the nested structure of the MongoDB document to relational columns in the foreign table definition, using dot notation (key2.subkey21) to reference nested fields. You can retrieve these fields from a collection as shown in the following example: -#### Example +### Example ```text db1> db.test_sub_json.find() From a6d08775b3323e8f0931591f64de3166fac56d2b Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Tue, 17 Dec 2024 11:52:29 -0500 Subject: [PATCH 15/25] Typo. 
--- .../docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx index 90986ee5dca..3b3b39cd336 100644 --- a/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx +++ b/product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fdw.mdx @@ -132,7 +132,7 @@ Steps for retrieving sub-fields from the document: "key2.subkey22" varchar, "key2.subtstmp" timestamp )SERVER mongo_server - OPTIONS (database 'db1', collection 'test-sub_json'); + OPTIONS (database 'db1', collection 'test_sub_json'); ``` 1. Retrieve the document with sub-fields: From 4e6caa5460dd813545f60981f68aee42691aa98a Mon Sep 17 00:00:00 2001 From: Josh Earlenbaugh Date: Tue, 17 Dec 2024 12:13:28 -0500 Subject: [PATCH 16/25] Fixed typo. --- product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx index 470286ea671..70e7a1e51fd 100644 --- a/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx +++ b/product_docs/docs/postgres_for_kubernetes/1/iron-bank.mdx @@ -109,7 +109,7 @@ For example, to deploy a PG4K Cluster using the EPAS 16 operand: metadata: name: cluster-example-full spec: - imageName: registry1.dso.mil/ironbank/enterprisedb/edb-postgres-advanced-16@16 + imageName: registry1.dso.mil/ironbank/enterprisedb/edb-postgres-advanced-17:17 imagePullSecrets: - name: my_ironbank_secret ``` From 0e0569a95de8524182bc0690ef0644fbbd4650c3 Mon Sep 17 00:00:00 2001 From: Dj Walker-Morgan Date: Wed, 18 Dec 2024 14:17:46 +0000 Subject: [PATCH 17/25] Update installation_upgrade.mdx fix typo --- .../1/installation_upgrade.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/product_docs/docs/postgres_distributed_for_kubernetes/1/installation_upgrade.mdx b/product_docs/docs/postgres_distributed_for_kubernetes/1/installation_upgrade.mdx index f7abf1abe6e..803efcd66ec 100644 --- a/product_docs/docs/postgres_distributed_for_kubernetes/1/installation_upgrade.mdx +++ b/product_docs/docs/postgres_distributed_for_kubernetes/1/installation_upgrade.mdx @@ -36,7 +36,7 @@ helm repo add edb \ ## Set the environment variables -Set the environment variables for the `REPOSITORY_NAME` and `REPOSITORY_NAME`: +Set the environment variables for the `REPOSITORY_NAME` and `EDB_SUBSCRIPTION_TOKEN`: 1. Set `REPOSITORY_NAME` to the name of the repository. 
In this example, the images come from the `k8s_enterprise_pgd` repository: From db354265e19b4be37221ac727477b9930b2b9ff9 Mon Sep 17 00:00:00 2001 From: Dj Walker-Morgan Date: Fri, 1 Nov 2024 18:15:09 +0000 Subject: [PATCH 18/25] First pass update Signed-off-by: Dj Walker-Morgan --- .../analytics/external_tables.mdx | 46 +++ .../edb-postgres-ai/analytics/reference.mdx | 302 ------------------ .../analytics/reference/datasets.mdx | 35 ++ .../analytics/reference/deltatables.mdx | 25 ++ .../analytics/reference/directscan.mdx | 57 ++++ .../analytics/reference/index.mdx | 26 ++ .../analytics/reference/instances.mdx | 31 ++ .../analytics/reference/loadingdata.mdx | 65 ++++ .../reference/providers_and_regions.mdx | 36 +++ .../analytics/reference/queries.mdx | 44 +++ .../analytics/reference/users.mdx | 10 + 11 files changed, 375 insertions(+), 302 deletions(-) create mode 100644 advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx delete mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference.mdx create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/datasets.mdx create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/directscan.mdx create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/index.mdx create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/instances.mdx create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/queries.mdx create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/users.mdx diff --git a/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx b/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx new file mode 100644 index 00000000000..f31e0537a1c --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx @@ -0,0 +1,46 @@ +--- +title: Querying Delta tables in S3-compatible object storage +navTitle: External Tables +description: Access and Query data in S3-compatible object storage using External Tables +deepToC: true +--- + +## Overview + +External tables allow you to access and query data stored in S3-compatible object storage using SQL. You can create an external table that references data in S3-compatible object storage and query the data using standard SQL commands. + +## Prerequisites + +An EDB Postgres AI account and a Lakehouse node. + +## Creating an External Storage Location + +The first step is to create an external storage location which references S3-compatible object storage where your data resides. This is done by using SQL to execute the `pgaa.create_storage_location` function. `pgaa` is the name of the extension and namespace that provides the functionality to query external storage locations. + +The following example creates an external table that references an S3-compatible object storage location: + +```sql +SELECT pgaa.create_storage_location('sample-data', 's3://pgaa-sample-data-eu-west-1'); +``` + +The function takes a name for the new storage location, and the URI of the S3-compatible object storage location. + +## Creating an External Table + +After creating the external storage location, you can create an external table that references the data in the storage location. 
The following example creates an external table that references a Delta table in the S3-compatible object storage location: + +```sql +CREATE TABLE public.customer () USING PGAA WITH (pgaa.storage_location = 'sample-data', pgaa.path = 'tpch_sf_1/customer'); +``` + +## Querying an External Table + +After creating the external table, you can query the data in the external table using standard SQL commands. The following example queries the external table created in the previous step: + +```sql +SELECT COUNT(*) FROM public.customer; +``` + + + + diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference.mdx deleted file mode 100644 index 848981e983f..00000000000 --- a/advocacy_docs/edb-postgres-ai/analytics/reference.mdx +++ /dev/null @@ -1,302 +0,0 @@ ---- -title: Reference - EDB Postgres Lakehouse -navTitle: Reference -description: Things to know about EDB Postgres Lakehouse -deepToC: true ---- - -Postgres Lakehouse is an early product. Eventually, it will support deployment -modes across multiple clouds and on-premises. However, currently it's fairly -limited in terms of where you can deploy it and what data you can query with it. - -To get the best experience with Postgres Lakehouse, you should follow the -"quick start" guide to query benchmarking data. Then you can try loading your -own data with Lakehouse Sync. If you're intrigued, reach out to us and -we can talk more about your use case and potential opportunities. - -This page details some of the important bits to know. - -## Supported cloud providers and regions - -**AWS only**: Currently, support for all Lakehouse features (Lakehouse nodes, -Managed Storage Locations, and Lakehouse Sync) is limited to AWS. - -**EDB-hosted only**: "Bring Your Own Account" (BYOA) regions are NOT currently -supported for Lakehouse resources. Support is limited to -ONLY **EDB Postgres® AI - Hosted** environments on AWS (a.k.a. "EDB-Hosted AWS regions"). - -This means you can select from one of the following regions: - -* North America - * US East 1 - * US East 2 - * US West 2 -* Europe - * EU Central 1 - * EU West 1 - * EU West 2 -* Asia - * AP South 1 -* Australia - * AP SouthEast 2 - -To be precise: - -* Lakehouse nodes can only be provisioned in EDB-hosted AWS regions -* Managed Storage Locations can only be created in EDB-hosted AWS regions -* Lakehouse Sync can only sync from source databases in EDB-hosted AWS regions - -These limitations will be removed as we continue to improve the product. Eventually, -we will support BYOA, as well as Azure and GCP, for all Lakehouse use cases. We -will also add better support for "external" buckets ("bring your own bucket"). - -## Supported AWS instances - -When deploying a Lakehouse node, you must choose an instance type from -the `m6id` family of instances. Importantly, these instances come with NVMe -drives attached to them. - -**Instances are ephemeral.** These NVMe drives are used only for spill-out space -*while processing queries, and for caching Delta Tables on disk. -All data on the NVMe drives will be lost when the cluster is shutdown. - -**System tables are persisted.** Persistent data in system tables (users, roles, -*etc.) is stored in an attached -block storage device, and will survive a pause/resume cycle. 
- -**Supported instances** - -| API Name | Memory | vCPUs | Cores | Storage | -| --------------- | --------- | --------- | ----- | ------------------------------- | -| `m6id.large` | 8.0 GiB | 2 vCPUs | 1 | 118 GB NVMe SSD | -| `m6id.xlarge` | 16.0 GiB | 4 vCPUs | 2 | 237 GB NVMe SSD | -| `m6id.2xlarge` | 32.0 GiB | 8 vCPUs | 4 | 474 GB NVMe SSD | -| `m6id.4xlarge` | 64.0 GiB | 16 vCPUs | 8 | 950 GB NVMe SSD | -| `m6id.8xlarge` | 128.0 GiB | 32 vCPUs | 16 | 1900 GB NVMe SSD | -| `m6id.12xlarge` | 192.0 GiB | 48 vCPUs | 24 | 2850 GB (2 \* 1425 GB NVMe SSD) | -| `m6id.16xlarge` | 256.0 GiB | 64 vCPUs | 32 | 3800 GB (2 \* 1900 GB NVMe SSD) | -| `m6id.24xlarge` | 384.0 GiB | 96 vCPUs | 48 | 5700 GB (4 \* 1425 GB NVMe SSD) | -| `m6id.32xlarge` | 512.0 GiB | 128 vCPUs | 64 | 7600 GB (4 \* 1900 GB NVMe SSD) | - -## Available benchmarking datasets - -When you provision a Lakehouse node, it comes pre-configured to point to a public -S3 bucket in its same region, containing sample benchmarking datasets. - -You can query tables in these datasets by referencing them with their schema -name. - -| Schema Name | Dataset | -| --------------- | ---------------------------- | -| `tpcds_sf_1` | TPC-DS, Scale Factor 1 | -| `tpcds_sf_10` | TPC-DS, Scale Factor 10 | -| `tpcds_sf_100` | TPC-DS, Scale Factor 100 | -| `tpcds_sf_1000` | TPC-DS, Scale Factor 1000 | -| `tpch_sf_1` | TPC-H, Scale Factor 1 | -| `tpch_sf_10` | TPC-H, Scale Factor 10 | -| `tpch_sf_100` | TPC-H, Scale Factor 100 | -| `tpch_sf_1000` | TPC-H, Scale Factor 1000 | -| `clickbench` | ClickBench, 100 million rows | -| `brc_1b` | Billion row challenge | - -!!!note Notes about ClickBench data: - -Data columns (`EventData`) are integers, not dates. - -You must quote ClickBench column names, because they contain uppercase letters, -but unquoted identifiers in Postgres are case-insensitive. For example: - -✅ `select "Title" from clickbench.hits;` - -🚫 `select Title from clickbench.hits;` -!!! - -## User management - -When you provision a Lakehouse node, you must provide a password. We do not -save this password. You will need it to login as the `edb_admin` user. This is -not a superuser account, but it does have the ability to create users and roles -and grants. Thus, you can either share the credentials for `edb_admin` itself, -or you can create other users and distribute those. - -## Gotcha: Do not set `search_path` - -Do not set `search_path`. Always reference fully qualified table names. - -Using `search_path` makes Postgres Lakehouse fall back to PostgreSQL, -dramatically impacting query performance. To avoid this, qualify all table names -in your query with a schema. - -For example: - -**🚫 Do NOT do this!** - -```sql ---- DO NOT DO THIS -SET search_path = tpch_sf_10; -SELECT COUNT(*) FROM lineitem; -``` - -**✅ Do this instead!** - -```sql -SELECT COUNT(*) FROM tpch_sf_10.lineitem -``` - -## Supported queries - -In general, **READ ONLY** queries are supported. You cannot write directly to -object storage. This includes all Postgres built-in functions, statements -and types. It also includes any of those provided by EPAS or PGE, depending on -which distribution you choose to deploy. - -In general, you cannot insert, update, delete or otherwise modify data. You -cannot `CREATE TABLE`. You must load data into the bucket out-of-band, either -with your own ETL scripts or with Lakehouse Sync. See "Advanced: Bring Your Own -Data" for more details. (In the future, we will be making this more usable with -a custom DDL). 
- -One exception is Postgres system tables, such as those used for storing users, -roles, and grants. These tables are stored on the local block device, which is -included in backups and restores. So you can `CREATE USER` or `CREATE ROLE` or -`GRANT USAGE`, and these users/roles/grants will survive restarts and restores. - -## DirectScan vs. fallback modes and EXPLAIN - -Postgres Lakehouse is fastest when it can "push down" your entire query to -DataFusion, the vectorized query used for handling queries when possible. (In the -future, this will be more fine-grained as we add support for partial pushdowns.) - -Postgres Lakehouse can execute your query in two modes. First, it attempts to -run the entire query using Seafowl (a dedicated columnar database based on -DataFusion). If Seafowl can't run the entire query, for example, because it -uses PostgreSQL-specific operations like JSON, then Postgres Lakehouse will fall -back to using the PostgreSQL executor, with Seafowl streaming full table -contents to it. - -If your query is extremely slow, it's possible that's what's happening. - -You can check which mode is being used by running an `EXPLAIN` on the query and -making sure that the top-most query node is `SeafowlDirectScan`. For example: - -``` -explain select count from (select count(*) from tpch_sf_1.lineitem); - QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- - Aggregate (cost=167.52..167.55 rows=1 width=8) - -> Append (cost=0.00..165.01 rows=1001 width=0) - -> Seq Scan on lineitem lineitem_1 (cost=0.00..0.00 rows=1 width=0) - -> SeafowlScan on "16529" lineitem_2 (cost=100.00..150.00 rows=1000 width=0) - SeafowlPlan: logical_plan - TableScan: tpch_sf_1.lineitem projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment] -(6 rows) -``` - - -In this case, the query is executed by PostgreSQL and Seafowl is only involved -when scanning the table (see `SeafowlScan` at the bottom). The fix in this case is -to explicitly name the inner `COUNT(*)` column, since Seafowl gives it an implicit -name `count(*)` whereas PostgreSQL calls it `count`: - - -``` -edb_admin=> explain select count from (select count(*) as count from tpch_sf_1.lineitem); - QUERY PLAN --------------------------------------------------------------------- - SeafowlDirectScan: logical_plan - Projection: COUNT(*) AS count - Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]] - TableScan: tpch_sf_1.lineitem projection=[] -(4 rows) -``` - -Here, we can see the `SeafowlDirectScan` at the top, which means that Seafowl is -running the entire query. - -If you're having trouble rewording your query to make it run fully on Seafowl, -open a support ticket. - -## Load data with Lakehouse sync - -If you have a transactional database running in EDB Postgres AI Cloud Service, -then you can sync tables from this database into a Managed Storage Location. - -A more detailed guide for this is forthcoming. If you want to try it yourself, -look in the UI for "Migrations" or "Sync to Lakehouse." - -## Advanced: Bring your own data - -It's possible to point your Lakehouse node at an arbitrary S3 bucket with Delta -Tables inside of it. 
However, this comes with some major caveats (which will -eventually be resolved): - -### Caveats - -* The bucket must be publicly accessible. - * If you want to use a private bucket, this is technically possible, but -requires some manual action on our side and your side (to assign the correct -IAM policies). Let us know if you want to try it. We will be adding -proper support for private, external buckets in the near future. -* The tables must be stored as [Delta Tables](http://github.com/delta-io/delta/blob/master/PROTOCOL.md) within the location -* A “Delta Table” is a folder of Parquet files along with some JSON metadata. -* Each table must be prefixed with a `$schema/$table/` where `$schema` and `$table` are valid Postgres identifiers (i.e. < 64 characters) - * For example, this is a valid Delta Table that will be recognized by Beacon Analytics: - * `my_schema/my_table/{part1.parquet, part2.parquet, _delta_log}` - * These `$schema` and `$table` identifiers will be queryable in the Lakehouse node, e.g.: - * `SELECT count(*) FROM my_schema.my_table;` - * This Delta Table will NOT be recognized by Beacon Analytics (missing a schema): - * `my_table/{part1.parquet, part2.parquet, _delta_log}` - - -### Loading your own data - -* You can use the [deltalake](https://pypi.org/project/deltalake/) Python library -to create Delta Tables and write to the bucket -* You can also use the [`lakehouse-loader`](https://github.com/splitgraph/lakehouse-loader) utility -we created for this, to export data from an arbitrary Postgres instance to Lakehouse Tables -in a storage bucket. - -For example, with the `lakehouse-loader` utility: - -```bash -export PGPASSWORD="..." -export AWS_ACCESS_KEY_ID="..." -export AWS_SECRET_ACCESS_KEY="..." -# export other AWS envvars - -./lakehouse-loader postgres-to-delta postgres://test-user@localhost:5432/test-db -q "SELECT * FROM some_table" s3://my-bucket/my_schema/my_table -``` - -### Pointing to your bucket - -By default, each Lakehouse node is configured to point to a bucket with -benchmarking datasets inside. To point it to a different bucket, you can -call the `seafowl.set_bucket_location` function: - -```sql -SELECT seafowl.set_bucket_location('{"region": "ap-south-1", "bucket": "my-bucket", "public": true}'); -``` - -### Querying your own data - -In the example above, after you've called `set_bucket_location`, you will be able -to query data in `my_schema.my_table`: - -```sql -SELECT * FROM some_table; -``` - -Note that using an S3 bucket that isn't in the same region as your node -will 1) be slow because of cross-region latencies, and 2) will incur -AWS costs (between $0.01 and $0.02 / GB) for data transfer! Currently these -egress costs are not passed through to you but we do track them and reserve -the right to terminate an instance. - -### Switching back to sample data - -To switch the bucket back to the default sample bucket in the same region as your node: - -```sql -SELECT seafowl.set_bucket_location(NULL) -``` - diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/datasets.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/datasets.mdx new file mode 100644 index 00000000000..d4b95cf309f --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/datasets.mdx @@ -0,0 +1,35 @@ +--- +title: Benchmarking Datasets +description: Benchmarking datasets available for Lakehouse +--- + +When you provision a Lakehouse node, it comes pre-configured to point to a public +S3 bucket in its same region, containing sample benchmarking datasets. 
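For example, a quick way to confirm the sample data is reachable is to count rows in one of the TPC-H tables (a minimal sketch; it assumes the node is still pointed at the default sample bucket for its region):

```sql
-- Count the rows in the TPC-H scale factor 1 lineitem table
SELECT COUNT(*) FROM tpch_sf_1.lineitem;
```
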
+ +You can query tables in these datasets by referencing them with their schema +name. + +| Schema Name | Dataset | +| --------------- | ---------------------------- | +| `tpcds_sf_1` | TPC-DS, Scale Factor 1 | +| `tpcds_sf_10` | TPC-DS, Scale Factor 10 | +| `tpcds_sf_100` | TPC-DS, Scale Factor 100 | +| `tpcds_sf_1000` | TPC-DS, Scale Factor 1000 | +| `tpch_sf_1` | TPC-H, Scale Factor 1 | +| `tpch_sf_10` | TPC-H, Scale Factor 10 | +| `tpch_sf_100` | TPC-H, Scale Factor 100 | +| `tpch_sf_1000` | TPC-H, Scale Factor 1000 | +| `clickbench` | ClickBench, 100 million rows | +| `brc_1b` | Billion row challenge | + +!!!note Notes about ClickBench data: + +Data columns (`EventData`) are integers, not dates. + +You must quote ClickBench column names, because they contain uppercase letters, +but unquoted identifiers in Postgres are case-insensitive. For example: + +✅ `select "Title" from clickbench.hits;` + +🚫 `select Title from clickbench.hits;` +!!! diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx new file mode 100644 index 00000000000..540540b7bf2 --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx @@ -0,0 +1,25 @@ +--- +title: Delta Table tools +navTitle: Delta Table tools +description: Tools for working with Delta Tables +--- + +## Creating Delta Tables + +* You can use the [deltalake](https://pypi.org/project/deltalake/) Python library +to create Delta Tables and write to the bucket +* You can also use the [`lakehouse-loader`](https://github.com/splitgraph/lakehouse-loader) utility +we created for this, to export data from an arbitrary Postgres instance to Lakehouse Tables +in a storage bucket. + +For example, with the `lakehouse-loader` utility: + +```bash +export PGPASSWORD="..." +export AWS_ACCESS_KEY_ID="..." +export AWS_SECRET_ACCESS_KEY="..." +# export other AWS envvars + +./lakehouse-loader postgres-to-delta postgres://test-user@localhost:5432/test-db -q "SELECT * FROM some_table" s3://my-bucket/my_schema/my_table +``` + diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/directscan.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/directscan.mdx new file mode 100644 index 00000000000..f1636c60902 --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/directscan.mdx @@ -0,0 +1,57 @@ +--- +title: DirectScan, fallback modes, and EXPLAIN +navTitle: DirectScan +description: Lakehouse is fastest when it can "push down" your entire query to DataFusion. This explains how to check if your query is running in DirectScan mode. +--- + +Postgres Lakehouse is fastest when it can "push down" your entire query to +DataFusion, the vectorized query used for handling queries when possible. (In the +future, this will be more fine-grained as we add support for partial pushdowns.) + +Postgres Lakehouse can execute your query in two modes. First, it attempts to +run the entire query using Seafowl (a dedicated columnar database based on +DataFusion). If Seafowl can't run the entire query, for example, because it +uses PostgreSQL-specific operations like JSON, then Postgres Lakehouse will fall +back to using the PostgreSQL executor, with Seafowl streaming full table +contents to it. + +If your query is extremely slow, it's possible that's what's happening. + +You can check which mode is being used by running an `EXPLAIN` on the query and +making sure that the top-most query node is `SeafowlDirectScan`. 
For example: + +``` +explain select count from (select count(*) from tpch_sf_1.lineitem); + QUERY PLAN +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ + Aggregate (cost=167.52..167.55 rows=1 width=8) + -> Append (cost=0.00..165.01 rows=1001 width=0) + -> Seq Scan on lineitem lineitem_1 (cost=0.00..0.00 rows=1 width=0) + -> SeafowlScan on "16529" lineitem_2 (cost=100.00..150.00 rows=1000 width=0) + SeafowlPlan: logical_plan + TableScan: tpch_sf_1.lineitem projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment] +(6 rows) +``` + + +In this case, the query is executed by PostgreSQL and Seafowl is only involved +when scanning the table (see `SeafowlScan` at the bottom). The fix in this case is +to explicitly name the inner `COUNT(*)` column, since Seafowl gives it an implicit +name `count(*)` whereas PostgreSQL calls it `count`: + + +``` +edb_admin=> explain select count from (select count(*) as count from tpch_sf_1.lineitem); + QUERY PLAN +-------------------------------------------------------------------- + SeafowlDirectScan: logical_plan + Projection: COUNT(*) AS count + Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]] + TableScan: tpch_sf_1.lineitem projection=[] +(4 rows) +``` + +Here, we can see the `SeafowlDirectScan` at the top, which means that Seafowl is +running the entire query. + +If you're having trouble rewording your query to make it run fully on Seafowl, open a support ticket. diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/index.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/index.mdx new file mode 100644 index 00000000000..59f09f55906 --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/index.mdx @@ -0,0 +1,26 @@ +--- +title: Reference - EDB Postgres® AI Lakehouse +navTitle: Reference +description: Things to know about EDB Postgres® AI Lakehouse +deepToC: true +navigation: +- providers_and_regions +- instances +- datasets +- queries +- deltatables +- directscan +- users +- loading_data +--- + +EDB Postgres® AI Lakehouse is an early product. Eventually, it will support deployment +modes across multiple clouds and on-premises. However, currently it's fairly +limited in terms of where you can deploy it and what data you can query with it. + +To get the best experience with Lakehouse, you should follow the +"quick start" guide to query benchmarking data. Then you can try loading your +own data with Lakehouse Sync. If you're intrigued, reach out to us and +we can talk more about your use case and potential opportunities. + +This section details some of the important things you should know. diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/instances.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/instances.mdx new file mode 100644 index 00000000000..c5a12cef84d --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/instances.mdx @@ -0,0 +1,31 @@ +--- +title: Supported AWS Instances +description: Supported AWS instances for Lakehouse +--- + +When deploying a Lakehouse node, you must choose an instance type from +the `m6id` family of instances. Importantly, these instances come with NVMe +drives attached to them. 
+ +**Instances are ephemeral.** These NVMe drives are used only for spill-out space +*while processing queries, and for caching Delta Tables on disk. +All data on the NVMe drives will be lost when the cluster is shutdown. + +**System tables are persisted.** Persistent data in system tables (users, roles, +*etc.) is stored in an attached +block storage device, and will survive a pause/resume cycle. + +**Supported instances** + +| API Name | Memory | vCPUs | Cores | Storage | +| --------------- | --------- | --------- | ----- | ------------------------------- | +| `m6id.large` | 8.0 GiB | 2 vCPUs | 1 | 118 GB NVMe SSD | +| `m6id.xlarge` | 16.0 GiB | 4 vCPUs | 2 | 237 GB NVMe SSD | +| `m6id.2xlarge` | 32.0 GiB | 8 vCPUs | 4 | 474 GB NVMe SSD | +| `m6id.4xlarge` | 64.0 GiB | 16 vCPUs | 8 | 950 GB NVMe SSD | +| `m6id.8xlarge` | 128.0 GiB | 32 vCPUs | 16 | 1900 GB NVMe SSD | +| `m6id.12xlarge` | 192.0 GiB | 48 vCPUs | 24 | 2850 GB (2 \* 1425 GB NVMe SSD) | +| `m6id.16xlarge` | 256.0 GiB | 64 vCPUs | 32 | 3800 GB (2 \* 1900 GB NVMe SSD) | +| `m6id.24xlarge` | 384.0 GiB | 96 vCPUs | 48 | 5700 GB (4 \* 1425 GB NVMe SSD) | +| `m6id.32xlarge` | 512.0 GiB | 128 vCPUs | 64 | 7600 GB (4 \* 1900 GB NVMe SSD) | + diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx new file mode 100644 index 00000000000..f0692ed191f --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx @@ -0,0 +1,65 @@ +--- +title: Loading data (sync or bring your own) +navTitle: Loading data +description: How to load data into Lakehouse +--- + +## Loading data with Lakehouse sync + +If you have a transactional database running in EDB Postgres AI Cloud Service, +then you can sync tables from this database into a Managed Storage Location. + +A more detailed guide for this is forthcoming. If you want to try it yourself, +see ["How to lakehouse sync"](../how_to_lakehouse_sync). + +## Bringing your own data + +It's possible to point your Lakehouse node at an arbitrary S3 bucket with Delta +Tables inside of it. However, this comes with some major caveats (which will +eventually be resolved): + +### Caveats + +* The tables must be stored as [Delta Tables](http://github.com/delta-io/delta/blob/master/PROTOCOL.md) within the location +* A “Delta Table” is a folder of Parquet files along with some JSON metadata. +* Each table must be prefixed with a `$schema/$table/` where `$schema` and `$table` are valid Postgres identifiers (i.e. < 64 characters) + * For example, this is a valid Delta Table that will be recognized by Beacon Analytics: + * `my_schema/my_table/{part1.parquet, part2.parquet, _delta_log}` + * These `$schema` and `$table` identifiers will be queryable in the Lakehouse node, e.g.: + * `SELECT count(*) FROM my_schema.my_table;` + * This Delta Table will NOT be recognized by Beacon Analytics (missing a schema): + * `my_table/{part1.parquet, part2.parquet, _delta_log}` + + + + +### Pointing to your bucket + +By default, each Lakehouse node is configured to point to a bucket with +benchmarking datasets inside. 
To point it to a different bucket, you can +call the `pgaa.create_storage_location` function: + +```sql +SELECT pgaa.create_storage_location.set_bucket_location('mystore', 's3://my-bucket'); +``` + +### Querying your own data + +In the example above, after you've called `pgaa.create_stn`, you will be able +to query data in `my_schema.my_table`: + +```sql +CREATE TABLE public.tablename () USING PGAA WITH (pgaa.storage_location = 'mystore', pgaa.path = 'schemaname/tablename'); +``` + +Then you can query the table: + +```sql +SELECT COUNT(*) FROM public.tablename; +``` + +Note that using an S3 bucket that isn't in the same region as your node +will 1) be slow because of cross-region latencies, and 2) will incur +AWS costs (between $0.01 and $0.02 / GB) for data transfer! Currently these +egress costs are not passed through to you but we do track them and reserve +the right to terminate an instance. diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx new file mode 100644 index 00000000000..2298ca2abc9 --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx @@ -0,0 +1,36 @@ +--- +title: Supported cloud providers and regions +description: Supported cloud providers and regions for Lakehouse +--- + +**AWS only**: Currently, support for all Lakehouse features (Lakehouse nodes, +Managed Storage Locations, and Lakehouse Sync) is limited to AWS. + +**EDB-hosted only**: "Bring Your Own Account" (BYOA) regions are NOT currently +supported for Lakehouse resources. Support is limited to +ONLY **EDB Postgres® AI Hosted** environments on AWS (a.k.a. "EDB-Hosted AWS regions"). + +This means you can select from one of the following regions: + +* North America + * US East 1 + * US East 2 + * US West 2 +* Europe + * EU Central 1 + * EU West 1 + * EU West 2 +* Asia + * AP South 1 +* Australia + * AP SouthEast 2 + +To be precise: + +* Lakehouse nodes can only be provisioned in EDB-hosted AWS regions +* Managed Storage Locations can only be created in EDB-hosted AWS regions +* Lakehouse Sync can only sync from source databases in EDB-hosted AWS regions + +These limitations will be removed as we continue to improve the product. Eventually, +we will support BYOA, as well as Azure and GCP, for all Lakehouse use cases. We +will also add better support for "external" buckets ("bring your own bucket"). diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/queries.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/queries.mdx new file mode 100644 index 00000000000..a6fd3d21adb --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/queries.mdx @@ -0,0 +1,44 @@ +--- +title: Queries +description: Supported queries in Lakehouse and best practices when composing them +--- + +In general, **READ ONLY** queries are supported. You cannot write directly to +object storage. This includes all Postgres built-in functions, statements +and types. It also includes any of those provided by EPAS or PGE, depending on +which distribution you choose to deploy. + +In general, you cannot insert, update, delete or otherwise modify data. You +cannot `CREATE TABLE`. You must load data into the bucket out-of-band, either +with your own ETL scripts or with Lakehouse Sync. See "Advanced: Bring Your Own +Data" for more details. (In the future, we will be making this more usable with +a custom DDL). 
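As an illustration (a sketch only, using the sample TPC-H data that Lakehouse nodes are pre-configured with), a read-only aggregate is supported, while any statement that modifies the same table is rejected:

```sql
-- Supported: read-only analytical query (note the schema-qualified table name)
SELECT l_returnflag, COUNT(*) AS line_count
FROM tpch_sf_1.lineitem
GROUP BY l_returnflag;

-- Not supported: writes to object-storage-backed tables fail
-- INSERT INTO tpch_sf_1.lineitem SELECT * FROM tpch_sf_1.lineitem LIMIT 1;
```
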
+
+One exception is Postgres system tables, such as those used for storing users,
+roles, and grants. These tables are stored on the local block device, which is
+included in backups and restores. So you can `CREATE USER` or `CREATE ROLE` or
+`GRANT USAGE`, and these users/roles/grants will survive restarts and restores.
+
+## Gotcha: Do not set `search_path`
+
+Do not set `search_path`. Always reference fully qualified table names.
+
+Using `search_path` makes Lakehouse fall back to PostgreSQL,
+dramatically impacting query performance. To avoid this, qualify all table names
+in your query with a schema.
+
+For example:
+
+**🚫 Do NOT do this!**
+
+```sql
+--- DO NOT DO THIS
+SET search_path = tpch_sf_10;
+SELECT COUNT(*) FROM lineitem;
+```
+
+**✅ Do this instead!**
+
+```sql
+SELECT COUNT(*) FROM tpch_sf_10.lineitem
+```
diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/users.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/users.mdx
new file mode 100644
index 00000000000..8ae12e6e06d
--- /dev/null
+++ b/advocacy_docs/edb-postgres-ai/analytics/reference/users.mdx
@@ -0,0 +1,10 @@
+---
+title: User management
+description: Managing users in Lakehouse
+---
+
+When you provision a Lakehouse node, you must provide a password. We do not
+save this password. You will need it to log in as the `edb_admin` user. This is
+not a superuser account, but it does have the ability to create users and roles
+and grants. Thus, you can either share the credentials for `edb_admin` itself,
+or you can create other users and distribute those.

From dd21307f0718a5f6447fb5bb5cb5f22c66ffebd8 Mon Sep 17 00:00:00 2001
From: Dj Walker-Morgan
Date: Mon, 4 Nov 2024 10:27:08 +0000
Subject: [PATCH 19/25] Revisions as per review comments.

Signed-off-by: Dj Walker-Morgan
---
 .../analytics/external_tables.mdx             | 42 +++++++++++----
 .../analytics/reference/deltatables.mdx       | 19 ++++---
 .../analytics/reference/directscan.mdx        | 19 +++----
 .../analytics/reference/functions.mdx         | 36 +++++++++++++
 .../analytics/reference/index.mdx             | 13 +++--
 .../analytics/reference/loadingdata.mdx       | 46 +++++++---------
 .../reference/providers_and_regions.mdx       |  3 +-
 .../analytics/reference/queries.mdx           | 52 +++++--------------
 8 files changed, 123 insertions(+), 107 deletions(-)
 create mode 100644 advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx

diff --git a/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx b/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx
index f31e0537a1c..c83b8d9cd86 100644
--- a/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx
+++ b/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx
@@ -1,7 +1,7 @@
 ---
-title: Querying Delta tables in S3-compatible object storage
+title: Querying Delta Lake Tables in S3-compatible object storage
 navTitle: External Tables
-description: Access and Query data in S3-compatible object storage using External Tables
+description: Access and Query data stored as Delta Lake Tables in S3-compatible object storage using External Tables
 deepToC: true
 ---
 
@@ -11,28 +11,52 @@ External tables allow you to access and query data stored in S3-compatible objec
 
 ## Prerequisites
 
-An EDB Postgres AI account and a Lakehouse node.
+* An EDB Postgres AI account and a Lakehouse node.
+* An S3-compatible object storage location with data stored as Delta Lake Tables.
+  * See [Bringing your own data](../loadingdata) for more information on how to prepare your data.
+
+* Credentials to access the S3-compatible object storage location, unless it is a public bucket.
+  * These credentials will be stored within the database. We recommend creating a separate user with limited permissions for this purpose.
+
+!!! Note Regions, latency and cost
+Using an S3 bucket that isn't in the same region as your node will
+
+* be slow because of cross-region latencies
+* incur AWS costs (between $0.01 and $0.02 / GB) for data transfer. Currently these egress costs are not passed through to you but we do track them and reserve the right to terminate an instance.
+!!!
 
 ## Creating an External Storage Location
 
-The first step is to create an external storage location which references S3-compatible object storage where your data resides. This is done by using SQL to execute the `pgaa.create_storage_location` function. `pgaa` is the name of the extension and namespace that provides the functionality to query external storage locations.
+The first step is to create an external storage location which references S3-compatible object storage where your data resides. A storage location is an object within the database which you refer to to access the data; each storage location has a name for this purpose. 
+
+Creating a named storage location is performed with SQL by executing the `pgaa.create_storage_location` function.
+`pgaa` is the name of the extension and namespace that provides the functionality to query external storage locations.
+The `create_storage_location` function takes a name for the new storage location, and the URI of the S3-compatible object storage location as parameters.
+The function optionally can take a third parameter, `options`, which is a JSON object for specifying optional settings, detailed in the [functions reference](reference/functions#create_storage_location).
+For example, in the options, you can specify the access key ID and secret access key for the storage location to enable access to a private bucket.
 
-The following example creates an external table that references an S3-compatible object storage location:
+The following example creates an external storage location that references a public S3-compatible object storage location:
 
 ```sql
 SELECT pgaa.create_storage_location('sample-data', 's3://pgaa-sample-data-eu-west-1');
 ```
 
-The function takes a name for the new storage location, and the URI of the S3-compatible object storage location.
+The next example creates an external storage location that references a private S3-compatible object storage location:
+
+```sql
+SELECT pgaa.create_storage_location('private-data', 's3://my-private-bucket', '{"access_key_id": "my-access-key-id","secret_access_key": "my-secret-access-key"}');
+```
+
 ## Creating an External Table
 
-After creating the external storage location, you can create an external table that references the data in the storage location. The following example creates an external table that references a Delta table in the S3-compatible object storage location:
+After creating the external storage location, you can create an external table that references the data in the storage location.
+The following example creates an external table that references a Delta Lake Table in the S3-compatible object storage location:
 
 ```sql
 CREATE TABLE public.customer () USING PGAA WITH (pgaa.storage_location = 'sample-data', pgaa.path = 'tpch_sf_1/customer');
 ```
 
+Note that the schema is not defined in the `CREATE TABLE` statement. 
The pgaa extension expects the schema to be defined in the storage location, and the schema itself is derived from the schema stored at the path specified in the `pgaa.path` option. The pgaa extension will infer the best Postgres-equivelant data types for the columns in the Delta Table. + ## Querying an External Table After creating the external table, you can query the data in the external table using standard SQL commands. The following example queries the external table created in the previous step: @@ -40,7 +64,3 @@ After creating the external table, you can query the data in the external table ```sql SELECT COUNT(*) FROM public.customer; ``` - - - - diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx index 540540b7bf2..06ad28b7279 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx @@ -1,16 +1,18 @@ --- -title: Delta Table tools +title: Delta Lake Table tools navTitle: Delta Table tools -description: Tools for working with Delta Tables +description: Tools for working with Delta Lake Tables --- -## Creating Delta Tables +## Creating Delta Lake Tables -* You can use the [deltalake](https://pypi.org/project/deltalake/) Python library -to create Delta Tables and write to the bucket -* You can also use the [`lakehouse-loader`](https://github.com/splitgraph/lakehouse-loader) utility -we created for this, to export data from an arbitrary Postgres instance to Lakehouse Tables -in a storage bucket. +### Using the `deltalake` Python library + +You can use the [deltalake](https://pypi.org/project/deltalake/) Python library to create Delta Tables and write to the bucket + +### Using the `lakehouse-loader` utility + +You can also use the [`lakehouse-loader`](https://github.com/splitgraph/lakehouse-loader) utility that EDB has created for this task, to export data from an arbitrary Postgres instance to Lakehouse Tables in a storage bucket. For example, with the `lakehouse-loader` utility: @@ -23,3 +25,4 @@ export AWS_SECRET_ACCESS_KEY="..." ./lakehouse-loader postgres-to-delta postgres://test-user@localhost:5432/test-db -q "SELECT * FROM some_table" s3://my-bucket/my_schema/my_table ``` +This will export the data from the `some_table` table in the `test-db` database to a Delta Table in the `my_schema/my_table` path in the `my-bucket` bucket. diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/directscan.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/directscan.mdx index f1636c60902..5ae5533fda5 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/directscan.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/directscan.mdx @@ -18,7 +18,7 @@ contents to it. If your query is extremely slow, it's possible that's what's happening. You can check which mode is being used by running an `EXPLAIN` on the query and -making sure that the top-most query node is `SeafowlDirectScan`. For example: +making sure that the top-most query node is `DirectScan`. 
For example: ``` explain select count from (select count(*) from tpch_sf_1.lineitem); @@ -26,32 +26,27 @@ explain select count from (select count(*) from tpch_sf_1.lineitem); ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Aggregate (cost=167.52..167.55 rows=1 width=8) -> Append (cost=0.00..165.01 rows=1001 width=0) - -> Seq Scan on lineitem lineitem_1 (cost=0.00..0.00 rows=1 width=0) - -> SeafowlScan on "16529" lineitem_2 (cost=100.00..150.00 rows=1000 width=0) + -> CompatScan on "16529" lineitem_2 (cost=100.00..150.00 rows=1000 width=0) SeafowlPlan: logical_plan TableScan: tpch_sf_1.lineitem projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment] (6 rows) ``` -In this case, the query is executed by PostgreSQL and Seafowl is only involved -when scanning the table (see `SeafowlScan` at the bottom). The fix in this case is -to explicitly name the inner `COUNT(*)` column, since Seafowl gives it an implicit -name `count(*)` whereas PostgreSQL calls it `count`: +In this case, the query is executed by PostgreSQL and Seafowl is only involved when scanning the table (see `CompatScan` at the bottom). +The fix in this case is to explicitly name the inner `COUNT(*)` column, since Seafowl gives it an implicit name `count(*)` whereas PostgreSQL calls it `count`: - -``` +```console edb_admin=> explain select count from (select count(*) as count from tpch_sf_1.lineitem); QUERY PLAN -------------------------------------------------------------------- - SeafowlDirectScan: logical_plan + DirectScan: logical_plan Projection: COUNT(*) AS count Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]] TableScan: tpch_sf_1.lineitem projection=[] (4 rows) ``` -Here, we can see the `SeafowlDirectScan` at the top, which means that Seafowl is -running the entire query. +Here, we can see the `DirectScan` at the top, which means that Seafowl is running the entire query. If you're having trouble rewording your query to make it run fully on Seafowl, open a support ticket. diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx new file mode 100644 index 00000000000..80890fb88bf --- /dev/null +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx @@ -0,0 +1,36 @@ +--- +title: PGAA functions reference +navTitle: Functions +description: Reference for the functions provided by the PGAA extension +--- + + +## `pgaa.create_storage_location` + +### Synopsis + +Creates a new storage location that references an S3-compatible object storage location. 
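+
+As an illustration, the following calls mirror the examples in the External Tables documentation; the storage location names, bucket URIs, and credentials below are placeholders rather than real values:
+
+```sql
+-- A public bucket needs only a name and a URI
+SELECT pgaa.create_storage_location('sample-data', 's3://my-sample-bucket');
+
+-- A private bucket can pass credentials in the options JSON
+SELECT pgaa.create_storage_location(
+    'private-data',
+    's3://my-private-bucket',
+    '{"access_key_id": "my-access-key-id", "secret_access_key": "my-secret-access-key"}'
+);
+```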
+ +### Parameters + +| Parameter | Type | Description | +| --- | --- | --- | +| `name` | `text` | The name of the storage location | +| `uri` | `text` | The URI of the S3-compatible object storage location | +| `options` | `json` | Optional settings for the storage location | + +#### Options + +| Option | Type | Description | +| --- | --- | --- | +| `access_key_id` | `text` | The access key ID for the storage location | +| `secret_access_key` | `text` | The secret access key for the storage location | +| `session_token` | `text` | The session token for the storage location | +| `region` | `text` | The region for the storage location | +| `endpoint` | `text` | The endpoint for the storage location | +| `bucket` | `text` | The bucket for the storage location | +| `use_http` | `boolean` | Use HTTP instead of HTTPS for the storage location | +| `skip_signature` | `boolean` | Skip signature verification for the storage location | + + + diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/index.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/index.mdx index 59f09f55906..bc01edb4a9c 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/index.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/index.mdx @@ -9,18 +9,17 @@ navigation: - datasets - queries - deltatables +- functions - directscan - users - loading_data --- -EDB Postgres® AI Lakehouse is an early product. Eventually, it will support deployment -modes across multiple clouds and on-premises. However, currently it's fairly -limited in terms of where you can deploy it and what data you can query with it. +EDB Postgres® AI Lakehouse is an early product. Eventually, it will support deployment modes across multiple clouds and on-premises. +However, currently it's fairly limited in terms of where you can deploy it and what data you can query with it. -To get the best experience with Lakehouse, you should follow the -"quick start" guide to query benchmarking data. Then you can try loading your -own data with Lakehouse Sync. If you're intrigued, reach out to us and -we can talk more about your use case and potential opportunities. +To get the best experience with Lakehouse, you should follow the [Quick start](../quick_start) to query benchmarking data. +Then you can try loading your own data with Lakehouse Sync or you can bring your own data and use [external tables](../external_tables). +If you're intrigued, reach out to us and we can talk more about your use case and potential opportunities. This section details some of the important things you should know. diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx index f0692ed191f..9ff4a13b112 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx @@ -6,60 +6,50 @@ description: How to load data into Lakehouse ## Loading data with Lakehouse sync -If you have a transactional database running in EDB Postgres AI Cloud Service, -then you can sync tables from this database into a Managed Storage Location. - -A more detailed guide for this is forthcoming. If you want to try it yourself, -see ["How to lakehouse sync"](../how_to_lakehouse_sync). +If you have a transactional database running in EDB Postgres AI Cloud Service, then you can sync tables from this database into a Managed Storage Location. See ["How to lakehouse sync"](../how_to_lakehouse_sync) for further details. 
## Bringing your own data -It's possible to point your Lakehouse node at an arbitrary S3 bucket with Delta -Tables inside of it. However, this comes with some major caveats (which will -eventually be resolved): +It's possible to point your Lakehouse node at an arbitrary S3 bucket with Delta Tables inside of it. +However, this comes with some major caveats (which will eventually be resolved): ### Caveats -* The tables must be stored as [Delta Tables](http://github.com/delta-io/delta/blob/master/PROTOCOL.md) within the location -* A “Delta Table” is a folder of Parquet files along with some JSON metadata. +* The tables must be stored as [Delta Lake Tables](http://github.com/delta-io/delta/blob/master/PROTOCOL.md) within the location +* A "Delta Lake Table" (or "Delta Table") is a folder of Parquet files along with some JSON metadata. * Each table must be prefixed with a `$schema/$table/` where `$schema` and `$table` are valid Postgres identifiers (i.e. < 64 characters) * For example, this is a valid Delta Table that will be recognized by Beacon Analytics: * `my_schema/my_table/{part1.parquet, part2.parquet, _delta_log}` - * These `$schema` and `$table` identifiers will be queryable in the Lakehouse node, e.g.: - * `SELECT count(*) FROM my_schema.my_table;` - * This Delta Table will NOT be recognized by Beacon Analytics (missing a schema): + * These `$schema` and `$table` identifiers will be queryable in the Lakehouse node, e.g.: + * `SELECT count(*) FROM my_schema.my_table;` + * This Delta Table will NOT be recognized by Lakehouse Analytics (missing a schema): * `my_table/{part1.parquet, part2.parquet, _delta_log}` +### Loading data into your bucket +You can use the `lakehouse-loader` utility to export data from an arbitrary Postgres instance to Delta Tables in a storage bucket. +See [Delta Lake Table Tools](delta_tables) for more information on how to obtain and use that utility. +### Querying your own data -### Pointing to your bucket - -By default, each Lakehouse node is configured to point to a bucket with -benchmarking datasets inside. To point it to a different bucket, you can -call the `pgaa.create_storage_location` function: +By default, each Lakehouse node is configured to point to a bucket with benchmarking datasets inside. +To point it to a different bucket, you can call the `pgaa.create_storage_location` function: ```sql SELECT pgaa.create_storage_location.set_bucket_location('mystore', 's3://my-bucket'); ``` -### Querying your own data - -In the example above, after you've called `pgaa.create_stn`, you will be able -to query data in `my_schema.my_table`: +You will then be able create a table that references the Delta Table in the bucket: ```sql CREATE TABLE public.tablename () USING PGAA WITH (pgaa.storage_location = 'mystore', pgaa.path = 'schemaname/tablename'); ``` -Then you can query the table: +Which you can then query: ```sql SELECT COUNT(*) FROM public.tablename; ``` -Note that using an S3 bucket that isn't in the same region as your node -will 1) be slow because of cross-region latencies, and 2) will incur -AWS costs (between $0.01 and $0.02 / GB) for data transfer! Currently these -egress costs are not passed through to you but we do track them and reserve -the right to terminate an instance. +For further details, see the [External Tables](../external_tables) documentation. 
+ diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx index 2298ca2abc9..bd56a928a75 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx @@ -32,5 +32,4 @@ To be precise: * Lakehouse Sync can only sync from source databases in EDB-hosted AWS regions These limitations will be removed as we continue to improve the product. Eventually, -we will support BYOA, as well as Azure and GCP, for all Lakehouse use cases. We -will also add better support for "external" buckets ("bring your own bucket"). +we will support BYOA, as well as Azure and GCP, for all Lakehouse use cases. diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/queries.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/queries.mdx index a6fd3d21adb..42e0135c529 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/queries.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/queries.mdx @@ -3,42 +3,16 @@ title: Queries description: Supported queries in Lakehouse and best practices when composing them --- -In general, **READ ONLY** queries are supported. You cannot write directly to -object storage. This includes all Postgres built-in functions, statements -and types. It also includes any of those provided by EPAS or PGE, depending on -which distribution you choose to deploy. - -In general, you cannot insert, update, delete or otherwise modify data. You -cannot `CREATE TABLE`. You must load data into the bucket out-of-band, either -with your own ETL scripts or with Lakehouse Sync. See "Advanced: Bring Your Own -Data" for more details. (In the future, we will be making this more usable with -a custom DDL). - -One exception is Postgres system tables, such as those used for storing users, -roles, and grants. These tables are stored on the local block device, which is -included in backups and restores. So you can `CREATE USER` or `CREATE ROLE` or -`GRANT USAGE`, and these users/roles/grants will survive restarts and restores. - -## Gotcha: Do not set `search_path` - -Do not set `search_path`. Always reference fully qualified table names. - -Using `search_path` makes Lakehouse fall back to PostgreSQL, -dramatically impacting query performance. To avoid this, qualify all table names -in your query with a schema. - -For example: - -**🚫 Do NOT do this!** - -```sql ---- DO NOT DO THIS -SET search_path = tpch_sf_10; -SELECT COUNT(*) FROM lineitem; -``` - -**✅ Do this instead!** - -```sql -SELECT COUNT(*) FROM tpch_sf_10.lineitem -``` +In general, **READ ONLY** queries are supported. You cannot write directly to object storage. +This includes all Postgres built-in functions, statements and types. +It also includes any of those provided by EPAS or PGE, depending on which distribution you choose to deploy. + +In general, you cannot insert, update, delete or otherwise modify data. +You can use `CREATE TABLE` but only to create a normal Postgres table on the node, bearing in mind that the node is ephemeral and will be destroyed when you terminate it. +You can also `CREATE TABLE... USING PGAA` to create an external table that references Delta Tables in S3-compatible object storage. +You must load data into the bucket out-of-band, either with your own ETL scripts or with Lakehouse Sync. +See [Loading Data](loadingdata) for more details. 
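+
+As a sketch of the two cases above (the storage location name and path are placeholders):
+
+```sql
+-- A normal Postgres table: allowed, but it lives on the node, which is ephemeral
+CREATE TABLE my_local_notes (id bigint, note text);
+
+-- An external table over a Delta Table in object storage: read-only query access
+CREATE TABLE public.my_table () USING PGAA
+    WITH (pgaa.storage_location = 'mystore', pgaa.path = 'my_schema/my_table');
+```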
+ +One exception is Postgres system tables, such as those used for storing users, roles, and grants. +These tables are stored on the local block device, which is included in backups and restores. +So you can `CREATE USER` or `CREATE ROLE` or `GRANT USAGE`, and these users/roles/grants will survive restarts and restores. From 5cce06b4aad950c6020985ba3a82912bb8a23238 Mon Sep 17 00:00:00 2001 From: Dj Walker-Morgan Date: Mon, 4 Nov 2024 11:15:35 +0000 Subject: [PATCH 20/25] Fixes for linkses Signed-off-by: Dj Walker-Morgan --- .../analytics/external_tables.mdx | 6 +-- .../edb-postgres-ai/analytics/quick_start.mdx | 47 ++++++------------- .../{deltatables.mdx => delta_tables.mdx} | 2 + .../analytics/reference/loadingdata.mdx | 28 ++--------- .../reference/providers_and_regions.mdx | 3 +- 5 files changed, 24 insertions(+), 62 deletions(-) rename advocacy_docs/edb-postgres-ai/analytics/reference/{deltatables.mdx => delta_tables.mdx} (82%) diff --git a/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx b/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx index c83b8d9cd86..45f71293c98 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx @@ -13,7 +13,7 @@ External tables allow you to access and query data stored in S3-compatible objec * An EDB Postgres AI account and a Lakehouse node. * An S3-compatible object storage location with data stored as Delta Lake Tables. - * See [Bringing your own data](../loadingdata) for more information on how to prepare your data. + * See [Bringing your own data](reference/loadingdata) for more information on how to prepare your data. * Credentials to access the S3-compatible object storage location, unless it is a public bucket. * These credentials will be stored within the database. We recommend creating a separate user with limited permissions for this purpose. @@ -26,12 +26,12 @@ Using an S3 bucket that isn't in the same region as your node will ## Creating an External Storage Location -The first step is to create an external storage location which references S3-compatible object storage where your data resides. A storage location is an object within the database which you refer to to access the data; each storage location has a name for this purpose. +The first step is to create an external storage location which references S3-compatible object storage where your data resides. A storage location is an object within the database which you refer to to access the data; each storage location has a name for this purpose. Creating a named storage location is performed with SQL by executing the `pgaa.create_storage_location` function. `pgaa` is the name of the extension and namespace that provides the functionality to query external storage locations. The `create_storage_location` function takes a name for the new storage location, and the URI of the S3-compatible object storage location as parameters. -The function optionally can take a third parameter, `options`, which is a JSON object for specifying optional settings, detailed in the [functions reference](reference/functions#create_storage_location). +The function optionally can take a third parameter, `options`, which is a JSON object for specifying optional settings, detailed in the [functions reference](reference/functions#pgaacreate_storage_location). For example, in the options, you can specify the access key ID and secret access key for the storage location to enable access to a private bucket. 
 
 The following example creates an external storage location that references a public S3-compatible object storage location:
diff --git a/advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx b/advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx
index df985fb9289..398ee3c3acb 100644
--- a/advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx
+++ b/advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx
@@ -81,50 +81,33 @@ Persistent data in system tables (users, roles, etc) is stored in an attached
 block storage device and will survive a restart or backup/restore cycle.
 * Only Postgres 16 is supported.
 
-For more notes about supported instance sizes,
-see [Reference - Supported AWS instances](./reference/#supported-aws-instances).
+For more notes about supported instance sizes, see [Reference - Supported AWS instances](./reference/instances).
 
 ## Operating a Lakehouse node
 
 ### Connect to the node
 
-You can connect to the Lakehouse node with any Postgres client, in the same way
-that you connect to any other cluster from EDB Postgres AI Cloud Service
-(formerly known as BigAnimal): navigate to the cluster detail page and copy its
-connection string.
+You can connect to the Lakehouse node with any Postgres client, in the same way that you connect to any other cluster from EDB Postgres AI Cloud Service (formerly known as BigAnimal): navigate to the cluster detail page and copy its connection string.
 
-For example, you might copy the `.pgpass` blob into `~/.pgpass` (making sure to
-replace `$YOUR_PASSWORD` with the password you provided when launching the
-cluster). Then you can copy the connection string and use it as an argument to
-`psql` or `pgcli`.
+For example, you might copy the `.pgpass` blob into `~/.pgpass` (making sure to replace `$YOUR_PASSWORD` with the password you provided when launching the cluster).
+Then you can copy the connection string and use it as an argument to `psql` or `pgcli`.
 
-In general, you should be able to connect to the database with any Postgres
-client. We expect all introspection queries to work, and if you find one that
-doesn't, then that's a bug.
+In general, you should be able to connect to the database with any Postgres client.
+We expect all introspection queries to work, and if you find one that doesn't, then that's a bug.
 
 ### Understand the constraints
 
-* Every cluster uses EPAS or PGE. So expect to see boilerplate tables from those
-flavors in the installation when you connect.
-* Queryable data (like the benchmarking datasets) is stored in object storage
-as Delta Tables. Every cluster comes pre-loaded to point to a storage bucket
-with benchmarking data inside (TPC-H, TPC-DS, Clickbench) at
-scale factors 1 and 10.
+* Every cluster uses EPAS or PGE. So expect to see boilerplate tables from those flavors in the installation when you connect.
+* Queryable data (like the benchmarking datasets) is stored in object storage as Delta Tables. Every cluster comes pre-loaded to point to a storage bucket with benchmarking data inside (TPC-H, TPC-DS, Clickbench) at scale factors 1 and 10.
 * Only AWS is supported at the moment. Bring Your Own Account (BYOA) is not supported.
-* You can deploy a cluster in any region that is activated in
-your EDB Postgres AI Account. Each region has a bucket with a copy of the
-benchmarking data, and so when you launch a cluster, it will use the
-benchmarking data in the location closest to it.
-* The cluster is ephemeral. None of the data is stored on the hard drive, 
-except for data in system tables, e.g. roles and users and grants. 
-If you restart the cluster, or backup the cluster and then restore it, -it will restore these system tables. But the data in object storage will +* You can deploy a cluster in any region that is activated in your EDB Postgres AI Account. Each region has a bucket with a copy of the +benchmarking data, and so when you launch a cluster, it will use the benchmarking data in the location closest to it. +* The cluster is ephemeral. None of the data is stored on the hard drive, except for data in system tables, e.g. roles and users and grants. +If you restart the cluster, or backup the cluster and then restore it, it will restore these system tables. But the data in object storage will remain untouched. -* The cluster supports READ ONLY queries of the data in object -storage (but it supports write queries to system tables for creating users, +* The cluster supports READ ONLY queries of the data in object storage (but it supports write queries to system tables for creating users, etc.). You cannot write directly to object storage. You cannot create new tables. -* If you want to load your own data into object storage, -see [Reference - Bring your own data](./reference/#advanced-bring-your-own-data). +* If you want to load your own data into object storage, see [Reference - Bring your own data](reference/loadingdata). ## Inspect the benchmark datasets @@ -140,7 +123,7 @@ The available benchmarking datasets are: * 1 Billion Row Challenge For more details on benchmark datasets, -see Reference - Available benchmarking datasets](./reference/#available-benchmarking-datasets). +see Reference - Available benchmarking datasets](./reference/datasets). ## Query the benchmark datasets diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/delta_tables.mdx similarity index 82% rename from advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx rename to advocacy_docs/edb-postgres-ai/analytics/reference/delta_tables.mdx index 06ad28b7279..400efe5ee6d 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/deltatables.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/delta_tables.mdx @@ -26,3 +26,5 @@ export AWS_SECRET_ACCESS_KEY="..." ``` This will export the data from the `some_table` table in the `test-db` database to a Delta Table in the `my_schema/my_table` path in the `my-bucket` bucket. + +You can now query this table in the Lakehouse node by creating an external table that references the Delta Table in the `my_schema/my_table` path. See [External Tables](../external_tables) for the details on how to do that. diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx index 9ff4a13b112..035653f5a00 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx @@ -15,14 +15,14 @@ However, this comes with some major caveats (which will eventually be resolved): ### Caveats -* The tables must be stored as [Delta Lake Tables](http://github.com/delta-io/delta/blob/master/PROTOCOL.md) within the location +* The tables must be stored as [Delta Lake Tables](http://github.com/delta-io/delta/blob/master/PROTOCOL.md) within the location. * A "Delta Lake Table" (or "Delta Table") is a folder of Parquet files along with some JSON metadata. 
* Each table must be prefixed with a `$schema/$table/` where `$schema` and `$table` are valid Postgres identifiers (i.e. < 64 characters) * For example, this is a valid Delta Table that will be recognized by Beacon Analytics: * `my_schema/my_table/{part1.parquet, part2.parquet, _delta_log}` - * These `$schema` and `$table` identifiers will be queryable in the Lakehouse node, e.g.: + * These `$schema` and `$table` identifiers will be queryable in the Postgres Lakehouse node, e.g.: * `SELECT count(*) FROM my_schema.my_table;` - * This Delta Table will NOT be recognized by Lakehouse Analytics (missing a schema): + * This Delta Table will NOT be recognized by Postgres Lakehouse node (missing a schema): * `my_table/{part1.parquet, part2.parquet, _delta_log}` ### Loading data into your bucket @@ -30,26 +30,4 @@ However, this comes with some major caveats (which will eventually be resolved): You can use the `lakehouse-loader` utility to export data from an arbitrary Postgres instance to Delta Tables in a storage bucket. See [Delta Lake Table Tools](delta_tables) for more information on how to obtain and use that utility. -### Querying your own data - -By default, each Lakehouse node is configured to point to a bucket with benchmarking datasets inside. -To point it to a different bucket, you can call the `pgaa.create_storage_location` function: - -```sql -SELECT pgaa.create_storage_location.set_bucket_location('mystore', 's3://my-bucket'); -``` - -You will then be able create a table that references the Delta Table in the bucket: - -```sql -CREATE TABLE public.tablename () USING PGAA WITH (pgaa.storage_location = 'mystore', pgaa.path = 'schemaname/tablename'); -``` - -Which you can then query: - -```sql -SELECT COUNT(*) FROM public.tablename; -``` - For further details, see the [External Tables](../external_tables) documentation. - diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx index bd56a928a75..e3204279b7e 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/providers_and_regions.mdx @@ -31,5 +31,4 @@ To be precise: * Managed Storage Locations can only be created in EDB-hosted AWS regions * Lakehouse Sync can only sync from source databases in EDB-hosted AWS regions -These limitations will be removed as we continue to improve the product. Eventually, -we will support BYOA, as well as Azure and GCP, for all Lakehouse use cases. +These limitations will be removed as we continue to improve the product. Eventually, we will support BYOA, as well as Azure and GCP, for all Lakehouse use cases. 
From 84cb51aad769e2839aedee2652b4e210a5b728eb Mon Sep 17 00:00:00 2001 From: Dj Walker-Morgan Date: Mon, 4 Nov 2024 14:17:44 +0000 Subject: [PATCH 21/25] Case changes Signed-off-by: Dj Walker-Morgan --- advocacy_docs/edb-postgres-ai/analytics/reference/datasets.mdx | 2 +- .../edb-postgres-ai/analytics/reference/delta_tables.mdx | 2 +- advocacy_docs/edb-postgres-ai/analytics/reference/instances.mdx | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/datasets.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/datasets.mdx index d4b95cf309f..4bad22d7ada 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/datasets.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/datasets.mdx @@ -1,5 +1,5 @@ --- -title: Benchmarking Datasets +title: Benchmarking datasets description: Benchmarking datasets available for Lakehouse --- diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/delta_tables.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/delta_tables.mdx index 400efe5ee6d..8e67938baca 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/delta_tables.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/delta_tables.mdx @@ -1,6 +1,6 @@ --- title: Delta Lake Table tools -navTitle: Delta Table tools +navTitle: Delta Lake Table tools description: Tools for working with Delta Lake Tables --- diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/instances.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/instances.mdx index c5a12cef84d..78801d6cb26 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/instances.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/instances.mdx @@ -1,5 +1,5 @@ --- -title: Supported AWS Instances +title: Supported AWS instances description: Supported AWS instances for Lakehouse --- From c86608fb695c9141d72e4198ad7592910ce637e0 Mon Sep 17 00:00:00 2001 From: Dj Walker-Morgan <126472455+djw-m@users.noreply.github.com> Date: Thu, 5 Dec 2024 10:22:34 +0000 Subject: [PATCH 22/25] Update advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx Co-authored-by: Artjoms Iskovs --- advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx b/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx index 45f71293c98..a5c287e359e 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx @@ -55,7 +55,7 @@ The following example creates an external table that references a Delta Lake Tab CREATE TABLE public.customer () USING PGAA WITH (pgaa.storage_location = 'sample-data', pgaa.path = 'tpch_sf_1/customer'); ``` -Note that the schema is not defined in the `CREATE TABLE` statement. The pgaa extension expects the schema to be defined in the storage location, and the schema itself is derived from the schema stored at the path specified in the `pgaa.path` option. The pgaa extension will infer the best Postgres-equivelant data types for the columns in the Delta Table. +Note that the schema is not defined in the `CREATE TABLE` statement. The pgaa extension expects the schema to be defined in the storage location, and the schema itself is derived from the schema stored at the path specified in the `pgaa.path` option. The pgaa extension will infer the best Postgres-equivalent data types for the columns in the Delta Table. 
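+
+If you want to confirm what was inferred, ordinary Postgres introspection should work once the table exists. For example, assuming the `public.customer` table created above:
+
+```sql
+SELECT column_name, data_type
+FROM information_schema.columns
+WHERE table_schema = 'public' AND table_name = 'customer'
+ORDER BY ordinal_position;
+```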
## Querying an External Table From 6c45b5ff1dc20459677b4a80bdc5905b99ea0675 Mon Sep 17 00:00:00 2001 From: Dj Walker-Morgan <126472455+djw-m@users.noreply.github.com> Date: Thu, 5 Dec 2024 10:23:03 +0000 Subject: [PATCH 23/25] Update advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx Co-authored-by: Artjoms Iskovs --- advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx b/advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx index 398ee3c3acb..705927681ac 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx @@ -98,7 +98,7 @@ We expect all introspection queries to work, and if you find one that doesn't, t ### Understand the constraints * Every cluster uses EPAS or PGE. So expect to see boilerplate tables from those flavors in the installation when you connect. -* Queryable data (like the benchmarking datasets) is stored in object storage as Delta Tables. Every cluster comes pre-loaded to point to a storage bucket with benchmarking data inside (TPC-H, TPC-DS, Clickbench) at scale factors 1 and 10. +* Queryable data (like the benchmarking datasets) is stored in object storage as Delta Tables. Every cluster comes pre-loaded to point to a storage bucket with benchmarking data inside (TPC-H, TPC-DS, Clickbench) at scale factors from 1 to 1000. * Only AWS is supported at the moment. Bring Your Own Account (BYOA) is not supported. * You can deploy a cluster in any region that is activated in your EDB Postgres AI Account. Each region has a bucket with a copy of the benchmarking data, and so when you launch a cluster, it will use the benchmarking data in the location closest to it. From 50bd15bb568a482c1ba6a09067da2d0b2fe3adc7 Mon Sep 17 00:00:00 2001 From: Dj Walker-Morgan <126472455+djw-m@users.noreply.github.com> Date: Thu, 5 Dec 2024 10:23:33 +0000 Subject: [PATCH 24/25] Update advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx Co-authored-by: Artjoms Iskovs --- .../edb-postgres-ai/analytics/reference/loadingdata.mdx | 7 ------- 1 file changed, 7 deletions(-) diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx index 035653f5a00..06c67a43c64 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx @@ -17,13 +17,6 @@ However, this comes with some major caveats (which will eventually be resolved): * The tables must be stored as [Delta Lake Tables](http://github.com/delta-io/delta/blob/master/PROTOCOL.md) within the location. * A "Delta Lake Table" (or "Delta Table") is a folder of Parquet files along with some JSON metadata. -* Each table must be prefixed with a `$schema/$table/` where `$schema` and `$table` are valid Postgres identifiers (i.e. 
< 64 characters) - * For example, this is a valid Delta Table that will be recognized by Beacon Analytics: - * `my_schema/my_table/{part1.parquet, part2.parquet, _delta_log}` - * These `$schema` and `$table` identifiers will be queryable in the Postgres Lakehouse node, e.g.: - * `SELECT count(*) FROM my_schema.my_table;` - * This Delta Table will NOT be recognized by Postgres Lakehouse node (missing a schema): - * `my_table/{part1.parquet, part2.parquet, _delta_log}` ### Loading data into your bucket From 68b8ee96bb9b50acb028b292f8d8c58daa70b281 Mon Sep 17 00:00:00 2001 From: Dj Walker-Morgan <126472455+djw-m@users.noreply.github.com> Date: Thu, 5 Dec 2024 10:23:50 +0000 Subject: [PATCH 25/25] Update advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx Co-authored-by: Artjoms Iskovs --- advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx b/advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx index 80890fb88bf..4e3c0f68c67 100644 --- a/advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx +++ b/advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx @@ -16,7 +16,7 @@ Creates a new storage location that references an S3-compatible object storage l | Parameter | Type | Description | | --- | --- | --- | | `name` | `text` | The name of the storage location | -| `uri` | `text` | The URI of the S3-compatible object storage location | +| `uri` | `text` | The URI of the S3-compatible object storage location, for example, `s3://bucket-name` or `s3://bucket-name/prefix` | | `options` | `json` | Optional settings for the storage location | #### Options