Merge pull request 2i2c-org#3992 from sgibson91/billing/transform-cost-tables

Automate cloud billing CSV transformation for dedicated clusters on GCP and AWS
sgibson91 authored May 15, 2024
2 parents 1aa6675 + 96f4dfd commit 22120f3
Showing 5 changed files with 230 additions and 30 deletions.
74 changes: 46 additions & 28 deletions deployer/README.md
@@ -34,8 +34,14 @@ The deployer has the following directory structure:
```

### The `cli_app.py` file

The `cli_app.py` file contains the main `deployer` typer app, together with the main sub-apps "attached" to it, each corresponding to a `deployer` sub-command. These apps are used throughout the codebase.
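
For example, a sub-app is attached to the main app with typer's `add_typer` (a minimal sketch of the pattern, using the `transform` sub-app added in this commit; see the `cli_app.py` diff further down):

```python
import typer

# The main `deployer` app and a sub-app, each a typer application
app = typer.Typer(pretty_exceptions_show_locals=False)
transform_app = typer.Typer(pretty_exceptions_show_locals=False)

# "Attach" the sub-app so it is reachable as `deployer transform ...`
app.add_typer(
    transform_app,
    name="transform",
    help="Programmatically transform datasets, such as cost tables for billing purposes.",
)
```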

### The `__main__.py` file

The `__main__.py` file is the main file that gets executed when the deployer is called.
If you are adding any sub-command functions, you **must** import them in this file for them to be picked up by the package.
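
For example, the `transform cost-table` command added in this commit is registered with a single import, mirrored in the `__main__.py` diff below:

```python
import deployer.commands.transform.cost_table  # noqa: F401
```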

### The `infra_components` directory

This is the directory where the files that define the `Hub` and `Cluster` classes are stored. These files are imported and used throughout the deployer's codebase to emulate these objects programmatically.
@@ -129,34 +135,28 @@ This section describes some of the sub-commands the `deployer` can carry out.
**Command line usage:**

```bash

Usage: deployer [OPTIONS] COMMAND [ARGS]...

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion    Install completion for the current shell.                                                 │
│ --show-completion       Show completion for the current shell, to copy it or customize the installation.          │
│ --help                  Show this message and exit.                                                               │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ cilogon-client          Manage cilogon clients for hubs' authentication.                                          │
│ debug                   Debug issues by accessing different components and their logs                             │
│ decrypt-age             Decrypt secrets sent to `[email protected]` via `age`                                       │
│ deploy                  Deploy one or more hubs in a given cluster                                                │
│ deploy-support          Deploy support components to a cluster                                                    │
│ exec                    Execute a shell in various parts of the infra. It can be used for poking around, or       │
│                         debugging issues.                                                                         │
│ generate                Generate various types of assets. It currently supports generating files related to      │
│                         billing, new dedicated clusters, helm upgrade strategies and resource allocation.         │
│ grafana                 Manages Grafana related workflows.                                                        │
│ run-hub-health-check    Run a health check on a given hub on a given cluster. Optionally check scaling of dask    │
│                         workers if the hub is a daskhub.                                                          │
│ transform               Programmatically transform datasets, such as cost tables for billing purposes.            │
│ use-cluster-credentials Pop a new shell or execute a command after authenticating to the given cluster using the  │
│                         deployer's credentials                                                                    │
│ validate                Validate configuration files such as helm chart values and cluster.yaml files.            │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

### Standalone sub-commands related to deployment
@@ -340,6 +340,24 @@ Gets the clusters that are in the infrastructure repository but are NOT registered
##### `grafana central-ds get-rm-candidates`
Gets the list of datasources that are registered in the central Grafana but are NOT in the list of clusters in the infrastructure repository. Usually this happens when a cluster was decommissioned, but its prometheus server was not removed from the datasources of the central 2i2c Grafana. This list can then be used to know which datasources to remove.

### The `transform` sub-command

This sub-command can be used to transform various datasets.

#### `transform cost-table`

This sub-command helps engineers transform cost tables generated by cloud vendors into the format required by our fiscal sponsor in order for them to bill our communities.
The transformation is automated to avoid copy-paste errors when moving data from one CSV file to another.
The transformation runs locally and writes a new CSV file to the working directory, which then needs to be handed over manually to CS&S staff.

##### `transform cost-table aws`

Transforms a cost table generated in the AWS UI into the correct format.
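
A typical invocation might look like the following (the filename is illustrative; `--output-path` is optional and, if omitted, a name is constructed from the months present in the data):

```bash
deployer transform cost-table aws monthly-costs-by-linked-account.csv
```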

##### `transform cost-table gcp`

Transforms a cost table generated in the GCP UI into the correct format.
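
For example (illustrative filenames, with the optional output path given explicitly):

```bash
deployer transform cost-table gcp gcp-cost-table.csv --output-path 2i2c_dedicated_cluster_billing_GCP_2024-01_2024-03.csv
```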

### The `validate` sub-command

This sub-command is used to validate the values files for each of our hubs against their helm chart's values schema.
1 change: 1 addition & 0 deletions deployer/__main__.py
@@ -20,6 +20,7 @@
import deployer.commands.grafana.central_grafana # noqa: F401
import deployer.commands.grafana.deploy_dashboards # noqa: F401
import deployer.commands.grafana.tokens # noqa: F401
import deployer.commands.transform.cost_table # noqa: F401
import deployer.commands.validate.config # noqa: F401
import deployer.keys.decrypt_age # noqa: F401

6 changes: 6 additions & 0 deletions deployer/cli_app.py
@@ -18,6 +18,7 @@
exec_app = typer.Typer(pretty_exceptions_show_locals=False)
grafana_app = typer.Typer(pretty_exceptions_show_locals=False)
validate_app = typer.Typer(pretty_exceptions_show_locals=False)
transform_app = typer.Typer(pretty_exceptions_show_locals=False)

app.add_typer(
    generate_app,
@@ -51,3 +52,8 @@
    name="validate",
    help="Validate configuration files such as helm chart values and cluster.yaml files.",
)
app.add_typer(
    transform_app,
    name="transform",
    help="Programmatically transform datasets, such as cost tables for billing purposes.",
)
174 changes: 174 additions & 0 deletions deployer/commands/transform/cost_table.py
@@ -0,0 +1,174 @@
import re
from pathlib import Path

import pandas as pd
import typer

from deployer.cli_app import transform_app
from deployer.utils.rendering import print_colour

# Creates a new typer application, which is then nested as a sub-command named
# "cost-table" under the `transform` sub-command of the deployer.
cost_table_app = typer.Typer(pretty_exceptions_show_locals=False)
transform_app.add_typer(
    cost_table_app,
    name="cost-table",
    help="Transform the cost table from a cloud provider when completing billing cycles",
)


@cost_table_app.command()
def aws(
    input_path: Path = typer.Argument(
        ..., help="The file path to the cost table downloaded from the AWS UI"
    ),
    output_path: Path = typer.Option(
        None,
        help="(Optional) The path to write the output CSV to. If None, one will be constructed.",
    ),
):
"""
Ingests a CSV cost table generated via the AWS UI and performs a transformation.
We assume the input CSV has a column for each linked account and rows for:
- the linked account ID and name
- a monthly total for each month selected
- a linked account total across the month range
We aim to have an output CSV file with columns:
- linked account name
- start-month
- ...
- end-month
- linked account total
"""
    # Read the CSV file into a pandas dataframe. Skip the first row since it
    # contains numerical project IDs - the project names begin on the second row.
    # We conditionally rename the column names. If '($)' is present in a column
    # name, we assume it is a linked account name: we strip off the '($)' suffix
    # and any leading/trailing whitespace, and convert to lower case, leaving
    # just the account name (AWS permits whitespace within account names, which
    # we preserve).
    # Otherwise (i.e. not an account name), we additionally replace any
    # whitespace with underscores for easier data cleaning in pandas.
    df = pd.read_csv(
        input_path,
        skiprows=1,
    ).rename(
        columns=lambda col: (
            col.lower().strip("($)").strip()
            if "($)" in col
            else col.strip().lower().replace(" ", "_")
        )
    )
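
    # For illustration (hypothetical column names): "My Account ($)" becomes
    # "my account", while "Linked account name" becomes "linked_account_name".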

    # Ensure values of the linked_account_name column are lower case and any
    # whitespace is replaced with underscores.
    # Note that because AWS outputs the linked account names as a row, this
    # column name is misleading - we are not affecting linked account names here.
    df["linked_account_name"] = df["linked_account_name"].apply(
        lambda val: val.strip().lower().replace(" ", "_")
    )

    # Set the linked_account_name column as the index
    df.set_index("linked_account_name", drop=True, inplace=True)

    # Drop the 'total costs' column. This column is the total across all linked
    # accounts, and hence not necessary. We use the linked_account_total column
    # for the total per linked account.
    df.drop("total costs", axis=1, inplace=True)

    # Transpose the dataframe
    df = df.T

    # Sort the columns
    df = df.reindex(sorted(df.columns), axis=1)

    # Sort the account names in alphabetical order
    df.sort_index(inplace=True)
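
    # At this point the dataframe has one row per (lower-cased) account name and
    # one column per month, with the linked_account_total column sorted to the
    # end, e.g. (hypothetical values):
    #
    #              2024-01-01  2024-02-01  linked_account_total
    #   account a        10.0        12.0                  22.0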

    if output_path is None:
        # Find all the column names that match the regular expression
        # `[0-9]*-[0-9]*-[0-9]*`, i.e. the month columns
        months = [
            col
            for col in df.columns
            if re.match("[0-9]*-[0-9]*-[0-9]*", col) is not None
        ]

        # Construct output filename
        output_path = Path(
            f"2i2c_dedicated_cluster_billing_AWS_{months[0]}_{months[-1]}.csv"
        )

    print_colour(f"Writing output CSV to: {output_path}")

    # Save CSV file
    df.to_csv(output_path, index_label="project_name")


@cost_table_app.command()
def gcp(
    input_path: Path = typer.Argument(
        ..., help="The file path to the cost table downloaded from the GCP UI"
    ),
    output_path: Path = typer.Option(
        None,
        help="(Optional) The path to write the output CSV to. If None, one will be constructed.",
    ),
):
"""
Ingests a CSV cost table generated via the GCP UI and performs a transformation.
We assume the input CSV file has the following columns (and are subject to
changes by GCP):
- Project name
- Month
- Subtotal ($)
We aim to have an output CSV file with columns:
- Project name
- start-month
- ...
- end-month
- Total
where:
- start-month...end-month are unique values from the Month column in the
input file
- Project name are unique entries from the input file
- The Total column is the sum across all month columns for each project
"""
    # Read the CSV file into a pandas dataframe, selecting only the relevant
    # columns from the input file: [Project name, Month, Subtotal ($)]. Rename
    # the columns so that they are all lower case and any whitespace in column
    # names is replaced with an underscore.
    df = pd.read_csv(
        input_path, usecols=["Month", "Project name", "Subtotal ($)"]
    ).rename(columns=lambda col: col.strip().lower().replace(" ", "_"))

    # Aggregate and pivot the dataframe into the desired format
    transformed_df = (
        # Group the data by project name and month, and sum the subtotals
        df.groupby(["project_name", "month"]).sum()
        # Pivot the data so project name is the index and months are columns
        .pivot_table(index="project_name", columns="month", values="subtotal_($)")
        # Create a new column containing the total across all the months
        .assign(total=lambda df: df.sum(axis=1))
    )
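
    # For illustration (hypothetical values): input rows
    #   ("Project A", "2024-01", 10.0) and ("Project A", "2024-02", 12.0)
    # become a single row indexed by "Project A" with columns
    #   2024-01 -> 10.0, 2024-02 -> 12.0, total -> 22.0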

    if output_path is None:
        # Find all the column names that match the regular expression
        # `[0-9]*-[0-9]*`, i.e. the month columns
        months = [
            col
            for col in transformed_df.columns
            if re.match("[0-9]*-[0-9]*", col) is not None
        ]

        # Construct output filename
        output_path = Path(
            f"2i2c_dedicated_cluster_billing_GCP_{months[0]}_{months[-1]}.csv"
        )

    print_colour(f"Writing output CSV to: {output_path}")

    # Save CSV file
    transformed_df.to_csv(output_path)
5 changes: 3 additions & 2 deletions docs/howto/bill.md
@@ -56,8 +56,9 @@ AWS management account. If a future cluster deviates from this, you can tell by
4. Visit the "Monthly Costs By Linked Account" report ([direct link]) via "Billing and Cost Management" -> "Cost Explorer Saved Reports"
5. On the right sidebar under "Time -> Date Range", select all the completed months we want to get data for
6. On the right sidebar under "Time -> Granularity", ensure it is set to "Monthly"
7. On the right sidebar under "Group by -> Dimension", select "Linked account"
8. Click the 'Download as CSV' button
9. Copy AWS costs

The CSV file has rows for each month, and columns for each project. Copy it
into the spreadsheet, making sure the rows and columns both match what is
