Merge pull request 2i2c-org#3992 from sgibson91/billing/transform-cost-tables
Automate cloud billing CSV transformation for dedicated clusters on GCP and AWS
Showing 5 changed files with 230 additions and 30 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,8 +34,14 @@ The deployer has the following directory structure: | |
``` | ||
|
||
### The `cli_app.py` file

The `cli_app.py` file contains the main `deployer` typer app and all of the main sub-apps "attached" to it, each corresponding to a `deployer` sub-command. These apps are used throughout the codebase.

### The `__main__.py` file

The `__main__.py` file is the main file that gets executed when the deployer is called.
If you are adding any sub-command functions, you **must** import them in this file for them to be picked up by the package.
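
The import requirement follows from how decorator-based CLI frameworks such as typer register commands: the decorator only runs when the module that defines the command is imported. The sketch below illustrates that mechanism with a plain dictionary as a stand-in for the typer app. The names `COMMANDS`, `command`, and `run` are hypothetical and are not part of the deployer.

```python
# Hypothetical stand-in for a typer app: a registry that decorators fill in.
COMMANDS = {}


def command(func):
    """Register `func` under a CLI-style name, as a framework decorator would."""
    COMMANDS[func.__name__.replace("_", "-")] = func
    return func


# In the deployer, a definition like this would live in its own module.
# The decorator only runs when that module is imported, which is why the
# module has to be imported from `__main__.py`.
@command
def cost_table():
    return "transforming cost table"


def run(name):
    """Look up a registered command by name and call it."""
    if name not in COMMANDS:
        raise SystemExit(f"No such command: {name}")
    return COMMANDS[name]()
```

If the module containing `cost_table` were never imported, `COMMANDS` would stay empty and `run("cost-table")` would exit with an error.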

### The `infra_components` directory

This is the directory where the files that define the `Hub` and `Cluster` classes are stored. These files are imported and used throughout the deployer's codebase to emulate these objects programmatically.
|
@@ -129,34 +135,28 @@ This section descripts some of the subcommands the `deployer` can carry out. | |
**Command line usage:** | ||
|
||
```bash | ||
|
||
Usage: deployer [OPTIONS] COMMAND [ARGS]... | ||
|
||
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ | ||
│ --install-completion [bash|zsh|fish|powershell|pwsh] Install completion for the specified shell. │ | ||
│ [default: None] │ | ||
│ --show-completion [bash|zsh|fish|powershell|pwsh] Show completion for the specified shell, to copy it or │ | ||
│ customize the installation. │ | ||
│ [default: None] │ | ||
│ --help Show this message and exit. │ | ||
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ | ||
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────────────────────────╮ | ||
│ cilogon-client Manage cilogon clients for hubs' authentication. │ | ||
│ debug Debug issues by accessing different components and their logs │ | ||
│ decrypt-age Decrypt secrets sent to `[email protected]` via `age` │ | ||
│ deploy Deploy one or more hubs in a given cluster │ | ||
│ deploy-support Deploy support components to a cluster │ | ||
│ exec Execute a shell in various parts of the infra. It can be used for poking around, or │ | ||
│ debugging issues. │ | ||
│ generate Generate various types of assets. It currently supports generating files related to │ | ||
│ billing, new dedicated clusters, helm upgrade strategies and resource allocation. │ | ||
│ grafana Manages Grafana related workflows. │ | ||
│ run-hub-health-check Run a health check on a given hub on a given cluster. Optionally check scaling of dask │ | ||
│ workers if the hub is a daskhub. │ | ||
│ use-cluster-credentials Pop a new shell or execute a command after authenticating to the given cluster using the │ | ||
│ deployer's credentials │ | ||
│ validate Validate configuration files such as helm chart values and cluster.yaml files. │ | ||
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ | ||
Usage: deployer [OPTIONS] COMMAND [ARGS]... | ||
|
||
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ | ||
│ --install-completion Install completion for the current shell. │ | ||
│ --show-completion Show completion for the current shell, to copy it or customize the installation. │ | ||
│ --help Show this message and exit. │ | ||
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ | ||
╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ | ||
│ cilogon-client Manage cilogon clients for hubs' authentication. │ | ||
│ debug Debug issues by accessing different components and their logs │ | ||
│ decrypt-age Decrypt secrets sent to `[email protected]` via `age` │ | ||
│ deploy Deploy one or more hubs in a given cluster │ | ||
│ deploy-support Deploy support components to a cluster │ | ||
│ exec Execute a shell in various parts of the infra. It can be used for poking around, or debugging issues. │ | ||
│ generate Generate various types of assets. It currently supports generating files related to billing, new dedicated clusters, helm upgrade strategies and resource │ | ||
│ allocation. │ | ||
│ grafana Manages Grafana related workflows. │ | ||
│ run-hub-health-check Run a health check on a given hub on a given cluster. Optionally check scaling of dask workers if the hub is a daskhub. │ | ||
│ transform Programmatically transform datasets, such as cost tables for billing purposes. │ | ||
│ use-cluster-credentials Pop a new shell or execute a command after authenticating to the given cluster using the deployer's credentials │ | ||
│ validate Validate configuration files such as helm chart values and cluster.yaml files. │ | ||
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ | ||
``` | ||
|
||
### Standalone sub-commands related to deployment | ||
|
@@ -340,6 +340,24 @@ Gets the clusters that are in the infrastructure repository but are NOT register | |
##### `grafana central-ds get-rm-candidates`
Gets the list of datasources that are registered in central grafana but are NOT in the list of clusters in the infrastructure repository. Usually this happens when a cluster was decommissioned, but its prometheus server was not removed from the datasources of the central 2i2c Grafana. This list can then be used to know which datasources to remove.

### The `transform` sub-command

This sub-command can be used to transform various datasets.

#### `transform cost-table`

This sub-command helps engineers transform cost tables generated by cloud vendors into the format required by our fiscal sponsor so that they can bill our communities.
The transformation is automated to avoid copy-paste errors from one CSV file to another.
It runs locally and writes a new CSV file to the local directory, which then needs to be manually handed over to CS&S staff.

##### `transform cost-table aws`

Transforms a cost table generated in the AWS UI into the correct format.
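
As a rough illustration of the reshaping involved, the sketch below transposes a small AWS-style table so that each linked account becomes a row and each month becomes a column. It uses only the standard library rather than pandas, and the sample data, `SAMPLE_AWS`, and `transpose_aws` are invented for this example. The input layout is an assumption based on the description in the command's docstring.

```python
import csv
import io

# Invented sample: one column per linked account, one row per month.
SAMPLE_AWS = "\n".join(
    [
        "Linked account,111111111111,222222222222,",
        "Linked account name,Hub A ($),Hub B ($),Total costs ($)",
        "2024-01-01,10.0,20.0,30.0",
        "2024-02-01,15.0,25.0,40.0",
        "Linked account total,25.0,45.0,70.0",
    ]
)


def transpose_aws(text):
    """Return rows with one linked account per row and one month per column."""
    # Skip the first row, which holds the numerical account IDs.
    rows = list(csv.reader(io.StringIO(text)))[1:]
    header, body = rows[0], rows[1:]
    # Account names: lower-case, with the "($)" marker removed.
    accounts = [h.lower().replace("($)", "").strip() for h in header[1:]]
    # Row labels (months and the per-account total) become column names.
    labels = [r[0].strip().lower().replace(" ", "_") for r in body]
    table = {}
    for idx, account in enumerate(accounts, start=1):
        if account == "total costs":  # total across all accounts, not needed
            continue
        table[account] = {label: row[idx] for label, row in zip(labels, body)}
    columns = sorted(labels)
    result = [["project_name"] + columns]
    for account in sorted(table):
        result.append([account] + [table[account][c] for c in columns])
    return result
```

Calling `transpose_aws(SAMPLE_AWS)` yields a header row followed by one row each for `hub a` and `hub b`, with the month columns in sorted order and the per-account total last.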

##### `transform cost-table gcp`

Transforms a cost table generated in the GCP UI into the correct format.
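
The GCP transformation is essentially a pivot from a long table (one row per project, month, and line item) to a wide one (one row per project, one column per month, plus a total). The sketch below shows that idea with the standard library instead of pandas. The sample data, `SAMPLE_GCP`, and `pivot_gcp` are invented for illustration, and the three input column names are assumed from the command's docstring.

```python
import csv
import io
from collections import defaultdict

# Invented sample in the assumed long format.
SAMPLE_GCP = "\n".join(
    [
        "Project name,Month,Subtotal ($)",
        "hub-a,2024-01,10.0",
        "hub-a,2024-01,2.5",
        "hub-a,2024-02,15.0",
        "hub-b,2024-01,20.0",
    ]
)


def pivot_gcp(text):
    """Sum subtotals per project and month, then lay months out as columns."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["Project name"]][row["Month"]] += float(row["Subtotal ($)"])
    # Every month seen in the input becomes a column, in sorted order.
    months = sorted({m for per_month in totals.values() for m in per_month})
    result = [["project_name"] + months + ["total"]]
    for project in sorted(totals):
        # Assumption of this sketch: a month with no charges is shown as 0.0.
        values = [totals[project].get(m, 0.0) for m in months]
        result.append([project] + values + [sum(values)])
    return result
```

One difference from a pandas pivot is worth noting: this sketch fills months without charges with `0.0`, whereas a pandas pivot table would normally leave such cells empty.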

### The `validate` sub-command

This function is used to validate the values files for each of our hubs against their helm chart's values schema.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
import re
from pathlib import Path

import pandas as pd
import typer

from deployer.cli_app import transform_app
from deployer.utils.rendering import print_colour

# Creates a new typer application, which is then nested as a sub-command named
# "cost-table" under the `transform` sub-command of the deployer.
cost_table_app = typer.Typer(pretty_exceptions_show_locals=False)
transform_app.add_typer(
    cost_table_app,
    name="cost-table",
    help="Transform the cost table from a cloud provider when completing billing cycles",
)

@cost_table_app.command()
def aws(
    input_path: Path = typer.Argument(
        ..., help="The file path to the cost table downloaded from the AWS UI"
    ),
    output_path: Path = typer.Option(
        None,
        help="(Optional) The path to write the output CSV to. If None, one will be constructed.",
    ),
):
    """
    Ingests a CSV cost table generated via the AWS UI and performs a transformation.
    We assume the input CSV has a column for each linked account and rows for:
    - the linked account ID and name
    - a monthly total for each month selected
    - a linked account total across the month range
    We aim to have an output CSV file with columns:
    - linked account name
    - start-month
    - ...
    - end-month
    - linked account total
    """
    # Read the CSV file into a pandas dataframe. Skip the first row as this
    # contains numerical project IDs - the project names begin on the second row.
    # We conditionally rename the column names. If '($)' is present in the column
    # name, we assume this is a linked account name: we strip off the '($)' and any
    # leading/trailing whitespace and convert to lower case so we have just the
    # account names and allow for AWS permitting whitespace in them.
    # Otherwise (i.e. not an account name), we also replace any whitespace with
    # underscores for easier data cleaning in pandas.
    df = pd.read_csv(
        input_path,
        skiprows=1,
    ).rename(
        columns=lambda col: (
            col.lower().strip("($)").strip()
            if "($)" in col
            else col.strip().lower().replace(" ", "_")
        )
    )

    # Ensure values of the linked_account_name column are lower case and any
    # whitespace is replaced with underscores.
    # Note that because AWS outputs the linked account names as a row, this
    # column name is misleading - we are not affecting linked account names here.
    df["linked_account_name"] = df["linked_account_name"].apply(
        lambda val: val.strip().lower().replace(" ", "_")
    )

    # Set the linked_account_name column as the index
    df.set_index("linked_account_name", drop=True, inplace=True)

    # Drop the 'total costs' column. This column is the total across all linked
    # accounts, and hence not necessary. We use the linked_account_total column
    # for the total per linked account.
    df.drop("total costs", axis=1, inplace=True)

    # Transpose the dataframe
    df = df.T

    # Sort the columns
    df = df.reindex(sorted(df.columns), axis=1)

    # Sort the account names in alphabetical order
    df.sort_index(inplace=True)

    if output_path is None:
        # Find all the column names that match the regex `[0-9]*-[0-9]*-[0-9]*`
        months = [
            col
            for col in df.columns
            if re.match("[0-9]*-[0-9]*-[0-9]*", col) is not None
        ]

        # Construct output filename
        output_path = Path(
            f"2i2c_dedicated_cluster_billing_AWS_{months[0]}_{months[-1]}.csv"
        )

    print_colour(f"Writing output CSV to: {output_path}")

    # Save CSV file
    df.to_csv(output_path, index_label="project_name")

@cost_table_app.command()
def gcp(
    input_path: Path = typer.Argument(
        ..., help="The file path to the cost table downloaded from the GCP UI"
    ),
    output_path: Path = typer.Option(
        None,
        help="(Optional) The path to write the output CSV to. If None, one will be constructed.",
    ),
):
    """
    Ingests a CSV cost table generated via the GCP UI and performs a transformation.
    We assume the input CSV file has the following columns (which are subject to
    change by GCP):
    - Project name
    - Month
    - Subtotal ($)
    We aim to have an output CSV file with columns:
    - Project name
    - start-month
    - ...
    - end-month
    - Total
    where:
    - start-month...end-month are unique values from the Month column in the
      input file
    - Project name are unique entries from the input file
    - The Total column is the sum across all month columns for each project
    """
    # Read the CSV file into a pandas dataframe. Only select relevant columns from
    # the input file: [Project name, Month, Subtotal ($)]. Rename the columns so
    # that they are all lower case and any whitespace in column names is replaced
    # with an underscore.
    df = pd.read_csv(
        input_path, usecols=["Month", "Project name", "Subtotal ($)"]
    ).rename(columns=lambda col: col.strip().lower().replace(" ", "_"))

    # Aggregate and pivot the dataframe into desired format
    transformed_df = (
        # Group the data by project name and month, and sum the subtotals
        df.groupby(["project_name", "month"]).sum()
        # Pivot the data so project name is the index and months are columns
        .pivot_table(index="project_name", columns="month", values="subtotal_($)")
        # Create a new column containing the total across all the months
        .assign(total=lambda df: df.sum(axis=1))
    )

    if output_path is None:
        # Find all the column names that match the regex `[0-9]*-[0-9]*`
        months = [
            col
            for col in transformed_df.columns
            if re.match("[0-9]*-[0-9]*", col) is not None
        ]

        # Construct output filename
        output_path = Path(
            f"2i2c_dedicated_cluster_billing_GCP_{months[0]}_{months[-1]}.csv"
        )

    print_colour(f"Writing output CSV to: {output_path}")

    # Save CSV file
    transformed_df.to_csv(output_path)