Merge pull request 2i2c-org#3992 from sgibson91/billing/transform-cost-tables

Automate cloud billing CSV transformation for dedicated clusters on GCP and AWS
sgibson91 authored May 15, 2024
2 parents 1aa6675 + 96f4dfd commit 22120f3
Showing 5 changed files with 230 additions and 30 deletions.
74 changes: 46 additions & 28 deletions deployer/README.md
@@ -34,8 +34,14 @@ The deployer has the following directory structure:
```

### The `cli_app.py` file

The `cli_app.py` file contains the main `deployer` typer app, together with the main sub-apps "attached" to it, each corresponding to a `deployer` sub-command. These apps are used throughout the codebase.
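
For example, a sub-app is attached to the main app with typer's `add_typer` (a minimal sketch of the pattern, using the `transform` sub-app added in this commit; see the `cli_app.py` diff further down):

```python
import typer

# The main `deployer` app and a sub-app, each a typer application
app = typer.Typer(pretty_exceptions_show_locals=False)
transform_app = typer.Typer(pretty_exceptions_show_locals=False)

# "Attach" the sub-app so it is reachable as `deployer transform ...`
app.add_typer(
    transform_app,
    name="transform",
    help="Programmatically transform datasets, such as cost tables for billing purposes.",
)
```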

### The `__main__.py` file

The `__main__.py` file is the main file that gets executed when the deployer is called.
If you are adding any sub-command functions, you **must** import them in this file for them to be picked up by the package.
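
For example, the `transform cost-table` command added in this commit is registered with a single import, mirrored in the `__main__.py` diff below:

```python
import deployer.commands.transform.cost_table  # noqa: F401
```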

### The `infra_components` directory

This is the directory where the files that define the `Hub` and `Cluster` classes are stored. These files are imported and used throughout the deployer's codebase to emulate these objects programmatically.
@@ -129,34 +135,28 @@ This section describes some of the sub-commands the `deployer` can carry out.
**Command line usage:**

```bash

Usage: deployer [OPTIONS] COMMAND [ARGS]...

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion    Install completion for the current shell.                                                 │
│ --show-completion       Show completion for the current shell, to copy it or customize the installation.          │
│ --help                  Show this message and exit.                                                               │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ cilogon-client          Manage cilogon clients for hubs' authentication.                                          │
│ debug                   Debug issues by accessing different components and their logs                             │
│ decrypt-age             Decrypt secrets sent to `[email protected]` via `age`                                       │
│ deploy                  Deploy one or more hubs in a given cluster                                                │
│ deploy-support          Deploy support components to a cluster                                                    │
│ exec                    Execute a shell in various parts of the infra. It can be used for poking around, or       │
│                         debugging issues.                                                                         │
│ generate                Generate various types of assets. It currently supports generating files related to      │
│                         billing, new dedicated clusters, helm upgrade strategies and resource allocation.         │
│ grafana                 Manages Grafana related workflows.                                                        │
│ run-hub-health-check    Run a health check on a given hub on a given cluster. Optionally check scaling of dask    │
│                         workers if the hub is a daskhub.                                                          │
│ transform               Programmatically transform datasets, such as cost tables for billing purposes.            │
│ use-cluster-credentials Pop a new shell or execute a command after authenticating to the given cluster using the  │
│                         deployer's credentials                                                                    │
│ validate                Validate configuration files such as helm chart values and cluster.yaml files.            │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

### Standalone sub-commands related to deployment
@@ -340,6 +340,24 @@ Gets the clusters that are in the infrastructure repository but are NOT registered
##### `grafana central-ds get-rm-candidates`
Gets the list of datasources that are registered in the central Grafana but are NOT in the list of clusters in the infrastructure repository. Usually this happens when a cluster was decommissioned, but its prometheus server was not removed from the datasources of the central 2i2c Grafana. This list can then be used to know which datasources to remove.

### The `transform` sub-command

This sub-command can be used to transform various datasets.

#### `transform cost-table`

This sub-command helps engineers transform cost tables generated by cloud vendors into the format required by our fiscal sponsor in order for them to bill our communities.
The transformation is automated to avoid copy-paste errors when moving data from one CSV file to another.
The transformation runs locally and writes a new CSV file to the working directory, which then needs to be handed over manually to CS&S staff.

##### `transform cost-table aws`

Transforms a cost table generated in the AWS UI into the correct format.
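
A typical invocation might look like the following (the filename is illustrative; `--output-path` is optional and, if omitted, a name is constructed from the months present in the data):

```bash
deployer transform cost-table aws monthly-costs-by-linked-account.csv
```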

##### `transform cost-table gcp`

Transforms a cost table generated in the GCP UI into the correct format.
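
For example (illustrative filenames, with the optional output path given explicitly):

```bash
deployer transform cost-table gcp gcp-cost-table.csv --output-path 2i2c_dedicated_cluster_billing_GCP_2024-01_2024-03.csv
```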

### The `validate` sub-command

This sub-command is used to validate the values files for each of our hubs against their helm chart's values schema.
1 change: 1 addition & 0 deletions deployer/__main__.py
@@ -20,6 +20,7 @@
import deployer.commands.grafana.central_grafana # noqa: F401
import deployer.commands.grafana.deploy_dashboards # noqa: F401
import deployer.commands.grafana.tokens # noqa: F401
import deployer.commands.transform.cost_table # noqa: F401
import deployer.commands.validate.config # noqa: F401
import deployer.keys.decrypt_age # noqa: F401

6 changes: 6 additions & 0 deletions deployer/cli_app.py
@@ -18,6 +18,7 @@
exec_app = typer.Typer(pretty_exceptions_show_locals=False)
grafana_app = typer.Typer(pretty_exceptions_show_locals=False)
validate_app = typer.Typer(pretty_exceptions_show_locals=False)
transform_app = typer.Typer(pretty_exceptions_show_locals=False)

app.add_typer(
    generate_app,
@@ -51,3 +52,8 @@
    name="validate",
    help="Validate configuration files such as helm chart values and cluster.yaml files.",
)
app.add_typer(
    transform_app,
    name="transform",
    help="Programmatically transform datasets, such as cost tables for billing purposes.",
)
174 changes: 174 additions & 0 deletions deployer/commands/transform/cost_table.py
@@ -0,0 +1,174 @@
import re
from pathlib import Path

import pandas as pd
import typer

from deployer.cli_app import transform_app
from deployer.utils.rendering import print_colour

# Creates a new typer application, which is then nested as a sub-command named
# "cost-table" under the `transform` sub-command of the deployer.
cost_table_app = typer.Typer(pretty_exceptions_show_locals=False)
transform_app.add_typer(
    cost_table_app,
    name="cost-table",
    help="Transform the cost table from a cloud provider when completing billing cycles",
)


@cost_table_app.command()
def aws(
    input_path: Path = typer.Argument(
        ..., help="The file path to the cost table downloaded from the AWS UI"
    ),
    output_path: Path = typer.Option(
        None,
        help="(Optional) The path to write the output CSV to. If None, one will be constructed.",
    ),
):
"""
Ingests a CSV cost table generated via the AWS UI and performs a transformation.
We assume the input CSV has a column for each linked account and rows for:
- the linked account ID and name
- a monthly total for each month selected
- a linked account total across the month range
We aim to have an output CSV file with columns:
- linked account name
- start-month
- ...
- end-month
- linked account total
"""
    # Read the CSV file into a pandas dataframe. Skip the first row since it
    # contains numerical project IDs - the project names begin on the second row.
    # We conditionally rename the column names. If '($)' is present in a column
    # name, we assume it is a linked account name: we strip off the '($)' suffix
    # and any leading/trailing whitespace, and convert to lower case, leaving
    # just the account name (AWS permits whitespace within account names, which
    # we preserve).
    # Otherwise (i.e. not an account name), we additionally replace any
    # whitespace with underscores for easier data cleaning in pandas.
    df = pd.read_csv(
        input_path,
        skiprows=1,
    ).rename(
        columns=lambda col: (
            col.lower().strip("($)").strip()
            if "($)" in col
            else col.strip().lower().replace(" ", "_")
        )
    )
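
    # For illustration (hypothetical column names): "My Account ($)" becomes
    # "my account", while "Linked account name" becomes "linked_account_name".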

    # Ensure values of the linked_account_name column are lower case and any
    # whitespace is replaced with underscores.
    # Note that because AWS outputs the linked account names as a row, this
    # column name is misleading - we are not affecting linked account names here.
    df["linked_account_name"] = df["linked_account_name"].apply(
        lambda val: val.strip().lower().replace(" ", "_")
    )

    # Set the linked_account_name column as the index
    df.set_index("linked_account_name", drop=True, inplace=True)

    # Drop the 'total costs' column. This column is the total across all linked
    # accounts, and hence not necessary. We use the linked_account_total column
    # for the total per linked account.
    df.drop("total costs", axis=1, inplace=True)

    # Transpose the dataframe
    df = df.T

    # Sort the columns
    df = df.reindex(sorted(df.columns), axis=1)

    # Sort the account names in alphabetical order
    df.sort_index(inplace=True)
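
    # At this point the dataframe has one row per (lower-cased) account name and
    # one column per month, with the linked_account_total column sorted to the
    # end, e.g. (hypothetical values):
    #
    #              2024-01-01  2024-02-01  linked_account_total
    #   account a        10.0        12.0                  22.0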

    if output_path is None:
        # Find all the column names that match the regular expression
        # `[0-9]*-[0-9]*-[0-9]*`, i.e. the month columns
        months = [
            col
            for col in df.columns
            if re.match("[0-9]*-[0-9]*-[0-9]*", col) is not None
        ]

        # Construct output filename
        output_path = Path(
            f"2i2c_dedicated_cluster_billing_AWS_{months[0]}_{months[-1]}.csv"
        )

    print_colour(f"Writing output CSV to: {output_path}")

    # Save CSV file
    df.to_csv(output_path, index_label="project_name")


@cost_table_app.command()
def gcp(
    input_path: Path = typer.Argument(
        ..., help="The file path to the cost table downloaded from the GCP UI"
    ),
    output_path: Path = typer.Option(
        None,
        help="(Optional) The path to write the output CSV to. If None, one will be constructed.",
    ),
):
"""
Ingests a CSV cost table generated via the GCP UI and performs a transformation.
We assume the input CSV file has the following columns (and are subject to
changes by GCP):
- Project name
- Month
- Subtotal ($)
We aim to have an output CSV file with columns:
- Project name
- start-month
- ...
- end-month
- Total
where:
- start-month...end-month are unique values from the Month column in the
input file
- Project name are unique entries from the input file
- The Total column is the sum across all month columns for each project
"""
    # Read the CSV file into a pandas dataframe, selecting only the relevant
    # columns from the input file: [Project name, Month, Subtotal ($)]. Rename
    # the columns so that they are all lower case and any whitespace in column
    # names is replaced with an underscore.
    df = pd.read_csv(
        input_path, usecols=["Month", "Project name", "Subtotal ($)"]
    ).rename(columns=lambda col: col.strip().lower().replace(" ", "_"))

    # Aggregate and pivot the dataframe into the desired format
    transformed_df = (
        # Group the data by project name and month, and sum the subtotals
        df.groupby(["project_name", "month"]).sum()
        # Pivot the data so project name is the index and months are columns
        .pivot_table(index="project_name", columns="month", values="subtotal_($)")
        # Create a new column containing the total across all the months
        .assign(total=lambda df: df.sum(axis=1))
    )
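
    # For illustration (hypothetical values): input rows
    #   ("Project A", "2024-01", 10.0) and ("Project A", "2024-02", 12.0)
    # become a single row indexed by "Project A" with columns
    #   2024-01 -> 10.0, 2024-02 -> 12.0, total -> 22.0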

    if output_path is None:
        # Find all the column names that match the regular expression
        # `[0-9]*-[0-9]*`, i.e. the month columns
        months = [
            col
            for col in transformed_df.columns
            if re.match("[0-9]*-[0-9]*", col) is not None
        ]

        # Construct output filename
        output_path = Path(
            f"2i2c_dedicated_cluster_billing_GCP_{months[0]}_{months[-1]}.csv"
        )

    print_colour(f"Writing output CSV to: {output_path}")

    # Save CSV file
    transformed_df.to_csv(output_path)
5 changes: 3 additions & 2 deletions docs/howto/bill.md
@@ -56,8 +56,9 @@ AWS management account. If a future cluster deviates from this, you can tell by
4. Visit the "Monthly Costs By Linked Account" report ([direct link]) via "Billing and Cost Management" -> "Cost Explorer Saved Reports"
5. On the right sidebar under "Time -> Date Range", select all the completed months we want to get data for
6. On the right sidebar under "Time -> Granularity", ensure it is set to "Monthly"
7. On the right sidebar under "Group by -> Dimension", select "Linked account"
8. Click the 'Download as CSV' button
9. Copy AWS costs

The CSV file has rows for each month, and columns for each project. Copy it
into the spreadsheet, making sure the rows and columns both match what is
