Cloudstats ETL

This ETL works from the raw AWS Cost & Usage Report (CUR). It is implemented as a series of SQL statements with no out-of-band processing. See aws/dag-full-load.txt for the order of queries.

See also this very long article about cloud cost control in general and how to use Cloudstats on AWS in particular.

The hope is that Cloudstats gives you a clean basis for your capacity planning and efficiency work. This work is based on experience with large AWS accounts over many years. Its accuracy is not guaranteed, but there are built-in ways to check its completeness and consistency.

Main output

There are two main output tables:

  1. cloudstats is a per-day aggregation and cleanup of only the AWS usage lineitems, with fully-amortized accrual-basis cost measurements.

  2. cloudstats_bill is a per-month cash-basis aggregation of all lineitems (including taxes, credits, and so on) reverse-engineered from the observed behavior of official AWS invoices. (See the code for more details.) It is meant to predict the official invoice to within 1%. It also provides much more detail, e.g. explanations of individual Taxes and Credits, and is updated daily.
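
For orientation, here is a minimal sketch of how these two tables might be queried. The column names below (usage_date, account_id, amortized_cost, billing_period, invoice_total) are assumptions for illustration, not the actual schema; check the table definitions in this repo.

    -- Daily amortized (accrual-basis) cost per account from cloudstats.
    SELECT usage_date, account_id, SUM(amortized_cost) AS daily_cost
    FROM cloudstats
    GROUP BY usage_date, account_id
    ORDER BY usage_date, daily_cost DESC;

    -- Monthly cash-basis total from cloudstats_bill, to compare against the
    -- official AWS invoice.
    SELECT billing_period, SUM(invoice_total) AS predicted_invoice
    FROM cloudstats_bill
    GROUP BY billing_period
    ORDER BY billing_period;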

Satellite views

There are several views laid on top of these main tables for specialized analyses:

  1. cloudstats_bill_iceberg does further attribution of discounts and credits. For example, it backs out the separate effects of EDP, private rates, and reserved instance usage.

  2. cloudstats_cpu_ratecard renders the historical per-CPU-hour rates you paid, aggregated by instance type and pricing_regime. This includes percentile bands for real-world Spot prices, amortized RIs and Savings Plans, and so on.

  3. cloudstats_buying_efficiency attempts to show new opportunities for using RI, SP, and Spot based on historical OnDemand usage.

  4. cloudstats_wow_cost_movers does week-over-week comparisons of spend broken out by a few AWS and client-specific dimensions, to find interesting gainers and losers.
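
As one example of using these views, a query like the following might surface the biggest week-over-week movers while ignoring tiny cost bases. The column names (dimension, last_week_cost, this_week_cost) and the 1000-dollar floor are assumptions for illustration, not the view's actual schema.

    -- Largest week-over-week movers, with a floor to cut alarm noise.
    SELECT dimension,
           last_week_cost,
           this_week_cost,
           this_week_cost - last_week_cost AS delta
    FROM cloudstats_wow_cost_movers
    WHERE ABS(this_week_cost - last_week_cost) > 1000
    ORDER BY ABS(this_week_cost - last_week_cost) DESC
    LIMIT 20;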

Dimension tables

There are several dimension tables for client-specific needs and to fix up hilarious data bugs in the raw table:

  1. cloudstats_dim_account_names.sql is a list of your internal account aliases.

  2. cloudstats_dim_pricing_buckets.sql maps the cleaned-up pricing_unit field into a handful of buckets like 'Compute', 'Storage', and so on.

  3. cloudstats_dim_instance_specs.sql provides machine specs based on an artificial field called instance_spec. The CUR's original instance_type field has lots of junk in it, and many records in the CUR have missing values for important fields like physical processor. For example, an instance_type of "ml.c5.xlarge-Training" will have an instance_spec of "c5.xlarge". This table is compiled from public sources and should be updated from time to time.

  4. cloudstats_dim_aws_products.sql maps product_code to the official product name as it appears in the AWS bill. Not all of the CUR's lineitems have this information (!). This is also hand-compiled from public sources and may be incomplete.
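
These dimension tables are meant to be joined onto the main tables. A rough sketch of the pattern, assuming cloudstats carries instance_spec and amortized_cost columns and the dimension table has a vcpu column (verify against the actual DDL):

    -- Amortized cost with machine specs attached via the instance_spec dimension.
    SELECT c.instance_spec,
           d.vcpu,
           SUM(c.amortized_cost) AS cost
    FROM cloudstats AS c
    LEFT JOIN cloudstats_dim_instance_specs AS d
      ON d.instance_spec = c.instance_spec
    GROUP BY c.instance_spec, d.vcpu
    ORDER BY cost DESC;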

Quirks and gotchas

This ETL is unusual for two reasons:

  1. The incremental load rewrites the previous 65 days of data, each day. This is because the CUR is a running account, not an immutable log. The source records in your raw CUR table may be updated or added weeks after the fact. For example, the invoice_id field is null until after the month's accounting close, usually about three days after the calendar month ends. Once the invoice is generated by AWS, all of those records are updated with the invoice ID. Credits and monthly charges are often added post-hoc. (A sketch of this rolling rewrite appears after this list.)

  2. The first two stages of the DAG, cloudstats_00 and cloudstats_01, are implemented as SQL views. There are no staging tables. This was for speed of development. This ETL was originally written for columnstore databases with partitioning & predicate pushdown (e.g., Trino or Athena), so this was not much of a performance problem. In Redshift this pattern is arguably a bad idea. See the where clause at the bottom of cloudstats_00 for more notes.
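
For context, the rolling 65-day rewrite mentioned above is typically implemented as a delete-and-reinsert over the trailing window. The statements below are a sketch only, using Trino-style date arithmetic and assumed table and column names; the actual statements live in the SQL files in this repo.

    -- Re-derive the trailing 65 days on every run, because the raw CUR keeps
    -- mutating (invoice IDs, credits, post-hoc charges) for weeks.
    DELETE FROM cloudstats
    WHERE usage_date >= CURRENT_DATE - INTERVAL '65' DAY;

    INSERT INTO cloudstats
    SELECT *
    FROM cloudstats_01   -- the view stage over the raw CUR
    WHERE usage_date >= CURRENT_DATE - INTERVAL '65' DAY;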

Anomaly detection

My general opinion is that you can't (or shouldn't) do automated anomaly detection until you have a very good idea about what is normal. A simple WoW comparator will very likely trigger tons of false positives. Simplistic thresholds will also generate alarm fatigue. Who cares if the foobar service increased by 212% (ZOMG!) when its cost basis is five bucks?

My favorite example is storage cost. You will find exhaustive complaints about AWS's 31-day pricing logic in the comments. The main cloudstats table works hard to smooth it out.

But storage also comes in tiers, which are charged in descending order during the month. If you have a PB of data in S3, the first day of the month will show a little spike in total cost and effective rate paid. The following few days will show a lower rate, then the rest of the month will plateau at the lowest rate. Next month the cycle starts again. Even a z-score can be fooled by that.

Even worse, the automatic processes that move data to and from cheaper tiers can happen at any time. Storage anomaly detection should be run against usage & storage tier (e.g. storage_total_tb and storage_class) and not cost. And even then it needs to be smart enough to not pull the fire alarm for "normal" tier shifts.
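
A sketch of that usage-first approach, comparing stored volume week over week per storage class. The column names follow the ones mentioned above (usage_date, storage_class, storage_total_tb) and the 10% threshold is arbitrary; treat both as assumptions to verify against the real schema.

    -- Week-over-week change in stored TB per storage class, not cost.
    WITH weekly AS (
        SELECT date_trunc('week', usage_date) AS wk,
               storage_class,
               AVG(storage_total_tb) AS tb
        FROM cloudstats
        GROUP BY 1, 2
    )
    SELECT cur.wk,
           cur.storage_class,
           prev.tb AS last_week_tb,
           cur.tb  AS this_week_tb
    FROM weekly AS cur
    JOIN weekly AS prev
      ON prev.storage_class = cur.storage_class
     AND prev.wk = cur.wk - INTERVAL '7' DAY
    WHERE ABS(cur.tb - prev.tb) / NULLIF(prev.tb, 0) > 0.10  -- flag >10% shifts for a human to review
    ORDER BY cur.wk, cur.storage_class;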

Another pitfall is changing tags. Engineers often rename their clusters & services on their own schedule and for their own reasons. A simplistic detector grouped by service name will panic twice: when the foobar service suddenly drops to zero and when the new foobar_v2 service pops out of nowhere with significant cost.
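
One way to keep a human in the loop here is a review query rather than an alert: list the names that appeared or vanished between the two most recent weeks and let someone decide whether it was a rename. The service_name and amortized_cost columns below are assumptions about client-specific tagging.

    -- Services that exist in only one of the last two weeks; a human decides
    -- whether it is a rename, a launch, or a shutdown.
    WITH by_week AS (
        SELECT date_trunc('week', usage_date) AS wk,
               service_name,
               SUM(amortized_cost) AS cost
        FROM cloudstats
        WHERE usage_date >= CURRENT_DATE - INTERVAL '14' DAY
        GROUP BY 1, 2
    ),
    cur  AS (SELECT * FROM by_week WHERE wk = (SELECT MAX(wk) FROM by_week)),
    prev AS (SELECT * FROM by_week WHERE wk = (SELECT MAX(wk) FROM by_week) - INTERVAL '7' DAY)
    SELECT COALESCE(cur.service_name, prev.service_name) AS service_name,
           prev.cost AS last_week_cost,
           cur.cost  AS this_week_cost
    FROM cur
    FULL OUTER JOIN prev ON prev.service_name = cur.service_name
    WHERE cur.service_name IS NULL OR prev.service_name IS NULL
    ORDER BY COALESCE(cur.cost, prev.cost) DESC;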

There is no general solution to anomaly detection over changing infrastructure. All I can recommend is to put a human in the loop before you email the world.