This ETL works from the raw AWS Cost & Usage Report (CUR). It is implemented as a series of SQL statements with no out-of-band processing. See `aws/dag-full-load.txt` for the order of queries.
See also this very long article about cloud cost control in general and how to use Cloudstats on AWS in particular.
The hope is that Cloudstats gives you a clean basis for your capacity planning and efficiency work. This work is based on experience with large AWS accounts over many years. Its accuracy is not guaranteed, but there are built-in ways to check its completeness and consistency.
There are two main output tables:
- `cloudstats` is a per-day aggregation and cleanup of only the AWS usage lineitems, with fully-amortized accrual-basis cost measurements.
- `cloudstats_bill` is a per-month cash-basis aggregation of all lineitems (including taxes, credits, etc.) reverse-engineered from the observed behavior of official AWS invoices. (See the code for more details.) It is meant to predict the official invoice to within 1% (see the reconciliation sketch after this list). It also provides much more detail, eg explanations of individual Taxes and Credits, and is updated daily.
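As a quick consistency check, you can roll `cloudstats_bill` up to one invoice and compare against the invoice AWS actually sent you. A minimal sketch, assuming columns named `invoice_id`, `billing_period`, and `cost` (the real names are in the table definition):

```sql
-- Sum the cash-basis lineitems for one month's invoice and compare the
-- result to the official AWS invoice total; it should land within ~1%.
-- Column names here are assumptions; check cloudstats_bill for the real ones.
select
    invoice_id,
    sum(cost) as predicted_invoice_total
from cloudstats_bill
where billing_period = date '2024-01-01'
group by 1
order by 1;
```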
There are several views laid on top of these main tables for specialized analyses:
- `cloudstats_bill_iceberg` does further attribution of discounts and credits, for example backing out the separate effects of EDP, private rates, and reserved instance usage.
- `cloudstats_cpu_ratecard` renders the historical per-CPU-hour rates you paid, aggregated on instance type and `pricing_regime`. This includes percentile bands for real-world Spot prices, amortized RIs and Savings Plans, and so on. (An example query follows this list.)
- `cloudstats_buying_efficiency` attempts to show new opportunities for using RI, SP, and Spot based on historical OnDemand usage.
- `cloudstats_wow_cost_movers` does week-over-week comparisons of spend, broken out by a few AWS and client-specific dimensions, to find interesting gainers and losers.
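For example, a typical question for the ratecard view is "what has a c5 CPU-hour actually cost us, by purchase option?" A sketch, assuming columns named `usage_date`, `instance_spec`, `pricing_regime`, and `cpu_hour_rate` (check the view definition for the real names):

```sql
-- Average effective per-CPU-hour rate by month and pricing regime for
-- one instance family. All column names here are assumptions.
select
    date_trunc('month', usage_date) as month,
    pricing_regime,                      -- eg OnDemand, Spot, RI, SavingsPlan
    avg(cpu_hour_rate) as avg_cpu_hour_rate
from cloudstats_cpu_ratecard
where instance_spec like 'c5.%'
group by 1, 2
order by 1, 2;
```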
There are several dimension tables for client-specific needs and to fix up hilarious data bugs in the raw table:
- `cloudstats_dim_account_names.sql` is a list of your internal account aliases.
- `cloudstats_dim_pricing_buckets.sql` maps the cleaned-up `pricing_unit` field into a handful of buckets like 'Compute', 'Storage', and so on.
- `cloudstats_dim_instance_specs.sql` provides machine specs based on an artificial field called `instance_spec`. The CUR's original `instance_type` field has lots of junk in it, and many records in the CUR have missing values for important fields like physical processor. For example, an `instance_type` of "ml.c5.xlarge-Training" will have an `instance_spec` of "c5.xlarge" (see the sketch after this list). This table is compiled from public sources and should be updated from time to time.
- `cloudstats_dim_aws_products.sql` maps `product_code` to the official product name as it appears in the AWS bill. Not all of the CUR's lineitems have this information (!). This is also hand-compiled from public sources and may be incomplete.
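To make the `instance_spec` idea concrete, here is a rough sketch of the kind of normalization-and-join it enables. The regexp and the raw column names are illustrative, not the actual ETL logic:

```sql
-- Normalize raw values like "ml.c5.xlarge-Training" down to "c5.xlarge",
-- then pick up machine specs from the dimension table. The regexp and
-- table/column names here are illustrative; the real derivation lives
-- in the ETL itself.
select
    raw.product_instance_type,
    regexp_extract(raw.product_instance_type,
                   '([a-z]+[0-9][a-z]*\.[0-9]*[a-z]+)', 1) as instance_spec,
    specs.vcpu,
    specs.physical_processor
from raw_cur raw
left join cloudstats_dim_instance_specs specs
    on specs.instance_spec = regexp_extract(raw.product_instance_type,
                                            '([a-z]+[0-9][a-z]*\.[0-9]*[a-z]+)', 1);
```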
- The incremental load re-writes the previous 65 days of data, each day (a sketch of the pattern follows this list). This is because the CUR is a running account, not an immutable log. The source records in your raw CUR table may be updated or added weeks after the fact. For example, the `invoice_id` field is null until after the month's accounting close, usually 3 days after the calendar month. Once the invoice is generated by AWS, all of those records are updated with the invoice ID. Credits and monthly charges are often added post-hoc.
- The first two stages of the dag, `cloudstats_00` and `cloudstats_01`, are implemented as SQL views. There are no staging tables. This was for speed of development. This ETL was originally written for columnstore databases with partitioning & predicate pushdown (eg, Trino or Athena), so this was not much of a performance problem. On Redshift this pattern is arguably a bad idea. See the `where` clause at the bottom of `cloudstats_00` for more notes.
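The rolling rewrite itself is the usual delete-and-reinsert window. A minimal sketch of the pattern, using a hypothetical `usage_date` column and assuming the final table is fed straight from the `cloudstats_01` view stage (see `aws/dag-full-load.txt` for the real statements):

```sql
-- Throw away the trailing 65 days, then rebuild them from the view stage,
-- picking up any records AWS restated after the fact. Names are illustrative.
delete from cloudstats
where usage_date >= current_date - interval '65' day;

insert into cloudstats
select *
from cloudstats_01
where usage_date >= current_date - interval '65' day;
```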
My general opinion is that you can't (or shouldn't) do automated anomaly detection until you have a very good idea about what is normal. A simple WoW comparator will very likely trigger tons of false positives. Simplistic thresholds will also generate alarm fatigue. Who cares if the `foobar` service increased by 212% (ZOMG!) when its cost basis is five bucks? At minimum, put a materiality floor on any comparator, as sketched below.
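A sketch of that floor, assuming a daily `cloudstats` with `usage_date`, `service_name`, and `cost` columns (names are assumptions): a big relative move only alerts when the absolute dollars are also worth a human's attention.

```sql
-- Week-over-week movers with a materiality floor: a 212% jump on a
-- five-dollar service never fires. Column names are assumptions.
with weekly as (
    select
        date_trunc('week', usage_date) as week,
        service_name,
        sum(cost) as cost
    from cloudstats
    group by 1, 2
),
compared as (
    select
        week,
        service_name,
        cost,
        lag(cost) over (partition by service_name order by week) as prior_cost
    from weekly
)
select *
from compared
where prior_cost > 0
  and abs(cost - prior_cost) > 500        -- materiality floor, in dollars
  and abs(cost / prior_cost - 1) > 0.25;  -- and a meaningful relative move
```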
My favorite example is storage cost. You will find exhaustive complaints about AWS's 31-day pricing logic in the comments. The main `cloudstats` table works hard to smooth it out.
But storage also comes in tiers, which are charged in descending order during the month. If you have a PB of data in S3, the first day of the month will show a little spike in total cost and effective rate paid. The following few days will show a lower rate, then the rest of the month will plateau at the lowest rate. Next month the cycle starts again. Even a z-score can be fooled by that.
Even worse, the automatic processes that move data to and from cheaper tiers can happen at any time. Storage anomaly detection should be run against usage & storage tier (eg `storage_total_tb` and `storage_class`) and not cost, as sketched below. And even then it needs to be smart enough to not pull the fire alarm for "normal" tier shifts.
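Concretely, that means trending the usage measure per tier rather than the dollars. A sketch, again with assumed column names (`storage_total_tb`, `storage_class`, `product_code`):

```sql
-- Trend stored terabytes per storage class. A drift here is a real change
-- in the data; a cost move alone may just be the monthly tier cycle.
select
    usage_date,
    storage_class,                  -- eg Standard, Glacier, Intelligent-Tiering
    sum(storage_total_tb) as total_tb
from cloudstats
where product_code = 'AmazonS3'
group by 1, 2
order by 1, 2;
```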
Another pitfall is changing tags. Engineers often rename their clusters & services on their own schedule and for their own reasons. A simplistic detector grouped by service name will panic twice: when the `foobar` service suddenly drops to zero and when the new `foobar_v2` service pops out of nowhere with significant cost.
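One partial mitigation is to pair the disappearances with the appearances before alarming: a service that vanishes in the same week a similar-sized one shows up is probably a rename, not an incident. A rough sketch of that heuristic (the join logic and names are mine, not part of the ETL):

```sql
-- Candidate renames: services present last week but gone this week,
-- matched to brand-new services of similar cost. Flag for human review.
with this_week as (
    select service_name, sum(cost) as cost
    from cloudstats
    where usage_date >= current_date - interval '7' day
    group by 1
),
last_week as (
    select service_name, sum(cost) as cost
    from cloudstats
    where usage_date >= current_date - interval '14' day
      and usage_date <  current_date - interval '7' day
    group by 1
)
select
    gone.service_name as disappeared,
    born.service_name as appeared,
    gone.cost         as old_weekly_cost,
    born.cost         as new_weekly_cost
from last_week gone
join this_week born
  on born.cost between gone.cost * 0.8 and gone.cost * 1.2  -- similar size
where gone.service_name not in (select service_name from this_week)
  and born.service_name not in (select service_name from last_week);
```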
There is no general solution to anomaly detection over changing infrastructure. All I can recommend is to put a human in the loop before you email the world.