Convalesco | Notes on dynamodb auto-scaling

DynamoDB is a fast and flexible NoSQL database service for any scale offered by Amazon. As most AWS offerings calculating the usage cost can be quite a challenge. The important thing to remember when working with dynamoDB is that writes are expensive while reads are cheap.

We’re evaluating dynamoDB auto-scaling in my workplace. Auto-scaling seems to work well for bell-shaped ops patterns. If the ops pattern is not predictable, might be a good idea to take a look at an alternative auto-scaler for dynamo.

There are three values that take part dynamoDB auto-scaling:

consumed capacity: The sum of write or read operations
provisioned capacity: The provisioned value for read or write operations
target utilization: (consumed capacity / provisioned capacity) * 100

The target utilization is the percentage auto-scaling will adjust provisioned capacity, as the consumed capacity grows. It’s essentially the buffer you need to avoid throttling. When throttling occurs the request will receive an error, HTTP 400 status code, and will not be served.

One read capacity unit consists of any item up to 4KB. If it’s more than that, you get charged accordingly. For example fetching a 7KB item, will count as 3 read capacity units. That is true for strongly consistent data.

For eventually consistent data two read items will count as one read item and thus cut down the cost by half. Makes sense since eventually consistent is cheaper in many ways.

For writes, the capacity unit is 1KB in size. So to write 10 KB/s you need at least 10 write capacity units.

Dynamo capacity units support bursting. Bursting will collect and make available the unused capacity units of the last 5 minutes. This will allow most to avoid throttling on sudden spikes. Burst capacity is available as a CloudWatch metric, luckily. Burst capacity units are used for underlying operations as maintenance, the documentation advises against leaning heavily on available burst capacity.

There’s no constraint in the number of scale-up actions but only 4 closely timed scale-down actions are allowed per day. Afterwards step-down auto-scaling is allowed once every two hours. So although scale-up is instant, scale down will take some time.

Batch reads and writes have the same cost as the sum of the batch. To improve throughput performance and avoid throttling distributed write/read activity is advised.

When throttling happens even though the provisioned capacity is bigger than the consumed capacity, the reason is a hot partition. A hot partition, is a partitions which performs high number of ops compared to other partitions. That is a bit tricky to solve when we have only one partition. The solution in that case is to decrease the target utilization. Partitioning is handled automatically by dynamoDB. One partition takes up to 3k items. The next item will be written in a new partition. Ideally, writes and reads should be distributed across partitions.

I noticed that for small write capacity units (e.g. 20-30), you’ll hit throttling for any target utilization above 20%. Note that 20% target utilization, meaning 80% buffer, is the lowest available. Currently I’ve played with low number of capacity writes, from hundreds to few thousands, but the pattern is clear: the higher you go in capacity units the smaller your buffer can be, without throttling. For example a target value of 80% in 10k writes capacity will give you less, if any, throttling compared to 80% target value in 50 or 100 write capacity units.