DAT-403R DynamoDB deep dive: Advanced design patterns (re:Invent 2018) – Key Takeaways

The Key

  • NoSQL vs. SQL data modeling with examples

The Takeaways

  • Partition overloading
    • PK = CustomerID
    • Use many types of SKs
      • when SK = CustomerID, the data is customer account info
      • when SK = DateTime + #O1, the data is an order
      • when SK = DateTime + #O1 + #L1, the data is an item in the order
    • Take advantage of the schemaless feature of DynamoDB
    • Avoid JOINs and complex workflow
  • Write sharding
    • DynamoDB scales infinitely by splitting storage and throughput into partitions
    • It is very important to spread data evenly to avoid “hot” partitions
    • You may do so by adding suffixes to your partition key values
      • Random suffix (e.g. #1 to #200)= most even distribution, good for aggregation, user query all suffixes and combine results
      • Calculated suffix based on an attribute value = for querying on some attributes
    • This sends data to different partitions to avoid hot partitions
  • Index overloading
    • Go beyond the limit of 20 GSIs, use generic names for data attributes and overload them
    • A Data field may contain name for employee item, and location for office item, etc. (see example table below)
  • Document vs. wide column
  • Order results by sort key, and need to filter on a 3rd attribute
    • Approach 1: use conditional filter
      • this happens after results are read, sp you still pay for all results before filtering
    • Approach 2: create a composite key, like DateGender, OccupationBirthday, etc.
  • Modeling complex relations
    • List all possible access patterns
    • Work out primary keys, indexes, key conditions, filter conditions
  • Optimizing for writes
    • When versioning an item, if most attributes do not change, there is no need to replicate the whole record, just store the delta

The Non-essentials

  • History of data processing
  • General overview of DynamoDB
  • Lots of diss on Cassandra and MongoDB, on their in-elasticity
  • Most if not all techniques mentioned in presentation are already there in the doc, but it is nonetheless good to hear a real person articulating them
  • Mentions that contains() is not supported on sort keys, but no mention of such restriction in official doc
  • NoSQL Workbench
  • NoSQL != Non-relational, data is still relational and always will be, just modeled and accessed differently
  • This guy was talking at 300% speed, I verified my player was not hyper-speeded

See Also

DynamoDB table structure
Note: the word “optional” next to sort key could be misleading. If you define both partition and sort key then you MUST provide values for both keys when inserting items, the two keys together forms a primary key (or composite key as it is called)
This is an amazing slides showing the partition overloading technique: converting JOINs into one “set” of items, using different sort keys to indicate the content of each row
Use secondary indexes to create access patterns for more fields
Querying all online orders (Source == "Online") on Amazon could make certain partitions extremely hot, to avoid that, append GSI partition key Source with numeric random suffixes (e.g. #1 to #1000) will split the data into multiple partitions (write sharding) thus prevent capacity overrun; this also promotes parallel processing
Table schema for GSI Overloading.
I found the slide in the presentation less clear than this one in the official doc; it clearly shows again how to exploit the schemaless feature of DynamoDB table by overloading attributes and create GSI keys on them (index overloading); one index’s sort key can be the other’s partition key; caution: you still need a way to clearly articulate the mini-structures of data
Note: this picture has its own ambiguity as well, we can overload the data and semantics of the Data attribute, create multiple GSIs using the attribute, but its data type cannot be overloaded (in this case it is a String).
What hot partitions look like visually
Well-designed partitions
You get a 5 minutes burst period by accumulating previously unused provisioned capacities
Movie, actor, book all in the same database, using the same PK attribute, data replicated, O(1) complexity, leave frequently-changing data in its own item but duplicate stable data to avoid JOINs
Modeling relational data, sort out your access patterns and list key / filter conditions, work out the keys
This is an example of how to document your tables with lots of overloaded indexes
Insurance quoting example, quotes need to be versioned, so user replicate the whole record
Alter the schema, write only deltas, 1K WCU to 50 WCU