Note: While I really enjoy the eye-widening concepts and ideas described in this presentation, the end result is a big mess for a human user. There is a huge tradeoff between efficiency of NoSQL and readability by human beings, not to mention the loss of flexibility.
The Key
- NoSQL vs. SQL data modeling with examples
The Takeaways
- Partition overloading
- PK =
CustomerID
- Use many types of SKs
- when SK =
CustomerID
, the data is customer account info - when SK =
DateTime + #O1
, the data is an order - when SK =
DateTime + #O1 + #L1
, the data is an item in the order
- when SK =
- Take advantage of the schemaless feature of DynamoDB
- Avoid
JOIN
s and complex workflow
- PK =
- Write sharding
- DynamoDB scales infinitely by splitting storage and throughput into partitions
- It is very important to spread data evenly to avoid “hot” partitions
- You may do so by adding suffixes to your partition key values
- Random suffix (e.g.
#1
to#200
)= most even distribution, good for aggregation, user query all suffixes and combine results - Calculated suffix based on an attribute value = for querying on some attributes
- Random suffix (e.g.
- This sends data to different partitions to avoid hot partitions
- Index overloading
- Go beyond the limit of 20 GSIs, use generic names for data attributes and overload them
- A
Data
field may contain name for employee item, and location for office item, etc. (see example table below)
- Document vs. wide column
- Order results by sort key, and need to filter on a 3rd attribute
- Approach 1: use conditional filter
- this happens after results are read, sp you still pay for all results before filtering
- Approach 2: create a composite key, like
DateGender
,OccupationBirthday
, etc.
- Approach 1: use conditional filter
- Modeling complex relations
- List all possible access patterns
- Work out primary keys, indexes, key conditions, filter conditions
- Optimizing for writes
- When versioning an item, if most attributes do not change, there is no need to replicate the whole record, just store the delta
- Do not store frequently changing data with non-changing attributes, as write operations will use the whole item size as its WCU consumption
- You can store them as two items sharing the same PK (partition key), use SK to differentiate when updating, and use PK to query and get both
The Non-essentials
- History of data processing
- General overview of DynamoDB
- Lots of diss on Cassandra and MongoDB, on their in-elasticity
- Most if not all techniques mentioned in presentation are already there in the doc, but it is nonetheless good to hear a real person articulating them
- Mentions that
contains()
is not supported on sort keys, but no mention of such restriction in official doc - NoSQL Workbench
- NoSQL != Non-relational, data is still relational and always will be, just modeled and accessed differently
- This guy was talking at 300% speed, I verified my player was not hyper-speeded
See Also


Note: the word “optional” next to sort key could be misleading. If you define both partition and sort key then you MUST provide values for both keys when inserting items, the two keys together forms a primary key (or composite key as it is called)

JOIN
s into one “set” of items, using different sort keys to indicate the content of each row

Source == "Online"
) on Amazon could make certain partitions extremely hot, to avoid that, append GSI partition key Source
with numeric random suffixes (e.g. #1
to #1000
) will split the data into multiple partitions (write sharding) thus prevent capacity overrun; this also promotes parallel processing
Note: this picture has its own ambiguity as well, we can overload the data and semantics of the
Data
attribute, create multiple GSIs using the attribute, but its data type cannot be overloaded (in this case it is a String
).



JOIN
s





