AWS Certified Data Analytics – Specialty Dump 05

A media company has been performing analytics on log data generated by its applications. There has been a recent increase in the number of concurrent analytics jobs running, and the overall performance of existing jobs is decreasing as the number of new jobs is increasing. The partitioned data is stored in Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA) and the analytic processing is performed on Amazon EMR clusters using the EMR File System (EMRFS) with consistent view enabled. A data analyst has determined that it is taking longer for the EMR task nodes to list objects in Amazon S3.
Which action would MOST likely increase the performance of accessing log data in Amazon S3?

  • A. Use a hash function to create a random string and add that to the beginning of the object prefixes when storing the log data in Amazon S3.
  • B. Use a lifecycle policy to change the S3 storage class to S3 Standard for the log data.
  • C. Increase the read capacity units (RCUs) for the shared Amazon DynamoDB table.
  • D. Redeploy the EMR clusters that are running slowly to a different Availability Zone.

Note: The problem is specifically about listing objects in Amazon S3. A addresses S3 request-rate performance rather than listing through EMRFS, and B and D are irrelevant. The key is that EMRFS consistent view uses a DynamoDB table to track S3 object metadata, so list performance depends on that table's provisioned capacity. There are many hidden uses of DynamoDB like this; for example, the Kinesis Client Library also uses a DynamoDB table to store checkpoints (cursors) and other metadata.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-metadata.html
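
For concreteness, here is a minimal boto3 sketch of what option C looks like in practice. The table name `EmrFSMetadata` is the EMRFS default (it can be overridden with `fs.s3.consistent.metadata.tableName`), and the capacity numbers are placeholders, not recommendations; check the actual table name and current throughput for your cluster first.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# "EmrFSMetadata" is the default name of the EMRFS consistent-view metadata
# table; the actual name is set by fs.s3.consistent.metadata.tableName.
# The capacity numbers below are placeholders, not recommendations.
dynamodb.update_table(
    TableName="EmrFSMetadata",
    ProvisionedThroughput={
        "ReadCapacityUnits": 800,    # raise reads to handle heavier LIST traffic
        "WriteCapacityUnits": 100,   # update_table requires both values
    },
)
```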


A company has developed several AWS Glue jobs to validate and transform its data from Amazon S3 and load it into Amazon RDS for MySQL in batches once every day. The ETL jobs read the S3 data using a DynamicFrame. Currently, the ETL developers are experiencing challenges in processing only the incremental data on every run, as the AWS Glue job processes all the S3 input data on each run.
Which approach would allow the developers to solve the issue with minimal coding effort?

  • A. Have the ETL jobs read the data from Amazon S3 using a DataFrame.
  • B. Enable job bookmarks on the AWS Glue jobs.
  • C. Create custom logic on the ETL jobs to track the processed S3 objects.
  • D. Have the ETL jobs delete the processed objects or data from Amazon S3 after each run.

Note: This is a textbook question.

https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
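
As a rough illustration, a Glue ETL script with bookmarks enabled only needs the job to be initialized and committed, plus a `transformation_ctx` on the source read. The database and table names below are assumptions, and the job must also have `--job-bookmark-option job-bookmark-enable` set in its properties.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # bookmark state is keyed by the job name

# The transformation_ctx is what job bookmarks use to remember which S3
# objects were already processed on previous runs.
source = glueContext.create_dynamic_frame.from_catalog(
    database="app_logs_db",        # assumed Data Catalog database
    table_name="raw_events",       # assumed table over the S3 input
    transformation_ctx="source0",
)

# ... validate / transform and write to Amazon RDS for MySQL ...

job.commit()  # persists the bookmark so the next run reads only new objects
```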


A mortgage company has a microservice for accepting payments. This microservice uses the Amazon DynamoDB encryption client with AWS KMS managed keys to encrypt the sensitive data before writing the data to DynamoDB. The finance team should be able to load this data into Amazon Redshift and aggregate the values within the sensitive fields. The Amazon Redshift cluster is shared with other data analysts from different business units.
Which steps should a data analyst take to accomplish this task efficiently and securely?

  • A. Create an AWS Lambda function to process the DynamoDB stream. Decrypt the sensitive data using the same KMS key. Save the output to a restricted S3 bucket for the finance team. Create a finance table in Amazon Redshift that is accessible to the finance team only. Use the COPY command to load the data from Amazon S3 to the finance table.
  • B. Create an AWS Lambda function to process the DynamoDB stream. Save the output to a restricted S3 bucket for the finance team. Create a finance table in Amazon Redshift that is accessible to the finance team only. Use the COPY command with the IAM role that has access to the KMS key to load the data from S3 to the finance table.
  • C. Create an Amazon EMR cluster with an EMR_EC2_DefaultRole role that has access to the KMS key. Create Apache Hive tables that reference the data stored in DynamoDB and the finance table in Amazon Redshift. In Hive, select the data from DynamoDB and then insert the output to the finance table in Amazon Redshift.
  • D. Create an Amazon EMR cluster. Create Apache Hive tables that reference the data stored in DynamoDB. Insert the output to the restricted Amazon S3 bucket for the finance team. Use the COPY command with the IAM role that has access to the KMS key to load the data from Amazon S3 to the finance table in Amazon Redshift.

Note: This is not a good question. C and D can be eliminated because the Redshift cluster is shared, so the data has to land in a finance table that only the finance team can access. B is wrong because the application uses DynamoDB client-side encryption (not S3 client-side encryption), so AWS will not decrypt the data automatically; it must be decrypted manually before being written to S3 and then COPY'd into Redshift. Even if you wanted to use COPY with the ENCRYPTED option for client-side encrypted S3 files, you would have to supply the master symmetric key, not just an IAM role.

However, a DynamoDB stream captures only new changes, so existing data would not be processed; A is not a perfect answer either.

https://docs.aws.amazon.com/redshift/latest/dg/c_loading-encrypted-files.html
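
A minimal sketch of the Lambda function in option A might look like the following. The bucket name is an assumption, and the decryption helper is a hypothetical placeholder standing in for the DynamoDB Encryption Client configured with the same KMS key. Once the decrypted objects are in the restricted bucket, the finance table can be loaded with a standard COPY from that prefix.

```python
import json

import boto3

s3 = boto3.client("s3")
FINANCE_BUCKET = "finance-restricted-bucket"  # assumed restricted bucket name


def decrypt_sensitive_fields(new_image):
    """Hypothetical placeholder: a real implementation would use the DynamoDB
    Encryption Client configured with the same KMS key as the payments
    microservice to turn the ciphertext attributes back into plaintext."""
    return new_image  # pass-through so the sketch stays runnable


def lambda_handler(event, context):
    rows = []
    for record in event.get("Records", []):
        if record["eventName"] in ("INSERT", "MODIFY"):
            rows.append(decrypt_sensitive_fields(record["dynamodb"]["NewImage"]))
    if rows:
        # One object per invocation; the COPY command can then load the prefix.
        s3.put_object(
            Bucket=FINANCE_BUCKET,
            Key=f"payments/{context.aws_request_id}.json",
            Body="\n".join(json.dumps(r) for r in rows),
        )
    return {"records": len(rows)}
```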


A company is building a data lake and needs to ingest data from a relational database that has time-series data. The company wants to use managed services to accomplish this. The process needs to be scheduled daily and bring incremental data only from the source into Amazon S3.
What is the MOST cost-effective approach to meet these requirements?

  • A. Use AWS Glue to connect to the data source using JDBC Drivers. Ingest incremental records only using job bookmarks.
  • B. Use AWS Glue to connect to the data source using JDBC Drivers. Store the last updated key in an Amazon DynamoDB table and ingest the data using the updated key as a filter.
  • C. Use AWS Glue to connect to the data source using JDBC Drivers and ingest the entire dataset. Use appropriate Apache Spark libraries to compare the dataset, and find the delta.
  • D. Use AWS Glue to connect to the data source using JDBC Drivers and ingest the full data. Use AWS DataSync to ensure the delta only is written into Amazon S3.

Note: This is a textbook question.
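
A rough sketch of option A is shown below. Unlike S3 sources, bookmarks on JDBC tables track one or more monotonically increasing columns supplied through `jobBookmarkKeys`. The catalog names, key column, and output path are assumptions.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# For JDBC sources, bookmarks track one or more monotonically increasing
# key columns rather than S3 object timestamps.
incremental = glueContext.create_dynamic_frame.from_catalog(
    database="timeseries_db",                 # assumed catalog database
    table_name="sensor_readings",             # assumed source table
    additional_options={
        "jobBookmarkKeys": ["reading_ts"],    # assumed time-series column
        "jobBookmarkKeysSortOrder": "asc",
    },
    transformation_ctx="jdbc_source",
)

glueContext.write_dynamic_frame.from_options(
    frame=incremental,
    connection_type="s3",
    connection_options={"path": "s3://example-datalake/raw/sensor_readings/"},
    format="parquet",
)

job.commit()
```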


An Amazon Redshift database contains sensitive user data. Logging is necessary to meet compliance requirements. The logs must contain database authentication attempts, connections, and disconnections. The logs must also contain each query run against the database and record which database user ran each query.
Which steps will create the required logs?

  • A. Enable Amazon Redshift Enhanced VPC Routing. Enable VPC Flow Logs to monitor traffic.
  • B. Allow access to the Amazon Redshift database using AWS IAM only. Log access using AWS CloudTrail.
  • C. Enable audit logging for Amazon Redshift using the AWS Management Console or the AWS CLI.
  • D. Enable and download audit reports from AWS Artifact.

Note: A is wrong; enhanced VPC routing only forces traffic between the cluster and S3 through the VPC, it does not log database activity. B is wrong; CloudTrail records API (management) calls against the Redshift service, not the SQL queries run inside the database or which database user ran them. D is wrong; AWS Artifact provides compliance reports, not database logs. This is a textbook question.

https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html
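
Option C can be done from the console, the CLI, or the API; a minimal boto3 sketch is below (cluster identifier, bucket, and prefix are placeholders). Note that the user activity log, which records each query and the database user who ran it, additionally requires `enable_user_activity_logging` to be set to `true` in the cluster's parameter group.

```python
import boto3

redshift = boto3.client("redshift")

# Cluster identifier, bucket, and prefix are placeholders. The target bucket
# needs a policy that lets the Redshift logging service write objects to it.
redshift.enable_logging(
    ClusterIdentifier="analytics-cluster",
    BucketName="example-redshift-audit-logs",
    S3KeyPrefix="audit/",
)
```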
