AWS Certified Data Analytics – Specialty Dump 04

A company has developed an Apache Hive script to batch process data stored in Amazon S3. The script needs to run once every day and store the output in Amazon S3. The company tested the script, and it completes within 30 minutes on a small local three-node cluster.
Which solution is the MOST cost-effective for scheduling and executing the script?

  • A. Create an AWS Lambda function to spin up an Amazon EMR cluster with a Hive execution step. Set KeepJobFlowAliveWhenNoSteps to false and disable the termination protection flag. Use Amazon CloudWatch Events to schedule the Lambda function to run daily.
  • B. Use the AWS Management Console to spin up an Amazon EMR cluster with Python, Hue, Hive, and Apache Oozie. Set the termination protection flag to true and use Spot Instances for the core nodes of the cluster. Configure an Oozie workflow in the cluster to invoke the Hive script daily.
  • C. Create an AWS Glue job with the Hive script to perform the batch operation. Configure the job to run once a day using a time-based schedule.
  • D. Use AWS Lambda layers and load the Hive runtime to AWS Lambda and copy the Hive script. Schedule the Lambda function to run daily by creating a workflow using AWS Step Functions.

B = wrong, the termination protection flag should be off and Spot Instances are a poor fit for core nodes. C = wrong, AWS Glue runs on Apache Spark and cannot run Hive scripts. D = wrong, the Lambda maximum running time is 15 minutes and the job needs about 30.
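For reference, a minimal sketch of option A as a Lambda handler: it launches a transient EMR cluster with a single Hive step via boto3's run_job_flow, with KeepJobFlowAliveWhenNoSteps set to false and termination protection disabled so the cluster shuts down once the script finishes. The release label, instance types, bucket, and script path are placeholders rather than details from the question; a daily CloudWatch Events (EventBridge) rule would invoke the handler.

```python
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    # Spin up a transient EMR cluster that runs the Hive script and then
    # terminates itself once the step completes.
    response = emr.run_job_flow(
        Name="daily-hive-batch",                    # placeholder name
        ReleaseLabel="emr-6.10.0",                  # placeholder release
        Applications=[{"Name": "Hive"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",      # placeholder instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,   # terminate when no steps remain
            "TerminationProtected": False,          # termination protection disabled
        },
        Steps=[{
            "Name": "run-hive-script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hive-script", "--run-hive-script",
                    "--args", "-f",
                    "s3://example-bucket/scripts/daily_batch.hql",  # placeholder script path
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]
```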


A company wants to improve the data load time of a sales data dashboard. Data has been collected as .csv files and stored within an Amazon S3 bucket that is partitioned by date. The data is then loaded to an Amazon Redshift data warehouse for frequent analysis. The data volume is up to 500 GB per day.
Which solution will improve the data loading performance?

  • A. Compress .csv files and use an INSERT statement to ingest data into Amazon Redshift.
  • B. Split large .csv files, then use a COPY command to load data into Amazon Redshift.
  • C. Use Amazon Kinesis Data Firehose to ingest data into Amazon Redshift.
  • D. Load the .csv files in an unsorted key order and vacuum the table in Amazon Redshift.

A = wrong, compressing means downloading the files from S3 and recompressing them, which is time-consuming, and files should be loaded with COPY, not INSERT. C = wrong, Kinesis Data Firehose will not improve load performance here. D = wrong, VACUUM reclaims storage and re-sorts rows; it does not speed up loading. This is a question about parallel loading: split the files so a single COPY can load them in parallel across the cluster's slices.
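As a rough illustration of option B (assuming the Redshift Data API, with made-up table, bucket, and role names), one COPY issued against the prefix that holds the split .csv files lets Redshift load them in parallel:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# One COPY over the partitioned prefix; Redshift reads the split .csv files
# in parallel, one or more per slice.
copy_sql = """
COPY sales
FROM 's3://example-sales-bucket/2023/06/01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
FORMAT AS CSV;
"""

# Submit the statement to a placeholder cluster via the Redshift Data API.
redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="loader",
    Sql=copy_sql,
)
```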


A company analyzes its data in an Amazon Redshift data warehouse, which currently has a cluster of three dense storage nodes. Due to a recent business acquisition, the company needs to load an additional 4 TB of user data into Amazon Redshift. The engineering team will combine all the user data and apply complex calculations that require I/O intensive resources. The company needs to adjust the cluster’s capacity to support the change in analytical and storage requirements.
Which solution meets these requirements?

  • A. Resize the cluster using elastic resize with dense compute nodes.
  • B. Resize the cluster using classic resize with dense compute nodes.
  • C. Resize the cluster using elastic resize with dense storage nodes.
  • D. Resize the cluster using classic resize with dense storage nodes.

Note: This is a very bad question because it hinges on a capability that was updated recently. C and D are immediately out because DS (dense storage) nodes use HDDs, which do not offer good I/O. Choosing between classic and elastic resize is harder: an update shortly before the Data Analytics certification launched made elastic resize the go-to option for resizing, including changing the node type. It is safer to go with the updated behavior since the change predates the launch, although there is no guarantee.
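For what it's worth, an elastic resize to dense compute nodes is a single Redshift API call; the sketch below uses placeholder identifiers and node counts:

```python
import boto3

redshift = boto3.client("redshift")

# Elastic resize (Classic=False): change node type and count in place,
# typically completing much faster than a classic resize.
redshift.resize_cluster(
    ClusterIdentifier="example-cluster",  # placeholder cluster name
    NodeType="dc2.8xlarge",               # dense compute for I/O intensive work
    NumberOfNodes=4,                      # placeholder node count
    Classic=False,                        # request an elastic resize
)
```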


A company stores its sales and marketing data that includes personally identifiable information (PII) in Amazon S3. The company allows its analysts to launch their own Amazon EMR cluster and run analytics reports with the data. To meet compliance requirements, the company must ensure the data is not publicly accessible throughout this process. A data engineer has secured Amazon S3 but must ensure the individual EMR clusters created by the analysts are not exposed to the public internet.
Which solution should the data engineer use to meet this compliance requirement with the LEAST amount of effort?

  • A. Create an EMR security configuration and ensure the security configuration is associated with the EMR clusters when they are created.
  • B. Check the security group of the EMR clusters regularly to ensure it does not allow inbound traffic from IPv4 0.0.0.0/0 or IPv6 ::/0.
  • C. Enable the block public access setting for Amazon EMR at the account level before any EMR cluster is created.
  • D. Use AWS WAF to block public internet access to the EMR clusters across the board.

Note: This is a textbook question — the account-level block public access setting covers every EMR cluster with a single configuration, which makes it the least-effort option.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-block-public-access.html
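The account-level setting in option C corresponds to one EMR API call. A sketch, with the port 22 exemption shown only as an example of an allowed exception:

```python
import boto3

emr = boto3.client("emr")

# Turn on EMR block public access for the whole account/Region.
# A cluster whose security group would allow inbound traffic from
# 0.0.0.0/0 or ::/0 on a non-exempted port is then blocked from launching.
emr.put_block_public_access_configuration(
    BlockPublicAccessConfiguration={
        "BlockPublicSecurityGroupRules": True,
        # Example exemption: leave the list empty to exempt nothing.
        "PermittedPublicSecurityGroupRuleRanges": [
            {"MinRange": 22, "MaxRange": 22},
        ],
    }
)
```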


A financial company uses Amazon S3 as its data lake and has set up a data warehouse using a multi-node Amazon Redshift cluster. The data files in the data lake are organized in folders based on the data source of each data file. All the data files are loaded to one table in the Amazon Redshift cluster using a separate COPY command for each data file location. With this approach, loading all the data files into Amazon Redshift takes a long time to complete. Users want a faster solution with little or no increase in cost while maintaining the segregation of the data files in the S3 data lake.
Which solution meets these requirements?

  • A. Use Amazon EMR to copy all the data files into one folder and issue a COPY command to load the data into Amazon Redshift.
  • B. Load all the data files in parallel to Amazon Aurora, and run an AWS Glue job to load the data into Amazon Redshift.
  • C. Use an AWS Glue job to copy all the data files into one folder and issue a COPY command to load the data into Amazon Redshift.
  • D. Create a manifest file that contains the data file locations and issue a COPY command to load the data into Amazon Redshift.

A = wrong, no segregation and increased cost. B = wrong, no segregation, unnecessary work, and increased cost. C = wrong, no segregation and increased cost. This is a question about how the COPY command works: in general you should issue a single COPY, because Redshift then loads the files in parallel; multiple COPY commands against the same table force Redshift to load them sequentially.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-single-copy-command.html

https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#copy-command-examples-manifest
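A minimal sketch of option D: write a manifest that lists files from each source folder, then run one COPY with the MANIFEST option. The bucket, folder, table, and role names are invented for the example, and the statement is submitted through the Redshift Data API:

```python
import json
import boto3

s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

# Manifest listing files from each data-source folder; the folder layout
# in the data lake stays untouched.
manifest = {
    "entries": [
        {"url": "s3://example-datalake/source-a/part-0000.csv", "mandatory": True},
        {"url": "s3://example-datalake/source-b/part-0000.csv", "mandatory": True},
        {"url": "s3://example-datalake/source-c/part-0000.csv", "mandatory": True},
    ]
}
s3.put_object(
    Bucket="example-datalake",                 # placeholder bucket
    Key="manifests/daily.manifest",
    Body=json.dumps(manifest),
)

# A single COPY reads every file listed in the manifest in parallel.
redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="loader",
    Sql=(
        "COPY combined_table "
        "FROM 's3://example-datalake/manifests/daily.manifest' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role' "
        "MANIFEST FORMAT AS CSV;"
    ),
)
```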


A company’s marketing team has asked for help in identifying a high performing long-term storage service for their data based on the following requirements:
– The data size is approximately 32 TB uncompressed.
– There is a low volume of single-row inserts each day.
– There is a high volume of aggregation queries each day.
– Multiple complex joins are performed.
– The queries typically involve a small subset of the columns in a table.
Which storage service will provide the MOST performant solution?

  • A. Amazon Aurora MySQL
  • B. Amazon Redshift
  • C. Amazon Neptune
  • D. Amazon Elasticsearch

A = wrong, Aurora is built for OLTP, not OLAP. C = wrong, Neptune is a graph database and not relevant here. D = wrong, complex joins are expensive in Elasticsearch. Redshift's columnar storage fits aggregation queries that read only a small subset of columns.


A technology company is creating a dashboard that will visualize and analyze time-sensitive data. The data will come in through Amazon Kinesis Data Firehose with the buffer interval set to 60 seconds. The dashboard must support near-real-time data.
Which visualization solution will meet these requirements?

  • A. Select Amazon Elasticsearch Service (Amazon ES) as the endpoint for Kinesis Data Firehose. Set up a Kibana dashboard using the data in Amazon ES with the desired analyses and visualizations.
  • B. Select Amazon S3 as the endpoint for Kinesis Data Firehose. Read data into an Amazon SageMaker Jupyter notebook and carry out the desired analyses and visualizations.
  • C. Select Amazon Redshift as the endpoint for Kinesis Data Firehose. Connect Amazon QuickSight with SPICE to Amazon Redshift to create the desired analyses and visualizations.
  • D. Select Amazon S3 as the endpoint for Kinesis Data Firehose. Use AWS Glue to catalog the data and Amazon Athena to query it. Connect Amazon QuickSight with SPICE to Athena to create the desired analyses and visualizations.

B = wrong, this is not near-real-time. C = wrong, SPICE cannot automatically and continuously refresh the data. D = wrong, in addition to the SPICE issue, the Glue/Athena path is not real-time.
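For context, a Firehose delivery stream pointing at an Amazon ES domain with a 60-second buffer interval looks roughly like the sketch below; all ARNs, names, and index settings are placeholders:

```python
import boto3

firehose = boto3.client("firehose")

# Deliver records to Amazon Elasticsearch Service with a 60-second buffer,
# so the Kibana dashboard stays within about a minute of real time.
firehose.create_delivery_stream(
    DeliveryStreamName="dashboard-stream",          # placeholder name
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/example-domain",
        "IndexName": "dashboard-metrics",
        "IndexRotationPeriod": "OneDay",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
            "BucketARN": "arn:aws:s3:::example-backup-bucket",
        },
    },
)
```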


A financial company uses Apache Hive on Amazon EMR for ad-hoc queries. Users are complaining of sluggish performance.
A data analyst notes the following:
– Approximately 90% of queries are submitted 1 hour after the market opens.
– Hadoop Distributed File System (HDFS) utilization never exceeds 10%.

Which solution would help address the performance issues?

  • A. Create instance fleet configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch CapacityRemainingGB metric. Create an automatic scaling policy to scale in the instance fleet based on the CloudWatch CapacityRemainingGB metric.
  • B. Create instance fleet configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric. Create an automatic scaling policy to scale in the instance fleet based on the CloudWatch YARNMemoryAvailablePercentage metric.
  • C. Create instance group configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch CapacityRemainingGB metric. Create an automatic scaling policy to scale in the instance groups based on the CloudWatch CapacityRemainingGB metric.
  • D. Create instance group configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric. Create an automatic scaling policy to scale in the instance groups based on the CloudWatch YARNMemoryAvailablePercentage metric.

A and B = wrong, instance fleets do not support automatic scaling. C = wrong, HDFS utilization never exceeds 10%, so scaling on CapacityRemainingGB would never trigger; the pressure is on YARN memory during the post-open query spike.
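A trimmed-down sketch of the policy in option D, attached to an existing task instance group with put_auto_scaling_policy; the cluster ID, instance group ID, thresholds, and capacities are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Scale the task instance group out when available YARN memory drops below
# 15%, and back in when it stays above 75%.
emr.put_auto_scaling_policy(
    ClusterId="j-EXAMPLECLUSTER",            # placeholder cluster ID
    InstanceGroupId="ig-EXAMPLETASKGROUP",   # placeholder instance group ID
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
        "Rules": [
            {
                "Name": "scale-out-on-low-yarn-memory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 2,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            },
            {
                "Name": "scale-in-on-high-yarn-memory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": -2,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "GREATER_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 75.0,
                        "Unit": "PERCENT",
                    }
                },
            },
        ],
    },
)
```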
