AWS Certified Machine Learning – Specialty Dump 05

A company is running a machine learning prediction service that generates 100 TB of predictions every day. A Machine Learning Specialist must generate a visualization of the daily precision-recall curve from the predictions, and forward a read-only version to the Business team.
Which solution requires the LEAST coding effort?

  • A. Run a daily Amazon EMR workflow to generate precision-recall data, and save the results in Amazon S3. Give the Business team read-only access to S3.
  • B. Generate daily precision-recall data in Amazon QuickSight, and publish the results in a dashboard shared with the Business team.
  • C. Run a daily Amazon EMR workflow to generate precision-recall data, and save the results in Amazon S3. Visualize the arrays in Amazon QuickSight, and publish them in a dashboard shared with the Business team.
  • D. Generate daily precision-recall data in Amazon ES, and publish the results in a dashboard shared with the Business team.

A = wrong, no visualization. B = wrong, QuickSight cannot generate the precision-recall data itself; the data has to be fed to it. D = wrong, lots of effort.
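For reference, this is roughly the computation the daily EMR workflow in option C would perform before QuickSight visualizes the output. A minimal sketch, assuming a CSV of predictions with `label` and `score` columns (in practice an aggregated extract, since the raw output is 100 TB); the bucket and paths are placeholders:

```python
# Compute daily precision-recall arrays and write them to S3 for QuickSight.
# Reading/writing s3:// paths with pandas requires the s3fs package.
import pandas as pd
from sklearn.metrics import precision_recall_curve

df = pd.read_csv("s3://my-bucket/predictions/2024-01-01.csv")  # hypothetical path
precision, recall, thresholds = precision_recall_curve(df["label"], df["score"])

out = pd.DataFrame({
    "threshold": thresholds,
    "precision": precision[:-1],   # drop the final point so lengths match thresholds
    "recall": recall[:-1],
})
out.to_csv("s3://my-bucket/pr-curves/2024-01-01.csv", index=False)  # QuickSight dataset
```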


A Machine Learning Specialist is preparing data for training on Amazon SageMaker. The Specialist is using one of the SageMaker built-in algorithms for the training. The dataset is stored in .CSV format and is transformed into a numpy.array, which appears to be negatively affecting the speed of the training.
What should the Specialist do to optimize the data for training on SageMaker?

  • A. Use the SageMaker batch transform feature to transform the training data into a DataFrame.
  • B. Use AWS Glue to compress the data into the Apache Parquet format.
  • C. Transform the dataset into the RecordIO protobuf format.
  • D. Use the SageMaker hyperparameter optimization feature to automatically optimize the data.

Note: This is a free-score question. The SageMaker built-in algorithms are optimized for RecordIO-encoded protobuf input, so converting the CSV/numpy data to that format speeds up training.
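A minimal sketch of the conversion using the SageMaker Python SDK helper; the file name and the assumption that the label sits in the first column are placeholders:

```python
# Convert a CSV-derived numpy array to RecordIO protobuf for a built-in algorithm.
import io
import numpy as np
import sagemaker.amazon.common as smac

data = np.loadtxt("train.csv", delimiter=",", dtype="float32")
features, labels = data[:, 1:], data[:, 0]   # assumes the label is the first column

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)
# buf can now be uploaded to S3, e.g. boto3 s3.upload_fileobj(buf, bucket, key)
```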


A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier:
Total number of images available = 1,000
Test set images = 100 (constant test set)
The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners.
Which techniques can be used by the ML Specialist to improve this specific test error?

  • A. Increase the training data by adding variation in rotation for training images.
  • B. Increase the number of epochs for model training
  • C. Increase the number of layers for the neural network.
  • D. Increase the dropout rate for the second-to-last layer.

Note: This question is about augmenting the training data. With a built-in algorithm this can be done automatically with a single parameter. Tuning the model won’t make it learn to recognize rotated images; you have to train it on transformed data, as in the sketch below.
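One way to add rotation variation at training time, sketched with Keras (the SageMaker built-in image classification algorithm exposes something similar through its `augmentation_type` hyperparameter); the directory and image size are placeholders:

```python
# Augment training images with rotations and flips so upside-down cats are seen in training.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=180,        # cover upside-down cats
    horizontal_flip=True,
    vertical_flip=True,
    rescale=1.0 / 255,
)
train_iter = augmenter.flow_from_directory(
    "train/",                  # hypothetical directory of labeled images
    target_size=(224, 224),
    batch_size=32,
)
```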


A Machine Learning Specialist needs to be able to ingest streaming data and store it in Apache Parquet files for exploration and analysis.
Which of the following services would both ingest and store this data in the correct format?

  • A. AWS DMS
  • B. Amazon Kinesis Data Streams
  • C. Amazon Kinesis Data Firehose
  • D. Amazon Kinesis Data Analytics

Note: When you see “ingest and store”, go for Kinesis Data Firehose — it can convert incoming records to Apache Parquet (or ORC) before delivering them to Amazon S3.
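A rough boto3 sketch of a Firehose delivery stream with record format conversion enabled. All names, ARNs, and the Glue table (which supplies the schema) are placeholders:

```python
# Create a Firehose delivery stream that converts incoming JSON records to Parquet in S3.
import boto3

firehose = boto3.client("firehose")
firehose.create_delivery_stream(
    DeliveryStreamName="events-to-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {      # schema comes from a Glue Data Catalog table
                "DatabaseName": "analytics",
                "TableName": "events",
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            },
        },
    },
)
```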


A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of
100,000 non-fraudulent observations and 1,000 fraudulent observations.
The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist has been asked to reduce the number of false negatives.

Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.)

  • A. Change the XGBoost eval_metric parameter to optimize based on rmse instead of error.
  • B. Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights.
  • C. Increase the XGBoost max_depth parameter because the model is currently underfitting the data.
  • D. Change the XGBoost eval_metric parameter to optimize based on AUC instead of error.
  • E. Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.

A = wrong, RMSE is a regression metric and does not apply to this classification problem. C and E = wrong, over- or underfitting can only be judged by comparing training results with validation results, and nothing here indicates either; with a 100:1 class imbalance, 99.1% accuracy says little on its own — the real issue is the missed fraud cases, which scale_pos_weight and AUC address.
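A sketch of answers B and D with the XGBoost Python API; the class ratio comes from the question, the remaining values are placeholders:

```python
# Weight the rare fraud class more heavily and evaluate with AUC so the model
# is pushed to reduce missed fraud (false negatives).
import xgboost as xgb

scale_pos_weight = 100_000 / 1_000   # negatives / positives = 100

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",            # answer D: threshold-independent metric
    "scale_pos_weight": scale_pos_weight,   # answer B
    "max_depth": 6,
}
# dtrain / dvalid would be xgboost.DMatrix objects built from the labeled data:
# booster = xgb.train(params, dtrain, num_boost_round=200,
#                     evals=[(dvalid, "validation")])
```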


A Machine Learning Specialist is assigned a TensorFlow project using Amazon SageMaker for training, and needs to continue working for an extended period with no Wi-Fi access.
Which approach should the Specialist use to continue working?

  • A. Install Python 3 and boto3 on their laptop and continue the code development using that environment.
  • B. Download the TensorFlow Docker container used in Amazon SageMaker from GitHub to their local environment, and use the Amazon SageMaker Python SDK to test the code.
  • C. Download TensorFlow from tensorflow.org to emulate the TensorFlow kernel in the SageMaker environment.
  • D. Download the SageMaker notebook to their local environment, then install Jupyter Notebooks on their laptop and continue the development in a local notebook.

A = wrong, boto3 only calls the SageMaker API, which is unreachable offline. C = wrong, downloading TensorFlow alone does not emulate the SageMaker training environment. D = wrong, a local notebook by itself still needs the SageMaker API to train.

https://aws.amazon.com/blogs/machine-learning/use-the-amazon-sagemaker-local-mode-to-train-on-your-notebook-instance/
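A sketch of SageMaker local mode (described in the blog post above): the same TensorFlow estimator API, but `instance_type="local"` runs the framework container on the laptop instead of calling the SageMaker training service. The entry point, role, framework version, and data path are placeholders:

```python
# Train the TensorFlow script locally using the SageMaker Python SDK's local mode.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/sagemaker-role",
    instance_count=1,
    instance_type="local",        # "local_gpu" if the laptop has a GPU
    framework_version="2.11",
    py_version="py39",
)
estimator.fit("file:///home/user/data/train")   # local data, no S3 required
```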


A Machine Learning Specialist is working with a large cybersecurity company that manages security events in real time for companies around the world. The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies on the data as it is being ingested. The company also wants to be able to save the results in its data lake for later processing and analysis.
What is the MOST efficient way to accomplish these tasks?

  • A. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3.
  • B. Ingest the data into Apache Spark Streaming using Amazon EMR, and use Spark MLlib with k-means to perform anomaly detection. Then store the results in an Apache Hadoop Distributed File System (HDFS) using Amazon EMR with a replication factor of three as the data lake.
  • C. Ingest the data and store it in Amazon S3. Use AWS Batch along with the AWS Deep Learning AMIs to train a k-means model using TensorFlow on the data in Amazon S3.
  • D. Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand transform the new data. Then use the built-in Random Cut Forest (RCF) model within Amazon SageMaker to detect anomalies in the data.

Note: When you see anomaly detection, think Random Cut Forest (RCF). Between A and D, D is not real time because the Glue job is triggered on demand. A is vaguely worded, in typical AWS style; arguably the first Kinesis Data Firehose should be a Kinesis Data Stream, but A is still the most efficient option.
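The core of option A is a Kinesis Data Analytics SQL application. A sketch of the application code (shown here as a Python string; stream and column names are assumptions) using the built-in RANDOM_CUT_FOREST function:

```python
# SQL that a Kinesis Data Analytics application would run to score each event.
ANOMALY_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "eventId"       VARCHAR(64),
    "bytesSent"     INTEGER,
    "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "eventId", "bytesSent", ANOMALY_SCORE
    FROM TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")));
"""
# The application's output stream is attached to a Kinesis Data Firehose
# delivery stream that lands the scored events in the S3 data lake.
```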


A Data Scientist wants to gain real-time insights into a data stream of GZIP files.
Which solution would allow the use of SQL to query the stream with the LEAST latency?

  • A. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.
  • B. AWS Glue with a custom ETL script to transform the data.
  • C. An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.
  • D. Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.

B, C, and D = wrong, they all add significant latency and none of them lets you run SQL directly on the stream. Kinesis Data Analytics can invoke a Lambda function to decompress the GZIP records before the SQL runs (see the link below).

https://aws.amazon.com/about-aws/whats-new/2017/10/amazon-kinesis-analytics-can-now-pre-process-data-prior-to-running-sql-queries/
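A minimal sketch of such a pre-processing Lambda function, following the Kinesis Data Analytics record-transformation contract (recordId / result / data):

```python
# Pre-process records for Kinesis Data Analytics: base64-decode, gunzip, and
# return each record so the SQL application can query it.
import base64
import gzip

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = gzip.decompress(base64.b64decode(record["data"]))
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```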


A retail company intends to use machine learning to categorize new products. A labeled dataset of current products was provided to the Data Science team. The dataset includes 1,200 products. The labeled dataset has 15 features for each product, such as title, dimensions, weight, and price. Each product is labeled as belonging to one of six categories such as books, games, electronics, and movies.
Which model should be used for categorizing new products using the provided dataset for training?

  • A. An XGBoost model where the objective parameter is set to multi:softmax
  • B. A deep convolutional neural network (CNN) with a softmax activation function for the last layer
  • C. A regression forest where the number of trees is set equal to the number of product categories
  • D. A DeepAR forecasting model based on a recurrent neural network (RNN)

B = wrong, CNNs are for image data, not tabular features. C = wrong, a regression forest is for regression, and tying the number of trees to the number of categories makes no sense. D = wrong, DeepAR is for time-series forecasting.
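A sketch of answer A, multiclass XGBoost on the 15 tabular features; dataset handles are placeholders:

```python
# Multiclass classification of products into the six categories.
import xgboost as xgb

params = {
    "objective": "multi:softmax",   # outputs the predicted class directly
    "num_class": 6,                 # books, games, electronics, movies, ...
    "eval_metric": "merror",
}
# dtrain / dvalid are xgboost.DMatrix objects built from the 1,200 labeled products:
# booster = xgb.train(params, dtrain, num_boost_round=100,
#                     evals=[(dvalid, "validation")])
```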


A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor, and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset.
Which tool should be used to improve the validation accuracy?

  • A. Amazon Comprehend syntax analysis and entity detection
  • B. Amazon SageMaker BlazingText cbow mode
  • C. Natural Language Toolkit (NLTK) stemming and stop word removal
  • D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer

A = wrong, syntax analysis and entity detection won’t help with low word frequency. B = wrong, cbow word embeddings don’t address the problem directly. C = wrong, stemming and removing stop words will not help with already low word frequency.
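A sketch of the author’s pick (D): TF-IDF weights terms by how informative they are across documents, which tends to produce better features than raw counts when the vocabulary is rich and average word frequency is low. The example documents are made up:

```python
# Turn raw text into TF-IDF features for the sentiment classifier.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the battery life is great",
    "battery drains too fast",
    "great phone, love it",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, stop_words="english")
X = vectorizer.fit_transform(texts)   # sparse matrix fed to the downstream classifier
```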


A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitudes of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model.
What should the Specialist do to prepare the data for model training?

  • A. Apply quantile binning to group the data into categorical bins to keep any relationships in the data by replacing the magnitude with distribution.
  • B. Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude.
  • C. Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude.
  • D. Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate new features of a similar magnitude.

Note: The feature magnitudes vary too much — standardize (normalize) them so that no single feature dominates the model.
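A sketch of answer C; the economic indicator columns are made up:

```python
# Standardize each feature to mean 0 and variance 1 so large-magnitude
# indicators (e.g. GDP in billions) do not dominate small ones (e.g. rates).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "gdp_billions": [21137.0, 1839.0, 397.0],
    "unemployment_rate": [3.7, 5.2, 7.1],
})
scaled = StandardScaler().fit_transform(df)   # each column: mean 0, variance 1
```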


A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only.
How should the Machine Learning Specialist transform the dataset to minimize query runtime?

  • A. Convert the records to Apache Parquet format.
  • B. Convert the records to JSON format.
  • C. Convert the records to GZIP CSV format.
  • D. Convert the records to XML format.

Note: Parquet is a columnar format, so Athena scans only the 5 to 10 columns a query touches instead of every 1.5 MB record. Free score.
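A minimal sketch of the conversion; in practice this would run in Glue or EMR at this scale, but pandas + pyarrow shows the idea. The S3 paths are placeholders:

```python
# Rewrite the CSV records as compressed Parquet for Athena.
# Reading/writing s3:// paths requires s3fs; to_parquet requires pyarrow.
import pandas as pd

df = pd.read_csv("s3://my-bucket/raw/records.csv")
df.to_parquet(
    "s3://my-bucket/curated/records.parquet",
    engine="pyarrow",
    compression="snappy",
    index=False,
)
```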


A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The workflow consists of the following processes:

  • Start the workflow as soon as data is uploaded to Amazon S3.
  • When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets with multiple terabyte-sized datasets already stored in Amazon S3.
  • Store the results of joining datasets in Amazon S3.
  • If one of the jobs fails, send a notification to the Administrator.

Which configuration will meet these requirements?

  • A. Use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in Amazon S3. Use AWS Glue to join the datasets. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
  • B. Develop the ETL workflow using AWS Lambda to start an Amazon SageMaker notebook instance. Use a lifecycle configuration script to join the datasets and persist the results in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
  • C. Develop the ETL workflow using AWS Batch to trigger the start of ETL jobs when data is uploaded to Amazon S3. Use AWS Glue to join the datasets in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
  • D. Use AWS Lambda to chain other Lambda functions to read and join the datasets in Amazon S3 as soon as the data is uploaded to Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.

B = wrong, SageMaker notebook instances and lifecycle configuration scripts are not meant for orchestrating ETL. C = wrong, AWS Batch is not the right tool for orchestrating this workflow. D = wrong, chaining Lambda functions to join terabyte-sized datasets is impractical.
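A rough sketch of the Step Functions state machine behind answer A, written as Amazon States Language in a Python dict (here the failure notification is wired directly through a Catch to SNS to keep the sketch self-contained; option A routes it through a CloudWatch alarm instead). The Glue job name and topic ARN are placeholders, and a Lambda triggered by the S3 upload would start the execution:

```python
# Step Functions definition: run the Glue join job synchronously and notify
# the Administrator via SNS if it fails.
import json

state_machine = {
    "StartAt": "JoinDatasets",
    "States": {
        "JoinDatasets": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "join-daily-datasets"},
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyAdmin"}],
            "End": True,
        },
        "NotifyAdmin": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-failures",
                "Message": "Daily ETL join job failed",
            },
            "End": True,
        },
    },
}
print(json.dumps(state_machine, indent=2))
```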


An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen.
Which combination of algorithms would provide the appropriate insights? (Select TWO.)

  • A. The factorization machines (FM) algorithm
  • B. The Latent Dirichlet Allocation (LDA) algorithm
  • C. The principal component analysis (PCA) algorithm
  • D. The k-means algorithm
  • E. The Random Cut Forest (RCF) algorithm

Note: PCA to reduce the dimensionality of the ~500 answers, then k-means to cluster citizens into groups for the province- and city-level insights.
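A sketch of that two-step approach with scikit-learn (the SageMaker built-in PCA and k-means algorithms work the same way conceptually); the random data and the component/cluster counts are placeholders:

```python
# Compress ~500 census answers into a few components, then segment citizens.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

responses = np.random.rand(10_000, 500)          # stand-in for the census answers
X = StandardScaler().fit_transform(responses)

components = PCA(n_components=20).fit_transform(X)        # dimensionality reduction
segments = KMeans(n_clusters=8, n_init=10).fit_predict(components)   # clustering
```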


A large consumer goods manufacturer has the following products on sale:

  • 34 different toothpaste variants
  • 48 different toothbrush variants
  • 43 different mouthwash variants

The entire sales history of all these products is available in Amazon S3. Currently, the company is using custom-built autoregressive integrated moving average
(ARIMA) models to forecast demand for these products. The company wants to predict the demand for a new product that will soon be launched.

Which solution should a Machine Learning Specialist apply?

  • A. Train a custom ARIMA model to forecast demand for the new product.
  • B. Train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product.
  • C. Train an Amazon SageMaker k-means clustering algorithm to forecast demand for the new product.
  • D. Train a custom XGBoost model to forecast demand for the new product.

A = wrong, ARIMA cannot forecast a product with no sales history. C = wrong, k-means is for clustering, not forecasting. D = wrong, XGBoost is a general-purpose tabular algorithm, not designed for cold-start time-series forecasting. DeepAR trains a single model across all related products, so it can forecast a new product from the behavior of similar ones.
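A sketch of training the SageMaker built-in DeepAR algorithm on the 125 related product time series; the region, role, bucket, instance type, and hyperparameter values are placeholders:

```python
# Train one DeepAR model across all product demand series stored in S3 (JSON Lines).
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("forecasting-deepar", session.boto_region_name, version="1")

deepar = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/sagemaker-role",
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/deepar/output",
    sagemaker_session=session,
)
deepar.set_hyperparameters(
    time_freq="W",               # weekly demand
    context_length=52,
    prediction_length=12,
    epochs=100,
)
deepar.fit({"train": "s3://my-bucket/deepar/train/"})
```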
