AWS Certified Machine Learning – Specialty Dump 03

A Machine Learning Specialist built an image classification deep learning model. However, the Specialist ran into an overfitting problem in which the training and testing accuracies were 99% and 75%, respectively.
How should the Specialist address this issue and what is the reason behind it?

  • A. The learning rate should be increased because the optimization process was trapped at a local minimum.
  • B. The dropout rate at the flatten layer should be increased because the model is not generalized enough.
  • C. The dimensionality of the dense layer next to the flatten layer should be increased because the model is not complex enough.
  • D. The epoch number should be increased because the optimization process was terminated before it reached the global minimum.

A = wrong; the gap between 99% training and 75% testing accuracy is overfitting, not the optimizer stagnating in a local minimum. C = wrong; adding capacity to the dense layer makes the model more complex and would make overfitting worse. D = wrong; 99% training accuracy shows the optimization already converged, so more epochs will not help.
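
A minimal sketch of the accepted fix (B), assuming a Keras-style model since the question does not name a framework; the layer sizes and the 0.5 dropout rate are illustrative only:

```python
# Hedged sketch: add/raise dropout right after the flatten layer to fight overfitting.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),          # raise this rate when training accuracy >> testing accuracy
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```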


A Machine Learning team uses Amazon SageMaker to train an Apache MXNet handwritten digit classifier model using a research dataset. The team wants to receive a notification when the model is overfitting. Auditors want to view the Amazon SageMaker log activity report to ensure there are no unauthorized API calls.
What should the Machine Learning team do to address the requirements with the least amount of code and fewest steps?

  • A. Implement an AWS Lambda function to log Amazon SageMaker API calls to Amazon S3. Add code to push a custom metric to Amazon CloudWatch. Create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting.
  • B. Use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. Add code to push a custom metric to Amazon CloudWatch. Create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting.
  • C. Implement an AWS Lambda function to log Amazon SageMaker API calls to AWS CloudTrail. Add code to push a custom metric to Amazon CloudWatch. Create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting.
  • D. Use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. Set up Amazon SNS to receive a notification when the model is overfitting

Note: when you see “log API calls”, go for CloudTrail. Overfitting is a custom metric that CloudWatch does not provide out of the box, so the training code has to push it; that rules out D.
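
A minimal sketch of the custom-metric half of answer B, using boto3; the namespace, metric name, threshold, and SNS topic ARN are placeholders, not values from the question:

```python
# Hedged sketch: publish an "overfitting gap" metric from the training code and
# alarm on it through SNS.
import boto3

cloudwatch = boto3.client("cloudwatch")

# Inside the training loop: push the train/validation accuracy difference.
cloudwatch.put_metric_data(
    Namespace="MXNetTraining",
    MetricData=[{"MetricName": "OverfittingGap", "Value": 0.24, "Unit": "None"}],
)

# One-time setup: notify an SNS topic when the gap stays above the threshold.
cloudwatch.put_metric_alarm(
    AlarmName="mxnet-overfitting",
    Namespace="MXNetTraining",
    MetricName="OverfittingGap",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0.1,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:model-alerts"],
)
```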


A Machine Learning Specialist is building a prediction model for a large number of features using linear models, such as linear regression and logistic regression.
During exploratory data analysis, the Specialist observes that many features are highly correlated with each other. This may make the model unstable.
What should be done to reduce the impact of having such a large number of features?

  • A. Perform one-hot encoding on highly correlated features.
  • B. Use matrix multiplication on highly correlated features.
  • C. Create a new feature space using principal component analysis (PCA)
  • D. Apply the Pearson correlation coefficient.

Note: this is about reducing the dimensionality of a correlated feature space; use PCA.
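
A minimal scikit-learn sketch of answer C; the random matrix and the 95% variance target are illustrative stand-ins:

```python
# Hedged sketch: collapse correlated features into uncorrelated principal components
# before fitting a linear model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 40)                    # stand-in for the real feature matrix
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=0.95)                   # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                         # fewer, mutually uncorrelated columns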


A Machine Learning Specialist is implementing a full Bayesian network on a dataset that describes public transit in New York City. One of the random variables is discrete, and represents the number of minutes New Yorkers wait for a bus given that the buses cycle every 10 minutes, with a mean of 3 minutes.
Which prior probability distribution should the ML Specialist use for this variable?

  • A. Poisson distribution
  • B. Uniform distribution
  • C. Normal distribution
  • D. Binomial distribution

Note: the Poisson distribution models a discrete count when you know its average (3 minutes); given that mean and the 10-minute cycle, you can calculate how likely each possible wait time is.
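
A tiny sketch of what that prior looks like, using SciPy; the 0–10 range simply mirrors the 10-minute cycle:

```python
# Hedged sketch: a Poisson prior with mean 3 for the discrete "minutes waited" variable.
from scipy.stats import poisson

mean_wait = 3
for k in range(0, 11):                       # 0..10 minutes, matching the bus cycle
    print(k, round(poisson.pmf(k, mu=mean_wait), 3))
```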


A Data Science team within a large company uses Amazon SageMaker notebooks to access data stored in Amazon S3 buckets. The IT Security team is concerned that internet-enabled notebook instances create a security vulnerability where malicious code running on the instances could compromise data privacy.
The company mandates that all instances stay within a secured VPC with no internet access, and data communication traffic must stay within the AWS network.
How should the Data Science team configure the notebook instance placement to meet these requirements?

  • A. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Place the Amazon SageMaker endpoint and S3 buckets within the same VPC.
  • B. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Use IAM policies to grant access to Amazon S3 and Amazon SageMaker.
  • C. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has S3 VPC endpoints and Amazon SageMaker VPC endpoints attached to it.
  • D. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has a NAT gateway and an associated security group allowing only outbound connections to Amazon S3 and Amazon SageMaker.

A = wrong; there is no such thing as placing an S3 bucket inside a VPC. B = wrong; IAM policies control authorization, not the network path, so traffic could still leave the AWS network. D = wrong; a NAT gateway gives the instances internet access, which the requirement forbids.
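
A minimal boto3 sketch of the endpoint setup in answer C; the VPC, subnet, route table, and security group IDs and the Region are placeholders:

```python
# Hedged sketch: attach an S3 gateway endpoint and SageMaker interface endpoints to
# the VPC so notebook traffic to these services stays on the AWS network.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint for S3, added to the private subnet's route table.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],
)

# Interface endpoints for the SageMaker API and runtime.
for service in ("sagemaker.api", "sagemaker.runtime"):
    ec2.create_vpc_endpoint(
        VpcId="vpc-0123456789abcdef0",
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        VpcEndpointType="Interface",
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )
```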


A Machine Learning Specialist has created a deep learning neural network model that performs well on the training data but performs poorly on the test data.
Which of the following methods should the Specialist consider using to correct this? (Choose three.)

  • A. Decrease regularization.
  • B. Increase regularization.
  • C. Increase dropout.
  • D. Decrease dropout.
  • E. Increase feature combinations.
  • F. Decrease feature combinations.

To prevent overfitting, increase regularization. Dropout is a form of regularization, so increase it as well. Also decrease feature combinations: a simpler, less flexible model is less likely to memorize training-set specifics. (A sketch of all three fixes follows the reference link below.)

“Decrease feature combinations” sounds ambiguous, but it is the wording used in the official document: it means “fewer feature-combinations” (fewer engineered combination features), not “combining fewer features”.

https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
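
A minimal Keras-style sketch combining the three accepted fixes (B, C, F); the layer sizes, L2 strength, dropout rate, and feature count are illustrative assumptions:

```python
# Hedged sketch: more L2 regularization, more dropout, and a smaller input after
# pruning engineered feature combinations.
from tensorflow.keras import layers, models, regularizers

n_features = 50                                               # after dropping feature combinations
model = models.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3),    # increase regularization
                 input_shape=(n_features,)),
    layers.Dropout(0.4),                                      # increase dropout
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```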


A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data.
The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards.
Which solution should the Data Scientist build to satisfy the requirements?

  • A. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform the data to Apache Parquet or ORC format using the AWS Glue Data Catalog before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
  • B. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and writes the data to a processed data location in Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
  • C. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and inserts it into an Amazon RDS PostgreSQL database. Have the Analysts query and run dashboards from the RDS database.
  • D. Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to convert the records to Apache Parquet before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.

B and C = wrong; writing each JSON record to S3 individually and triggering a Lambda per object does not buffer high-velocity, real-time streaming data, and C also lands the output in RDS rather than a highly available, query-optimized columnar store. D = wrong; Kinesis Data Analytics SQL will not convert records to Parquet, and writing the converted files to S3 would require a lot of extra code. (A minimal Firehose configuration sketch follows the reference link below.)

https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/
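
A minimal boto3 sketch of the Firehose piece of answer A; the stream name, ARNs, bucket, Glue database, and table names are placeholders:

```python
# Hedged sketch: a Firehose delivery stream that converts incoming JSON to Parquet
# using a Glue Data Catalog schema before delivering to S3 (queryable via Athena).
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::analytics-data-lake",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "clickstream_db",
                "TableName": "events",
                "Region": "us-east-1",
            },
        },
    },
)
```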


An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data.
Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?

  • A. Listwise deletion
  • B. Last observation carried forward
  • C. Multiple imputation
  • D. Mean substitution

A = wrong; listwise deletion removes every record with missing data, and with 30% of a column missing that throws away a lot of information that the correlated columns could have been used to reconstruct. B = wrong; last observation carried forward reuses the previously recorded value (like using last semester’s score when you skipped this semester’s exam), which introduces a lot of bias. D = wrong; mean substitution fills the gaps with the column average, which also introduces bias and ignores the correlated columns.
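
A minimal sketch of answer C using scikit-learn’s iterative (MICE-style) imputer, which predicts missing cells from the other columns; the tiny array is illustrative and IterativeImputer is still flagged as experimental in scikit-learn:

```python
# Hedged sketch: reconstruct missing values from correlated columns.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[25.0, 50000.0, 3.0],
              [32.0, np.nan,  5.0],
              [41.0, 72000.0, np.nan],
              [29.0, 61000.0, 4.0]])

imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_filled = imputer.fit_transform(X)            # missing cells predicted from the other columns
print(X_filled)
```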


A company is setting up an Amazon SageMaker environment. The corporate data security policy does not allow communication over the internet.
How can the company enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances?

  • A. Create a NAT gateway within the corporate VPC.
  • B. Route Amazon SageMaker traffic through an on-premises network.
  • C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC.
  • D. Create VPC peering with Amazon VPC hosting Amazon SageMaker.

Note: no lengthy explanation needed. Get yourself used to relating “no internet access” to “VPC endpoints”.
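
For completeness, a minimal boto3 sketch of launching the notebook instance itself without direct internet access, so it can only reach AWS services through the VPC interface endpoints; the instance name, IDs, and role ARN are placeholders:

```python
# Hedged sketch: notebook instance in a private subnet with direct internet access disabled.
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_notebook_instance(
    NotebookInstanceName="secure-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    DirectInternetAccess="Disabled",
)
```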
