MLS-C01 : AWS Certified Machine Learning – Specialty : Part 06

  1. A technology startup is using complex deep neural networks and GPU compute to recommend the company’s products to its existing customers based upon each customer’s habits and interactions. The solution currently pulls each dataset from an Amazon S3 bucket before loading the data into a TensorFlow model pulled from the company’s Git repository that runs locally. This job then runs for several hours while continually outputting its progress to the same S3 bucket. The job can be paused, restarted, and continued at any time in the event of a failure, and is run from a central queue.

    Senior managers are concerned about the complexity of the solution’s resource management and the costs involved in repeating the process regularly. They ask for the workload to be automated so it runs once a week, starting Monday and completing by the close of business Friday.

    Which architecture should be used to scale the solution at the lowest cost?

    • Implement the solution using AWS Deep Learning Containers and run the container as a job using AWS Batch on a GPU-compatible Spot Instance
    • Implement the solution using a low-cost GPU-compatible Amazon EC2 instance and use the AWS Instance Scheduler to schedule the task
    • Implement the solution using AWS Deep Learning Containers, run the workload using AWS Fargate running on Spot Instances, and then schedule the task using the built-in task scheduler
    • Implement the solution using Amazon ECS running on Spot Instances and schedule the task using the ECS service scheduler
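
    If the Batch-on-Spot option above were implemented, the GPU requirement is declared on the Batch job definition. A minimal boto3 sketch, assuming a containerized version of the training job; the job name, image URI, and resource sizes are illustrative, not taken from the scenario.

      import boto3

      batch = boto3.client("batch")

      # Hypothetical job definition for the weekly GPU training container.
      # All names, the image URI, and the resource sizes are placeholders.
      batch.register_job_definition(
          jobDefinitionName="weekly-recommender-training",
          type="container",
          containerProperties={
              "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dl-container:latest",
              "command": ["python", "train.py"],
              "resourceRequirements": [
                  {"type": "GPU", "value": "1"},
                  {"type": "VCPU", "value": "4"},
                  {"type": "MEMORY", "value": "16384"},
              ],
          },
          retryStrategy={"attempts": 3},  # re-queue the job after a Spot interruption
      )
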
  2. A Machine Learning Specialist prepared the following graph displaying the results of k-means for k = [1..10]:

    [Figure: k-means results (elbow plot) for k = 1 to 10]

    Considering the graph, what is a reasonable selection for the optimal choice of k?

    • 1
    • 4
    • 7
    • 10
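
    The graph in this question is the classic elbow plot: within-cluster sum of squares (inertia) versus k. A minimal scikit-learn sketch on synthetic data (the question's actual data is not available) shows how such a plot is produced; the elbow is read off where the inertia stops dropping sharply.

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.datasets import make_blobs

      # Synthetic data purely for illustration; it is not the data behind the missing figure.
      X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

      for k in range(1, 11):
          km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
          print(f"k={k:2d}  inertia={km.inertia_:10.1f}")
      # Plotting inertia against k and picking the point where the curve flattens
      # gives the "elbow", i.e. a reasonable choice of k.
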
  3. A media company with a very large archive of unlabeled images, text, audio, and video footage wishes to index its assets to allow rapid identification of relevant content by the Research team. The company wants to use machine learning to accelerate the efforts of its in-house researchers who have limited machine learning expertise.

    Which is the FASTEST route to index the assets?

    • Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct categories/classes.
    • Create a set of Amazon Mechanical Turk Human Intelligence Tasks to label all footage.
    • Use Amazon Transcribe to convert speech to text. Use the Amazon SageMaker Neural Topic Model (NTM) and Object Detection algorithms to tag data into distinct categories/classes.
    • Use the AWS Deep Learning AMI and Amazon EC2 GPU instances to create custom models for audio transcription and topic modeling, and use object detection to tag data into distinct categories/classes.
  4. A Machine Learning Specialist is working for an online retailer that wants to run analytics on every customer visit, processed through a machine learning pipeline. The data needs to be ingested by Amazon Kinesis Data Streams at up to 100 transactions per second, and the JSON data blob is 100 KB in size.

    What is the MINIMUM number of shards in Kinesis Data Streams the Specialist should use to successfully ingest this data?

    • 1 shard
    • 10 shards
    • 100 shards
    • 1,000 shards
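
    The shard count follows from the Kinesis Data Streams write limits: each shard accepts up to 1 MB per second and 1,000 records per second. A minimal back-of-the-envelope calculation for the figures in the question:

      import math

      records_per_second = 100
      record_size_kb = 100

      throughput_mb_per_s = records_per_second * record_size_kb / 1024   # ~9.8 MB/s
      shards_for_throughput = math.ceil(throughput_mb_per_s / 1.0)       # 1 MB/s per shard
      shards_for_records = math.ceil(records_per_second / 1000)          # 1,000 records/s per shard

      print(max(shards_for_throughput, shards_for_records))              # prints 10
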
  5. A Machine Learning Specialist is deciding between building a naive Bayesian model and a full Bayesian network for a classification problem. The Specialist computes the Pearson correlation coefficients between each pair of features and finds that their absolute values range from 0.1 to 0.95.

    Which model describes the underlying data in this situation?

    • A naive Bayesian model, since the features are all conditionally independent.
    • A full Bayesian network, since the features are all conditionally independent.
    • A naive Bayesian model, since some of the features are statistically dependent.
    • A full Bayesian network, since some of the features are statistically dependent.
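
    As a quick illustration of the check described in the question, pairwise Pearson coefficients can be computed with NumPy on synthetic data; absolute values well above zero between features indicate statistical dependence, which breaks the conditional-independence assumption behind a naive Bayesian model.

      import numpy as np

      rng = np.random.default_rng(0)
      x1 = rng.normal(size=1000)
      x2 = 0.9 * x1 + 0.1 * rng.normal(size=1000)   # strongly dependent on x1
      x3 = rng.normal(size=1000)                    # roughly independent of the others

      corr = np.corrcoef(np.vstack([x1, x2, x3]))
      print(np.round(np.abs(corr), 2))
      # Large off-diagonal values (e.g. ~0.95) signal dependent features,
      # which favors a full Bayesian network over naive Bayes.
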
  6. A Data Scientist is building a linear regression model and will use resulting p-values to evaluate the statistical significance of each coefficient. Upon inspection of the dataset, the Data Scientist discovers that most of the features are normally distributed. The plot of one feature in the dataset is shown in the graphic.

    [Figure: distribution plot of one feature in the dataset]

    What transformation should the Data Scientist apply to satisfy the statistical assumptions of the linear regression model?

    • Exponential transformation
    • Logarithmic transformation
    • Polynomial transformation
    • Sinusoidal transformation
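
    For a strongly right-skewed feature (the usual shape behind this question), a logarithmic transformation pulls the distribution toward normality so the regression assumptions, and therefore the p-values, are more trustworthy. A minimal sketch on synthetic skewed data:

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      feature = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # synthetic right-skewed feature

      log_feature = np.log1p(feature)  # logarithmic transform; log1p also handles zeros safely

      print("skewness before:", round(stats.skew(feature), 2))
      print("skewness after: ", round(stats.skew(log_feature), 2))
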
  7. A Machine Learning Specialist is assigned to a Fraud Detection team and must tune an XGBoost model that performs appropriately on test data but does not perform as expected on unknown data. The existing parameters are provided as follows.

    [Figure: existing XGBoost hyperparameter values]

    Which parameter tuning guidelines should the Specialist follow to avoid overfitting?

    • Increase the max_depth parameter value.
    • Lower the max_depth parameter value.
    • Update the objective to binary:logistic.
    • Lower the min_child_weight parameter value.
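
    To make the tuning direction concrete: in XGBoost, a lower max_depth (and, typically, a higher min_child_weight) constrains tree complexity and reduces overfitting. The values below are illustrative only; the actual values from the missing parameter listing are not reproduced here.

      import xgboost as xgb

      # Illustrative, regularization-oriented hyperparameters; not the question's values.
      params = {
          "objective": "binary:logistic",
          "max_depth": 4,            # lowered to limit tree complexity
          "min_child_weight": 6,     # raised so splits require more evidence
          "eta": 0.1,
          "subsample": 0.8,
          "eval_metric": "auc",
      }

      # dtrain and dvalid would be xgboost.DMatrix objects built from the fraud dataset:
      # booster = xgb.train(params, dtrain, num_boost_round=500,
      #                     evals=[(dvalid, "validation")], early_stopping_rounds=20)
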
  8. A data scientist is developing a pipeline to ingest streaming web traffic data. The data scientist needs to implement a process to identify unusual web traffic patterns as part of the pipeline. The patterns will be used downstream for alerting and incident response. The data scientist has access to unlabeled historic data to use, if needed.

    The solution needs to do the following:

    – Calculate an anomaly score for each web traffic entry.
    – Adapt unusual event identification to changing web patterns over time.

    Which approach should the data scientist implement to meet these requirements?

    • Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker Random Cut Forest (RCF) built-in model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the RCF model to calculate the anomaly score for each record.
    • Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker built-in XGBoost model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the XGBoost model to calculate the anomaly score for each record.
    • Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the k-Nearest Neighbors (kNN) SQL extension to calculate anomaly scores for each record using a tumbling window.
    • Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the Amazon Random Cut Forest (RCF) SQL extension to calculate anomaly scores for each record using a sliding window.
  9. A Data Scientist received a set of insurance records, each consisting of a record ID, the final outcome among 200 categories, and the date of the final outcome. Some partial information on claim contents is also provided, but only for a few of the 200 categories. For each outcome category, there are hundreds of records distributed over the past 3 years. The Data Scientist wants to predict how many claims to expect in each category from month to month, a few months in advance.

    What type of machine learning model should be used?

    • Classification month-to-month using supervised learning of the 200 categories based on claim contents.
    • Reinforcement learning using claim IDs and timestamps where the agent will identify how many claims in each category to expect from month to month.
    • Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from month to month.
    • Classification with supervised learning of the categories for which partial information on claim contents is provided, and forecasting using claim IDs and timestamps for all other categories.
  10. A company that promotes healthy sleep patterns by providing cloud-connected devices currently hosts a sleep tracking application on AWS. The application collects device usage information from device users. The company’s Data Science team is building a machine learning model to predict if and when a user will stop utilizing the company’s devices. Predictions from this model are used by a downstream application that determines the best approach for contacting users.

    The Data Science team is building multiple versions of the machine learning model to evaluate each version against the company’s business goals. To measure long-term effectiveness, the team wants to run multiple versions of the model in parallel for long periods of time, with the ability to control the portion of inferences served by the models.

    Which solution satisfies these requirements with MINIMAL effort?

    • Build and host multiple models in Amazon SageMaker. Create multiple Amazon SageMaker endpoints, one for each model. Programmatically control invoking different models for inference at the application layer.
    • Build and host multiple models in Amazon SageMaker. Create an Amazon SageMaker endpoint configuration with multiple production variants. Programmatically control the portion of the inferences served by the multiple models by updating the endpoint configuration.
    • Build and host multiple models in Amazon SageMaker Neo to take into account different types of medical devices. Programmatically control which model is invoked for inference based on the medical device type.
    • Build and host multiple models in Amazon SageMaker. Create a single endpoint that accesses multiple models. Use Amazon SageMaker batch transform to control invoking the different models through the single endpoint.
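
    The traffic-splitting requirement maps directly to SageMaker production variants: one endpoint, several model versions, each weighted by InitialVariantWeight. A minimal boto3 sketch; the model, config, and endpoint names are assumptions.

      import boto3

      sm = boto3.client("sagemaker")

      # Hypothetical names; the models are assumed to exist already in SageMaker.
      sm.create_endpoint_config(
          EndpointConfigName="churn-model-ab-config",
          ProductionVariants=[
              {
                  "VariantName": "model-v1",
                  "ModelName": "churn-model-v1",
                  "InstanceType": "ml.m5.large",
                  "InitialInstanceCount": 1,
                  "InitialVariantWeight": 0.8,   # 80% of inference traffic
              },
              {
                  "VariantName": "model-v2",
                  "ModelName": "churn-model-v2",
                  "InstanceType": "ml.m5.large",
                  "InitialInstanceCount": 1,
                  "InitialVariantWeight": 0.2,   # 20% of inference traffic
              },
          ],
      )

      sm.create_endpoint(EndpointName="churn-model-ab", EndpointConfigName="churn-model-ab-config")
      # Traffic shares can later be shifted without redeploying, for example with
      # update_endpoint_weights_and_capacities.
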
  11. An agricultural company is interested in using machine learning to detect specific types of weeds in a 100-acre grassland field. Currently, the company uses tractor-mounted cameras to capture multiple images of the field as 10 × 10 grids. The company also has a large training dataset that consists of annotated images of popular weed classes like broadleaf and non-broadleaf docks.

    The company wants to build a weed detection model that will detect specific types of weeds and the location of each type within the field. Once the model is ready, it will be hosted on Amazon SageMaker endpoints. The model will perform real-time inferencing using the images captured by the cameras.

    Which approach should a Machine Learning Specialist take to obtain accurate predictions?

    • Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes.
    • Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm.
    • Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm.
    • Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes.
  12. A manufacturer is operating a large number of factories with a complex supply chain relationship where unexpected downtime of a machine can cause production to stop at several factories. A data scientist wants to analyze sensor data from the factories to identify equipment in need of preemptive maintenance and then dispatch a service team to prevent unplanned downtime. The sensor readings from a single machine can include up to 200 data points including temperatures, voltages, vibrations, RPMs, and pressure readings.

    To collect this sensor data, the manufacturer deployed Wi-Fi and LANs across the factories. Even though many factory locations do not have reliable or high-speed internet connectivity, the manufacturer would like to maintain near-real-time inference capabilities.

    Which deployment architecture for the model will address these business requirements?

    • Deploy the model in Amazon SageMaker. Run sensor data through this model to predict which machines need maintenance.
    • Deploy the model on AWS IoT Greengrass in each factory. Run sensor data through this model to infer which machines need maintenance.
    • Deploy the model to an Amazon SageMaker batch transformation job. Generate inferences in a daily batch report to identify machines that need maintenance.
    • Deploy the model in Amazon SageMaker and use an IoT rule to write data to an Amazon DynamoDB table. Consume a DynamoDB stream from the table with an AWS Lambda function to invoke the endpoint.
  13. A Machine Learning Specialist is designing a scalable data storage solution for Amazon SageMaker. There is an existing TensorFlow-based model, implemented as a train.py script, that relies on static training data currently stored as TFRecords.

    Which method of providing training data to Amazon SageMaker would meet the business requirements with the LEAST development overhead?

    • Use Amazon SageMaker script mode and use train.py unchanged. Point the Amazon SageMaker training invocation to the local path of the data without reformatting the training data.
    • Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data.
    • Rewrite the train.py script to add a section that converts TFRecords to protobuf and ingests the protobuf data instead of TFRecords.
    • Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue or AWS Lambda to reformat and store the data in an Amazon S3 bucket.
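
    The lowest-overhead path described in the options is SageMaker script mode: keep train.py unchanged, upload the TFRecord files to S3, and point the estimator at that prefix. A minimal SageMaker Python SDK sketch; the role ARN, S3 URI, instance type, and framework versions are assumptions.

      from sagemaker.tensorflow import TensorFlow

      # Illustrative values only.
      estimator = TensorFlow(
          entry_point="train.py",            # existing script, unchanged
          role="arn:aws:iam::123456789012:role/SageMakerRole",  # assumed role ARN
          instance_count=1,
          instance_type="ml.p3.2xlarge",
          framework_version="2.11",
          py_version="py39",
      )

      # The TFRecord files are assumed to have been uploaded to this S3 prefix beforehand.
      estimator.fit({"training": "s3://example-bucket/tfrecords/"})
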
  14. The chief editor for a product catalog wants the research and development team to build a machine learning system that can be used to detect whether or not individuals in a collection of images are wearing the company’s retail brand. The team has a set of training data.

    Which machine learning algorithm should the researchers use that BEST meets their requirements?

    • Latent Dirichlet Allocation (LDA)
    • Recurrent neural network (RNN)
    • K-means
    • Convolutional neural network (CNN)
  15. A retail company is using Amazon Personalize to provide personalized product recommendations for its customers during a marketing campaign. The company sees a significant increase in sales of recommended items to existing customers immediately after deploying a new solution version, but these sales decrease a short time after deployment. Only historical data from before the marketing campaign is available for training.

    How should a data scientist adjust the solution?

    • Use the event tracker in Amazon Personalize to include real-time user interactions.
    • Add user metadata and use the HRNN-Metadata recipe in Amazon Personalize.
    • Implement a new solution using the built-in factorization machines (FM) algorithm in Amazon SageMaker.
    • Add event type and event value fields to the interactions dataset in Amazon Personalize.
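
    For context, the event-tracker option works by streaming live interactions back into the Personalize dataset group, so recommendations adapt to behavior observed after the campaign starts. A minimal boto3 sketch; the dataset group ARN, user, session, and item IDs are placeholders.

      from datetime import datetime
      import boto3

      personalize = boto3.client("personalize")
      personalize_events = boto3.client("personalize-events")

      # Create a tracker once per dataset group (ARN is a placeholder).
      tracker = personalize.create_event_tracker(
          name="campaign-tracker",
          datasetGroupArn="arn:aws:personalize:us-east-1:123456789012:dataset-group/retail",
      )
      tracking_id = tracker["trackingId"]

      # Send a real-time interaction as it happens in the web application.
      personalize_events.put_events(
          trackingId=tracking_id,
          userId="user-42",
          sessionId="session-1",
          eventList=[{
              "eventType": "purchase",
              "itemId": "sku-123",
              "sentAt": datetime.now(),
          }],
      )
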
  16. A machine learning (ML) specialist wants to secure calls to the Amazon SageMaker Service API. The specialist has configured Amazon VPC with a VPC interface endpoint for the Amazon SageMaker Service API and is attempting to secure traffic from specific sets of instances and IAM users. The VPC is configured with a single public subnet.

    Which combination of steps should the ML specialist take to secure the traffic? (Choose two.)

    • Add a VPC endpoint policy to allow access to the IAM users.
    • Modify the users’ IAM policy to allow access to Amazon SageMaker Service API calls only.
    • Modify the security group on the endpoint network interface to restrict access to the instances.
    • Modify the ACL on the endpoint network interface to restrict access to the instances.
    • Add a SageMaker Runtime VPC endpoint interface to the VPC.
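
    Two of the options above correspond to concrete changes on the interface endpoint: attaching an endpoint policy that limits which principals may reach the SageMaker API through it, and tightening the security group on the endpoint's network interface. A minimal boto3 sketch of the endpoint-policy piece; the endpoint ID and user ARN are placeholders.

      import json
      import boto3

      ec2 = boto3.client("ec2")

      # Placeholder endpoint ID and IAM user ARN.
      endpoint_policy = {
          "Version": "2012-10-17",
          "Statement": [{
              "Effect": "Allow",
              "Principal": {"AWS": ["arn:aws:iam::123456789012:user/ml-specialist"]},
              "Action": "sagemaker:*",
              "Resource": "*",
          }],
      }

      ec2.modify_vpc_endpoint(
          VpcEndpointId="vpce-0123456789abcdef0",
          PolicyDocument=json.dumps(endpoint_policy),
      )
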
  17. An e-commerce company wants to launch a new cloud-based product recommendation feature for its web application. Due to data localization regulations, any sensitive data must not leave its on-premises data center, and the product recommendation model must be trained and tested using nonsensitive data only. Data transfer to the cloud must use IPsec. The web application is hosted on premises with a PostgreSQL database that contains all the data. The company wants the data to be uploaded securely to Amazon S3 each day for model retraining.

    How should a machine learning specialist meet these requirements?

    • Create an AWS Glue job to connect to the PostgreSQL DB instance. Ingest tables without sensitive data through an AWS Site-to-Site VPN connection directly into Amazon S3.
    • Create an AWS Glue job to connect to the PostgreSQL DB instance. Ingest all data through an AWS Site-to-Site VPN connection into Amazon S3 while removing sensitive data using a PySpark job.
    • Use AWS Database Migration Service (AWS DMS) with table mapping to select PostgreSQL tables with no sensitive data through an SSL connection. Replicate data directly into Amazon S3.
    • Use PostgreSQL logical replication to replicate all data to PostgreSQL in Amazon EC2 through AWS Direct Connect with a VPN connection. Use AWS Glue to move data from Amazon EC2 to Amazon S3.
  18. A logistics company needs a forecast model to predict next month’s inventory requirements for a single item in 10 warehouses. A machine learning specialist uses Amazon Forecast to develop a forecast model from 3 years of monthly data. There is no missing data. The specialist selects the DeepAR+ algorithm to train a predictor. The predictor’s mean absolute percentage error (MAPE) is much larger than the MAPE produced by the current human forecasters.

    Which changes to the CreatePredictor API call could improve the MAPE? (Choose two.)

    • Set PerformAutoML to true.
    • Set ForecastHorizon to 4.
    • Set ForecastFrequency to W for weekly.
    • Set PerformHPO to true.
    • Set FeaturizationMethodName to filling.
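
    PerformAutoML and PerformHPO are flags on the CreatePredictor call itself; enabling either lets Forecast search over algorithms or hyperparameters rather than relying on the default DeepAR+ settings. A minimal boto3 sketch with placeholder names and ARNs, keeping a one-month horizon for monthly data:

      import boto3

      forecast = boto3.client("forecast")

      # Placeholder predictor name and dataset group ARN.
      forecast.create_predictor(
          PredictorName="inventory-deepar-hpo",
          AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",
          ForecastHorizon=1,                 # one month ahead for monthly data
          PerformAutoML=False,
          PerformHPO=True,                   # let Forecast tune DeepAR+ hyperparameters
          InputDataConfig={"DatasetGroupArn": "arn:aws:forecast:us-east-1:123456789012:dataset-group/inventory"},
          FeaturizationConfig={"ForecastFrequency": "M"},
      )
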
  19. A data scientist wants to use Amazon Forecast to build a forecasting model for inventory demand for a retail company. The company has provided a dataset of historic inventory demand for its products as a .csv file stored in an Amazon S3 bucket. The table below shows a sample of the dataset.

    [Table: sample rows from the historic inventory demand dataset]

    How should the data scientist transform the data?

    • Use ETL jobs in AWS Glue to separate the dataset into a target time series dataset and an item metadata dataset. Upload both datasets as .csv files to Amazon S3.
    • Use a Jupyter notebook in Amazon SageMaker to separate the dataset into a related time series dataset and an item metadata dataset. Upload both datasets as tables in Amazon Aurora.
    • Use AWS Batch jobs to separate the dataset into a target time series dataset, a related time series dataset, and an item metadata dataset. Upload them directly to Forecast from a local machine.
    • Use a Jupyter notebook in Amazon SageMaker to transform the data into the optimized protobuf recordIO format. Upload the dataset in this format to Amazon S3.
  20. A machine learning specialist is running an Amazon SageMaker endpoint using the built-in object detection algorithm on a P3 instance for real-time predictions in a company’s production application. When evaluating the model’s resource utilization, the specialist notices that the model is using only a fraction of the GPU.

    Which architecture changes would ensure that provisioned resources are being utilized effectively?

    • Redeploy the model as a batch transform job on an M5 instance.
    • Redeploy the model on an M5 instance. Attach Amazon Elastic Inference to the instance.
    • Redeploy the model on a P3dn instance.
    • Deploy the model onto an Amazon Elastic Container Service (Amazon ECS) cluster using a P3 instance.
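
    The Elastic Inference option corresponds to a deploy call that pairs a CPU instance with an accelerator attachment. A minimal SageMaker Python SDK sketch; the image URI, model artifact, role, and accelerator size are assumptions rather than values from the question.

      from sagemaker.model import Model

      # Placeholder image URI, model artifact, and role; these would come from the
      # existing object detection training job.
      model = Model(
          image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/object-detection:latest",
          model_data="s3://example-bucket/model/model.tar.gz",
          role="arn:aws:iam::123456789012:role/SageMakerRole",
      )

      predictor = model.deploy(
          initial_instance_count=1,
          instance_type="ml.m5.xlarge",        # general-purpose CPU instance
          accelerator_type="ml.eia2.medium",   # Elastic Inference accelerator attachment
      )
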