Last Updated on November 4, 2022 by InfraExam

DP-100 : Designing and Implementing a Data Science Solution on Azure : Part 08

  1. You create a binary classification model by using Azure Machine Learning Studio.

    You must tune hyperparameters by performing a parameter sweep of the model. The parameter sweep must meet the following requirements:

    – iterate all possible combinations of hyperparameters
    – minimize computing resources required to perform the sweep

    You need to perform a parameter sweep of the model.

    Which parameter sweep mode should you use?

    • Random sweep
    • Sweep clustering
    • Entire grid
    • Random grid

    Explanation:

    Maximum number of runs on random grid: This option also controls the number of iterations over a random sampling of parameter values, but the values are not generated randomly from the specified range; instead, a matrix is created of all possible combinations of parameter values and a random sampling is taken over the matrix. This method is more efficient and less prone to regional oversampling or undersampling.

    If you are training a model that supports an integrated parameter sweep, you can also set a range of seed values to use and iterate over the random seeds as well. This is optional, but can be useful for avoiding bias introduced by seed selection.

    Incorrect Answers:
    B: If you are building a clustering model, use Sweep Clustering to automatically determine the optimum number of clusters and other parameters.

    C: Entire grid: When you select this option, the module loops over a grid predefined by the system to try different combinations and identify the best learner. This option is useful when you don’t know what the best parameter settings might be and want to try all possible combinations of values.

  2. You are building a recurrent neural network to perform a binary classification.

    You review the training loss, validation loss, training accuracy, and validation accuracy for each training epoch.

    You need to analyze model performance.

    You need to identify whether the classification model is overfitted.

    Which of the following is correct?

    • The training loss stays constant and the validation loss stays constant at a value close to the training loss when training the model.
    • The training loss decreases while the validation loss increases when training the model.
    • The training loss stays constant and the validation loss decreases when training the model.
    • The training loss increases while the validation loss decreases when training the model.
    Explanation:
    An overfit model is one where performance on the train set is good and continues to improve, whereas performance on the validation set improves to a point and then begins to degrade.
  3. Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

    After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

    You have a Python script named train.py in a local folder named scripts. The script trains a regression model by using scikit-learn. The script includes code to load a training data file which is also located in the scripts folder.

    You must run the script as an Azure ML experiment on a compute cluster named aml-compute.

    You need to configure the run to ensure that the environment includes the required packages for model training. You have instantiated a variable named aml-compute that references the target compute cluster.

    Solution: Run the following code:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q03 139

    Does the solution meet the goal?

    •  Yes
    • No
    Explanation:

    There is a missing line: conda_packages=['scikit-learn'], which is needed.

    Correct example:
    sk_est = Estimator(source_directory='./my-sklearn-proj',
                       script_params=script_params,
                       compute_target=compute_target,
                       entry_script='train.py',
                       conda_packages=['scikit-learn'])

    Note:
    The Estimator class represents a generic estimator to train data using any supplied framework.

    This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for Chainer, PyTorch, TensorFlow, and SKLearn.

    Example:
    from azureml.train.estimator import Estimator

    script_params = {
        # to mount files referenced by mnist dataset
        '--data-folder': ds.as_named_input('mnist').as_mount(),
        '--regularization': 0.8
    }
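
    For context, a minimal end-to-end sketch of the corrected solution (assuming a workspace configuration file is available and the aml-compute cluster from the scenario exists) might look like this:

    from azureml.core import Workspace, Experiment
    from azureml.train.estimator import Estimator

    ws = Workspace.from_config()
    compute_target = ws.compute_targets['aml-compute']

    # Estimator that runs train.py from the local scripts folder on the cluster,
    # with scikit-learn added to the conda environment of the run.
    sk_est = Estimator(source_directory='./scripts',
                       compute_target=compute_target,
                       entry_script='train.py',
                       conda_packages=['scikit-learn'])

    run = Experiment(workspace=ws, name='train-regression').submit(sk_est)
    run.wait_for_completion(show_output=True)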

  4. You are performing clustering by using the K-means algorithm.

    You need to define the possible termination conditions.

    Which three conditions can you use? Each correct answer presents a complete solution.

    NOTE: Each correct selection is worth one point.

    • Centroids do not change between iterations.
    • The residual sum of squares (RSS) rises above a threshold.
    • The residual sum of squares (RSS) falls below a threshold.
    • A fixed number of iterations is executed.
    • The sum of distances between centroids reaches a maximum.
    Explanation:

    AD: The algorithm terminates when the centroids stabilize or when a specified number of iterations are completed.

    C: A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vector from its centroid summed over all vectors. RSS is the objective function and our goal is to minimize it.
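
    As a hedged illustration (scikit-learn rather than an exam module), the max_iter and tol parameters of KMeans map directly onto the fixed-iteration and centroid-stability termination conditions:

    from sklearn.cluster import KMeans
    import numpy as np

    X = np.random.rand(200, 2)

    # max_iter bounds the number of iterations; tol stops the run early once the
    # centroids move less than the tolerance between iterations (centroid stability).
    kmeans = KMeans(n_clusters=3, max_iter=100, tol=1e-4, n_init=10, random_state=0)
    kmeans.fit(X)
    print(kmeans.n_iter_)   # actual number of iterations executed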

  5. HOTSPOT

    You are using C-Support Vector classification to do a multi-class classification with an unbalanced training dataset. The C-Support Vector classification using Python code shown below:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q05 140

    You need to evaluate the C-Support Vector classification code.

    Which evaluation statement should you use? To answer, select the appropriate options in the answer area.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q05 141 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q05 141 Answer
    Explanation:

    Box 1: Automatically adjust weights inversely proportional to class frequencies in the input data
    The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

    Box 2: Penalty parameter
    Parameter: C : float, optional (default=1.0)
    Penalty parameter C of the error term.
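
    A minimal scikit-learn sketch of the two settings discussed above, class_weight='balanced' and the penalty parameter C; the dataset here is synthetic and only for illustration:

    from sklearn.svm import SVC
    from sklearn.datasets import make_classification

    # Imbalanced toy dataset (90% / 10%) standing in for the unbalanced training data.
    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

    # class_weight='balanced' re-weights classes as n_samples / (n_classes * np.bincount(y));
    # C is the penalty parameter of the error term (default 1.0).
    clf = SVC(C=1.0, class_weight='balanced')
    clf.fit(X, y)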

  6. You are building a machine learning model for translating English language textual content into French language textual content.

    You need to build and train the machine learning model to learn the sequence of the textual content.

    Which type of neural network should you use?

    • Multilayer Perceptions (MLPs)
    • Convolutional Neural Networks (CNNs)
    • Recurrent Neural Networks (RNNs)
    • Generative Adversarial Networks (GANs)
    Explanation:

    To translate a corpus of English text to French, we need to build a recurrent neural network (RNN).

    Note: RNNs are designed to take sequences of text as inputs or return sequences of text as outputs, or both. They’re called recurrent because the network’s hidden layers have a loop in which the output and cell state from each time step become inputs at the next time step. This recurrence serves as a form of memory. It allows contextual information to flow through the network so that relevant outputs from previous time steps can be applied to network operations at the current time step.
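
    A minimal PyTorch sketch (an assumption, not the exam's model) showing the recurrence described above: the hidden state produced at each time step is fed back in at the next step.

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)

    x = torch.randn(4, 10, 16)      # 4 sequences, 10 time steps, 16 features per step
    output, h_n = rnn(x)            # output: hidden state at every step; h_n: final hidden state

    # In a sequence-to-sequence translator, h_n would summarize the English sentence
    # and seed a decoder RNN that emits the French tokens one step at a time.
    print(output.shape, h_n.shape)  # torch.Size([4, 10, 32]) torch.Size([1, 4, 32])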

  7. You create a binary classification model.

    You need to evaluate the model performance.

    Which two metrics can you use? Each correct answer presents a complete solution.

    NOTE: Each correct selection is worth one point.

    • relative absolute error
    • precision
    • accuracy
    • mean absolute error
    • coefficient of determination
    Explanation:

    The evaluation metrics available for binary classification models are: Accuracy, Precision, Recall, F1 Score, and AUC.

    Note: A very natural question is: 'Out of the individuals whom the model predicted to be positive, how many were classified correctly (TP)?'
    This question can be answered by looking at the precision of the model, which is the proportion of predicted positives that are classified correctly.
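
    For reference, a small scikit-learn sketch computing the two metrics from predicted and true labels (the label vectors are made up for illustration):

    from sklearn.metrics import accuracy_score, precision_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print(accuracy_score(y_true, y_pred))    # fraction of all predictions that are correct
    print(precision_score(y_true, y_pred))   # fraction of predicted positives that are true positives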

  8. You create a script that trains a convolutional neural network model over multiple epochs and logs the validation loss after each epoch. The script includes arguments for batch size and learning rate.

    You identify a set of batch size and learning rate values that you want to try.

    You need to use Azure Machine Learning to find the combination of batch size and learning rate that results in the model with the lowest validation loss.

    What should you do?

    • Run the script in an experiment based on an AutoMLConfig object
    • Create a PythonScriptStep object for the script and run it in a pipeline
    • Use the Automated Machine Learning interface in Azure Machine Learning studio
    • Run the script in an experiment based on a ScriptRunConfig object
    • Run the script in an experiment based on a HyperDriveConfig object
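
    For context, sweeping a set of discrete batch size and learning rate values over a training script is the scenario HyperDriveConfig addresses. A minimal sketch, assuming a ScriptRunConfig named src and hypothetical script argument names:

    from azureml.train.hyperdrive import (HyperDriveConfig, GridParameterSampling,
                                          PrimaryMetricGoal, choice)

    # src: a ScriptRunConfig for the training script (assumed to exist already)
    param_sampling = GridParameterSampling({
        '--batch-size': choice(32, 64, 128),
        '--learning-rate': choice(0.001, 0.01, 0.1)
    })

    hd_config = HyperDriveConfig(run_config=src,
                                 hyperparameter_sampling=param_sampling,
                                 primary_metric_name='validation_loss',  # the metric logged each epoch
                                 primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
                                 max_total_runs=9)
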
  9. You use the Azure Machine Learning Python SDK to define a pipeline to train a model.

    The data used to train the model is read from a folder in a datastore.

    You need to ensure the pipeline runs automatically whenever the data in the folder changes.

    What should you do?

    • Set the regenerate_outputs property of the pipeline to True
    • Create a ScheduleRecurrance object with a Frequency of auto. Use the object to create a Schedule for the pipeline
    • Create a PipelineParameter with a default value that references the location where the training data is stored
    • Create a Schedule for the pipeline. Specify the datastore in the datastore property, and the folder containing the training data in the path_on_datastore property
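
    For context, the reactive-schedule approach might be sketched as follows (workspace, published pipeline, datastore, and folder path are placeholders):

    from azureml.pipeline.core import Schedule

    # Triggers the published pipeline whenever files change under the monitored folder.
    schedule = Schedule.create(workspace=ws,
                               name='retrain-on-new-data',
                               pipeline_id=published_pipeline.id,     # assumed published pipeline
                               experiment_name='training-pipeline',
                               datastore=datastore,                   # the datastore being monitored
                               path_on_datastore='training-data',     # folder containing the data
                               polling_interval=5)                    # minutes between checks
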
  10. You plan to run a Python script as an Azure Machine Learning experiment.

    The script must read files from a hierarchy of folders. The files will be passed to the script as a dataset argument.

    You must specify an appropriate mode for the dataset argument.

    Which two modes can you use? Each correct answer presents a complete solution.

    NOTE: Each correct selection is worth one point.

    • to_pandas_dataframe()
    • as_download()
    • as_upload()
    • as_mount()
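
    For context, a sketch of both usable modes on a registered FileDataset (the dataset name and workspace variable are assumptions):

    from azureml.core import Dataset

    file_ds = Dataset.get_by_name(ws, name='training-files')   # hypothetical FileDataset

    # Mount the folder hierarchy on the compute target (no copy up front)...
    mounted_arg = file_ds.as_named_input('training_files').as_mount()

    # ...or download the whole hierarchy to local disk before the script starts.
    downloaded_arg = file_ds.as_named_input('training_files').as_download()

    # Either value can then be passed in the arguments list of a ScriptRunConfig.
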
  11. Case study

    Overview

    You are a data scientist in a company that provides data science for professional sporting events. Models will use global and local market data to meet the following business goals:

    – Understand sentiment of mobile device users at sporting events based on audio from crowd reactions.
    – Assess a user’s tendency to respond to an advertisement.
    – Customize styles of ads served on mobile devices.
    – Use video to detect penalty events

    Current environment

    – Media used for penalty event detection will be provided by consumer devices. Media may include images and videos captured during the sporting event and shared using social media. The images and videos will have varying sizes and formats.
    – The data available for model building comprises seven years of sporting event media. The sporting event media includes recorded videos, transcripts of radio commentary, and logs from related social media feeds captured during the sporting events.
    – Crowd sentiment will include audio recordings submitted by event attendees in both mono and stereo formats.

    Penalty detection and sentiment

    – Data scientists must build an intelligent solution by using multiple machine learning models for penalty event detection.
    – Data scientists must build notebooks in a local environment using automatic feature engineering and model building in machine learning pipelines.
    – Notebooks must be deployed to retrain by using Spark instances with dynamic worker allocation.
    – Notebooks must execute with the same code on new Spark instances to recode only the source of the data.
    – Global penalty detection models must be trained by using dynamic runtime graph computation during training.
    – Local penalty detection models must be written by using BrainScript.
    – Experiments for local crowd sentiment models must combine local penalty detection data.
    – Crowd sentiment models must identify known sounds such as cheers and known catch phrases. Individual crowd sentiment models will detect similar sounds.
    – All shared features for local models are continuous variables.
    – Shared features must use double precision. Subsequent layers must have aggregate running mean and standard deviation metrics available.

    Advertisements

    During the initial weeks in production, the following was observed:

    – Ad response rates declined.
    – Drops were not consistent across ad styles.
    – The distribution of features across training and production data is not consistent.

    Analysis shows that, of the 100 numeric features on user location and behavior, the 47 features that come from location sources are being used as raw features. A suggested experiment to remedy the bias and variance issue is to engineer 10 linearly uncorrelated features.

    – Initial data discovery shows a wide range of densities of target states in training data used for crowd sentiment models.
    – All penalty detection models show that inference phases using Stochastic Gradient Descent (SGD) are running too slowly.
    – Audio samples show that the length of a catch phrase varies between 25% and 47%, depending on region.
    – The performance of the global penalty detection models shows lower variance but higher bias when comparing training and validation sets. Before implementing any feature changes, you must confirm the bias and variance using all training and validation cases.
    – Ad response models must be trained at the beginning of each event and applied during the sporting event.
    – Market segmentation models must optimize for similar ad response history.
    – Sampling must guarantee mutual and collective exclusivity between local and global segmentation models that share the same features.
    – Local market segmentation models will be applied before determining a user’s propensity to respond to an advertisement.
    – Ad response models must support non-linear boundaries of features.
    – The ad propensity model uses a cut threshold of 0.45, and retraining occurs if weighted Kappa deviates from 0.1 +/- 5%.
    – The ad propensity model uses cost factors shown in the following diagram:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 142

    – The ad propensity model uses proposed cost factors shown in the following diagram:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 143

    – Performance curves of current and proposed cost factor scenarios are shown in the following diagram:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 144
    1. You need to implement a scaling strategy for the local penalty detection data.

      Which normalization type should you use?

      • Streaming
      • Weight
      • Batch
      • Cosine
      Explanation:

      Post batch normalization statistics (PBN) is the Microsoft Cognitive Toolkit (CNTK) approach to evaluating the population mean and variance of batch normalization for use during inference, as described in the original batch normalization paper.
      In CNTK, custom networks are defined using the BrainScriptNetworkBuilder and described in the CNTK network description language “BrainScript.”

      Scenario:
      Local penalty detection models must be written by using BrainScript.

    2. HOTSPOT

      You need to use the Python language to build a sampling strategy for the global penalty detection models.

      How should you complete the code segment? To answer, select the appropriate options in the answer area.

      NOTE: Each correct selection is worth one point.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 145 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 145 Answer
      Explanation:

      Box 1: import pytorch as deeplearninglib

      Box 2: ..DistributedSampler(Sampler)..
      DistributedSampler(Sampler):
      Sampler that restricts data loading to a subset of the dataset.
      It is especially useful in conjunction with class:`torch.nn.parallel.DistributedDataParallel`. In such case, each process can pass a DistributedSampler instance as a DataLoader sampler, and load a subset of the original dataset that is exclusive to it.

      Scenario: Sampling must guarantee mutual and collective exclusivity between local and global segmentation models that share the same features.

      Box 3: optimizer = deeplearninglib.train.GradientDescentOptimizer(learning_rate=0.10)

      Incorrect Answers: ..SGD..
      Scenario: All penalty detection models show that inference phases using Stochastic Gradient Descent (SGD) are running too slowly.

      Box 4: .. nn.parallel.DistributedDataParallel..
      DistributedSampler(Sampler): The sampler that restricts data loading to a subset of the dataset.
      It is especially useful in conjunction with :class:`torch.nn.parallel.DistributedDataParallel`.
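
      A minimal PyTorch sketch of the sampler/parallel-wrapper pairing described above (it assumes the distributed process group has already been initialized for each worker process):

      import torch
      import torch.nn as nn
      from torch.utils.data import DataLoader, TensorDataset
      from torch.utils.data.distributed import DistributedSampler

      dataset = TensorDataset(torch.randn(1000, 20), torch.randint(0, 2, (1000,)))

      # Each process loads a mutually exclusive shard of the dataset.
      sampler = DistributedSampler(dataset)
      loader = DataLoader(dataset, batch_size=32, sampler=sampler)

      # The model is wrapped so gradients are synchronized across processes.
      model = nn.parallel.DistributedDataParallel(nn.Linear(20, 2))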

    3. DRAG DROP

      You need to define an evaluation strategy for the crowd sentiment models.

      Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 146 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 146 Answer
      Explanation:

      Scenario:
      Experiments for local crowd sentiment models must combine local penalty detection data.
      Crowd sentiment models must identify known sounds such as cheers and known catch phrases. Individual crowd sentiment models will detect similar sounds.

      Note: Evaluate the change in correlation between model error rate and centroid distance.
      In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation.

    4. You need to implement a feature engineering strategy for the crowd sentiment local models.

      What should you do?

      • Apply an analysis of variance (ANOVA).
      • Apply a Pearson correlation coefficient.
      • Apply a Spearman correlation coefficient.
      • Apply a linear discriminant analysis.
      Explanation:

      The linear discriminant analysis method works only on continuous variables, not categorical or ordinal variables.

      Linear discriminant analysis is similar to analysis of variance (ANOVA) in that it works by comparing the means of the variables.

      Scenario:
      Data scientists must build notebooks in a local environment using automatic feature engineering and model building in machine learning pipelines.
      Experiments for local crowd sentiment models must combine local penalty detection data.
      All shared features for local models are continuous variables.

      Incorrect Answers:
      B: The Pearson correlation coefficient, sometimes called Pearson’s R test, is a statistical value that measures the linear relationship between two variables. By examining the coefficient values, you can infer something about the strength of the relationship between the two variables, and whether they are positively correlated or negatively correlated.

      C: Spearman’s correlation coefficient is designed for use with non-parametric and non-normally distributed data. Spearman’s coefficient is a nonparametric measure of statistical dependence between two variables, and is sometimes denoted by the Greek letter rho. The Spearman’s coefficient expresses the degree to which two variables are monotonically related. It is also called Spearman rank correlation, because it can be used with ordinal variables.
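
      Returning to the selected method, here is a scikit-learn sketch of linear discriminant analysis used as a feature-engineering step on continuous features (synthetic data only, for illustration):

      from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
      from sklearn.datasets import make_classification

      X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                                 n_classes=3, random_state=0)

      # Projects the continuous features onto at most (n_classes - 1) discriminant axes,
      # comparing class means in the spirit of ANOVA.
      lda = LinearDiscriminantAnalysis(n_components=2)
      X_reduced = lda.fit_transform(X, y)
      print(X_reduced.shape)   # (300, 2)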

    5. DRAG DROP

      You need to define a modeling strategy for ad response.

      Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 147 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 147 Answer
      Explanation:

      Step 1: Implement a K-Means Clustering model

      Step 2: Use the cluster as a feature in a Decision jungle model.
      Decision jungles are non-parametric models, which can represent non-linear decision boundaries.

      Step 3: Use the raw score as a feature in a Score Matchbox Recommender model
      The goal of creating a recommendation system is to recommend one or more “items” to “users” of the system. Examples of an item could be a movie, restaurant, book, or song. A user could be a person, group of persons, or other entity with item preferences.

      Scenario:
      Ad response rates declined.
      Ad response models must be trained at the beginning of each event and applied during the sporting event.
      Market segmentation models must optimize for similar ad response history.
      Ad response models must support non-linear boundaries of features.

    6. DRAG DROP

      You need to define an evaluation strategy for the crowd sentiment models.

      Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 148 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 148 Answer
      Explanation:

      Step 1: Define a cross-entropy function activation
      When using a neural network to perform classification and prediction, it is usually better to use cross-entropy error than classification error, and somewhat better to use cross-entropy error than mean squared error to evaluate the quality of the neural network.

      Step 2: Add cost functions for each target state.

      Step 3: Evaluate the distance error metric.

    7. You need to implement a model development strategy to determine a user’s tendency to respond to an ad.

      Which technique should you use?

      • Use a Relative Expression Split module to partition the data based on centroid distance.
      • Use a Relative Expression Split module to partition the data based on distance travelled to the event.
      • Use a Split Rows module to partition the data based on distance travelled to the event.
      • Use a Split Rows module to partition the data based on centroid distance.
      Explanation:

      Split Data partitions the rows of a dataset into two distinct sets.
      The Relative Expression Split option in the Split Data module of Azure Machine Learning Studio is helpful when you need to divide a dataset into training and testing datasets using a numerical expression.

      Relative Expression Split: Use this option whenever you want to apply a condition to a number column. The number could be a date/time field, a column containing age or dollar amounts, or even a percentage. For example, you might want to divide your data set depending on the cost of the items, group people by age ranges, or separate data by a calendar date.

      Scenario:
      Local market segmentation models will be applied before determining a user’s propensity to respond to an advertisement.
      The distribution of features across training and production data is not consistent.

    8. You need to implement a new cost factor scenario for the ad response models as illustrated in the performance curve exhibit.

      Which technique should you use?

      • Set the threshold to 0.5 and retrain if weighted Kappa deviates +/- 5% from 0.45.
      • Set the threshold to 0.05 and retrain if weighted Kappa deviates +/- 5% from 0.5.
      • Set the threshold to 0.2 and retrain if weighted Kappa deviates +/- 5% from 0.6.
      • Set the threshold to 0.75 and retrain if weighted Kappa deviates +/- 5% from 0.15.
      Explanation:

      Scenario:
      Performance curves of current and proposed cost factor scenarios are shown in the following diagram:

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q11 149

      The ad propensity model uses a cut threshold of 0.45, and retraining occurs if weighted Kappa deviates from 0.1 +/- 5%.

  12. Case study

    This is a case study. Case studies are not timed separately. You can use as much exam time as you would like to complete each case. However, there may be additional case studies and sections on this exam. You must manage your time to ensure that you are able to complete all questions included on this exam in the time provided.

    To answer the questions included in a case study, you will need to reference information that is provided in the case study. Case studies might contain exhibits and other resources that provide more information about the scenario that is described in the case study. Each question is independent of the other questions in this case study.

    At the end of this case study, a review screen will appear. This screen allows you to review your answers and to make changes before you move to the next section of the exam. After you begin a new section, you cannot return to this section.

    To start the case study
    To display the first question in this case study, click the Next button. Use the buttons in the left pane to explore the content of the case study before you answer the questions. Clicking these buttons displays information such as business requirements, existing environment, and problem statements. If the case study has an All Information tab, note that the information displayed is identical to the information displayed on the subsequent tabs. When you are ready to answer a question, click the Question button to return to the question.

    Overview

    You are a data scientist for Fabrikam Residences, a company specializing in quality private and commercial property in the United States. Fabrikam Residences is considering expanding into Europe and has asked you to investigate prices for private residences in major European cities.
    You use Azure Machine Learning Studio to measure the median value of properties. You produce a regression model to predict property prices by using the Linear Regression and Bayesian Linear Regression modules.

    Datasets

    There are two datasets in CSV format that contain property details for two cities, London and Paris. You add both files to Azure Machine Learning Studio as separate datasets to the starting point for an experiment. Both datasets contain the following columns:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 150

    An initial investigation shows that the datasets are identical in structure apart from the MedianValue column. The smaller Paris dataset contains the MedianValue in text format, whereas the larger London dataset contains the MedianValue in numerical format.

    Data issues

    Missing values

    The AccessibilityToHighway column in both datasets contains missing values. The missing data must be replaced with new data so that it is modeled conditionally using the other variables in the data before filling in the missing values.

    Columns in each dataset contain missing and null values. The datasets also contain many outliers. The Age column has a high proportion of outliers. You need to remove the rows that have outliers in the Age column. The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You need to select a feature selection algorithm to analyze the relationship between the two columns in more detail.

    Model fit

    The model shows signs of overfitting. You need to produce a more refined regression model that reduces the overfitting.

    Experiment requirements

    You must set up the experiment to cross-validate the Linear Regression and Bayesian Linear Regression modules to evaluate performance. In each case, the predictor of the dataset is the column named MedianValue. You must ensure that the datatype of the MedianValue column of the Paris dataset matches the structure of the London dataset.

    You must prioritize the columns of data for predicting the outcome. You must use non-parametric statistics to measure relationships.

    You must use a feature selection algorithm to analyze the relationship between the MedianValue and AvgRoomsInHouse columns.

    Model training

    Permutation Feature Importance

    Given a trained model and a test dataset, you must compute the Permutation Feature Importance scores of feature variables. You must determine the absolute fit for the model.

    Hyperparameters

    You must configure hyperparameters in the model learning process to speed the learning phase. In addition, this configuration should cancel the lowest performing runs at each evaluation interval, thereby directing effort and resources towards models that are more likely to be successful.

    You are concerned that the model might not efficiently use compute resources in hyperparameter tuning. You are also concerned that the model might prevent an increase in the overall tuning time. Therefore, you must implement an early stopping criterion on models that provides savings without terminating promising jobs.

    Testing

    You must produce multiple partitions of a dataset based on sampling using the Partition and Sample module in Azure Machine Learning Studio.

    Cross-validation

    You must create three equal partitions for cross-validation. You must also configure the cross-validation process so that the rows in the test and training datasets are divided evenly by properties that are near each city’s main river. You must complete this task before the data goes through the sampling process.

    Linear regression module

    When you train a Linear Regression module, you must determine the best features to use in a model. You can choose standard metrics provided to measure performance before and after the feature importance process completes. The distribution of features across multiple training models must be consistent.

    Data visualization

    You need to provide the test results to the Fabrikam Residences team. You create data visualizations to aid in presenting the results.

    You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test evaluation of the model. You need to select appropriate methods for producing the ROC curve in Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class Decision Jungle modules with one another.

    1. HOTSPOT

      You need to replace the missing data in the AccessibilityToHighway columns.

      How should you configure the Clean Missing Data module? To answer, select the appropriate options in the answer area.

      NOTE: Each correct selection is worth one point.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 151 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 151 Answer
      Explanation:

      Box 1: Replace using MICE
      Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as “Multivariate Imputation using Chained Equations” or “Multiple Imputation by Chained Equations”. With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.

      Scenario: The AccessibilityToHighway column in both datasets contains missing values. The missing data must be replaced with new data so that it is modeled conditionally using the other variables in the data before filling in the missing values.

      Box 2: Propagate
      The Cols with all missing values option indicates whether columns that consist entirely of missing values should be preserved in the output.

    2. DRAG DROP

      You need to produce a visualization for the diagnostic test evaluation according to the data visualization requirements.

      Which three modules should you recommend be used in sequence? To answer, move the appropriate modules from the list of modules to the answer area and arrange them in the correct order.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 152 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 152 Answer
      Explanation:

      Step 1: Sweep Clustering
      Start by using the “Tune Model Hyperparameters” module to select the best sets of parameters for each of the models we’re considering.

      One of the interesting things about the “Tune Model Hyperparameters” module is that it not only outputs the results from the Tuning, it also outputs the Trained Model.

      Step 2: Train Model

      Step 3: Evaluate Model

      Scenario: You need to provide the test results to the Fabrikam Residences team. You create data visualizations to aid in presenting the results.

      You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test evaluation of the model. You need to select appropriate methods for producing the ROC curve in Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class Decision Jungle modules with one another.

    3. You need to visually identify whether outliers exist in the Age column and quantify the outliers before the outliers are removed.

      Which three Azure Machine Learning Studio modules should you use? Each correct answer presents part of the solution.

      NOTE: Each correct selection is worth one point.

      • Create Scatterplot
      • Summarize Data
      • Clip Values
      • Replace Discrete Values
      • Build Counting Transform
      Explanation:

      B: To have a global view, the summarize data module can be used. Add the module and connect it to the data set that needs to be visualized.
      A: One way to quickly identify Outliers visually is to create scatter plots.

      C: The easiest way to treat the outliers in Azure ML is to use the Clip Values module. It can identify and optionally replace data values that are above or below a specified threshold.

      You can use the Clip Values module in Azure Machine Learning Studio, to identify and optionally replace data values that are above or below a specified threshold. This is useful when you want to remove outliers or replace them with a mean, a constant, or other substitute value.

    4. HOTSPOT

      You need to identify the methods for dividing the data according to the testing requirements.

      Which properties should you select? To answer, select the appropriate options in the answer area.

      NOTE: Each correct selection is worth one point.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 153 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 153 Answer
      Explanation:

      Scenario: Testing
      You must produce multiple partitions of a dataset based on sampling using the Partition and Sample module in Azure Machine Learning Studio.

      Box 1: Assign to folds
      Use Assign to folds option when you want to divide the dataset into subsets of the data. This option is also useful when you want to create a custom number of folds for cross-validation, or to split rows into several groups.

      Not Head: Use Head mode to get only the first n rows. This option is useful if you want to test a pipeline on a small number of rows, and don’t need the data to be balanced or sampled in any way.

      Not Sampling: The Sampling option supports simple random sampling or stratified random sampling. This is useful if you want to create a smaller representative sample dataset for testing.

      Box 2: Partition evenly
      Specify the partitioner method: Indicate how you want data to be apportioned to each partition, using these options:
      – Partition evenly: Use this option to place an equal number of rows in each partition. To specify the number of output partitions, type a whole number in the Specify number of folds to split evenly into text box.

    5. HOTSPOT

      You need to configure the Edit Metadata module so that the structure of the datasets match.

      Which configuration options should you select? To answer, select the appropriate options in the answer area.

      NOTE: Each correct selection is worth one point.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 154 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 154 Answer
      Explanation:

      Box 1: Floating point
      Need floating point for Median values.

      Scenario: An initial investigation shows that the datasets are identical in structure apart from the MedianValue column. The smaller Paris dataset contains the MedianValue in text format, whereas the larger London dataset contains the MedianValue in numerical format.

      Box 2: Unchanged

      Note: Select the Categorical option to specify that the values in the selected columns should be treated as categories.

      For example, you might have a column that contains the numbers 0, 1, and 2, but know that the numbers actually mean “Smoker”, “Non smoker” and “Unknown”. In that case, by flagging the column as categorical you can ensure that the values are not used in numeric calculations, only to group data.

    6. HOTSPOT

      You need to configure the Permutation Feature Importance module for the model training requirements.

      What should you do? To answer, select the appropriate options in the dialog box in the answer area.

      NOTE: Each correct selection is worth one point.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 155 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 155 Answer
      Explanation:

      Box 1: 500
      For Random seed, type a value to use as seed for randomization. If you specify 0 (the default), a number is generated based on the system clock.

      A seed value is optional, but you should provide a value if you want reproducibility across runs of the same experiment.
      Here we must replicate the findings.

      Box 2: Mean Absolute Error
      Scenario: Given a trained model and a test dataset, you must compute the Permutation Feature Importance scores of feature variables. You need to set up the Permutation Feature Importance module to select the correct metric to investigate the model’s accuracy and replicate the findings.

      Regression. Choose one of the following: Precision, Recall, Mean Absolute Error, Root Mean Squared Error, Relative Absolute Error, Relative Squared Error, Coefficient of Determination

    7. You need to select a feature extraction method.

      Which method should you use?

      • Mutual information
      • Pearson’s correlation
      • Spearman correlation
      • Fisher Linear Discriminant Analysis
      Explanation:

      Spearman’s rank correlation coefficient assesses how well the relationship between two variables can be described using a monotonic function.

      Note: Both Spearman’s and Kendall’s can be formulated as special cases of a more general correlation coefficient, and they are both appropriate in this scenario.

      Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You need to select a feature selection algorithm to analyze the relationship between the two columns in more detail.

      Incorrect Answers:
      B: The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson’s correlation assesses linear relationships, Spearman’s correlation assesses monotonic relationships (whether linear or not).
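
      Returning to the selected method (Spearman), a small sketch, assuming a pandas DataFrame df that contains the scenario's two numeric columns:

      from scipy.stats import spearmanr

      # Spearman's rho is rank-based, so it captures monotonic (not only linear) relationships.
      rho, p_value = spearmanr(df['MedianValue'], df['AvgRoomsInHouse'])
      print(rho, p_value)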

    8. HOTSPOT

      You need to set up the Permutation Feature Importance module according to the model training requirements.

      Which properties should you select? To answer, select the appropriate options in the answer area.

      NOTE: Each correct selection is worth one point.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 156 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 156 Answer
      Explanation:

      Box 1: Accuracy

      Scenario: You must configure hyperparameters in the model learning process to speed the learning phase. In addition, this configuration should cancel the lowest performing runs at each evaluation interval, thereby directing effort and resources towards models that are more likely to be successful.

      Box 2: R-Squared

    9. HOTSPOT

      You need to configure the Filter Based Feature Selection module based on the experiment requirements and datasets.

      How should you configure the module properties? To answer, select the appropriate options in the dialog box in the answer area.

      NOTE: Each correct selection is worth one point.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 157 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 157 Answer
      Explanation:

      Box 1: Mutual Information.
      The mutual information score is particularly useful in feature selection because it maximizes the mutual information between the joint distribution and target variables in datasets with many dimensions.

      Box 2: MedianValue
      MedianValue is the feature column; it is the predictor of the dataset.

      Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You need to select a feature selection algorithm to analyze the relationship between the two columns in more detail.

    10. You need to select a feature extraction method.

      Which method should you use?

      • Mutual information
      • Mood’s median test
      • Kendall correlation
      • Permutation Feature Importance
      Explanation:

      In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall’s tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities.
      It is one of the methods supported by Azure Machine Learning feature selection.

      Note: Both Spearman’s and Kendall’s can be formulated as special cases of a more general correlation coefficient, and they are both appropriate in this scenario.

      Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You need to select a feature selection algorithm to analyze the relationship between the two columns in more detail.

    11. DRAG DROP

      You need to implement an early stopping criteria policy for model training.

      Which three code segments should you use to develop the solution? To answer, move the appropriate code segments from the list of code segments to the answer area and arrange them in the correct order.

      NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 158 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 158 Answer
      Explanation:

      You need to implement an early stopping criterion on models that provides savings without terminating promising jobs.

      Truncation selection cancels a given percentage of lowest performing runs at each evaluation interval. Runs are compared based on their performance on the primary metric and the lowest X% are terminated.

      Example:
      from azureml.train.hyperdrive import TruncationSelectionPolicy
      early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)

      Incorrect Answers:
      Bandit is a termination policy based on slack factor/slack amount and evaluation interval. The policy early terminates any runs where the primary metric is not within the specified slack factor / slack amount with respect to the best performing training run.

      Example:
      from azureml.train.hyperdrive import BanditPolicy
      early_termination_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1, delay_evaluation=5)

    12. DRAG DROP

      You need to implement early stopping criteria as stated in the model training requirements.

      Which three code segments should you use to develop the solution? To answer, move the appropriate code segments from the list of code segments to the answer area and arrange them in the correct order.

      NOTE: More than one order of answer choices is correct. You will receive the credit for any of the correct orders you select.

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 159 Question
      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q12 159 Answer
      Explanation:

      Step 1: from azureml.train.hyperdrive

      Step 2: Import TruncationSelectionPolicy
      Truncation selection cancels a given percentage of lowest performing runs at each evaluation interval. Runs are compared based on their performance on the primary metric and the lowest X% are terminated.

      Scenario: You must configure hyperparameters in the model learning process to speed the learning phase. In addition, this configuration should cancel the lowest performing runs at each evaluation interval, thereby directing effort and resources towards models that are more likely to be successful.

      Step 3: early_termination_policy = TruncationSelectionPolicy..

      Example:
      from azureml.train.hyperdrive import TruncationSelectionPolicy
      early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)
      In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run will be terminated at interval 5 if its performance at interval 5 is in the lowest 20% of performance of all runs at interval 5.

      Incorrect Answers:
      Median:
      Median stopping is an early termination policy based on running averages of primary metrics reported by the runs. This policy computes running averages across all training runs and terminates runs whose performance is worse than the median of the running averages.

      Slack:
      Bandit is a termination policy based on slack factor/slack amount and evaluation interval. The policy early terminates any runs where the primary metric is not within the specified slack factor / slack amount with respect to the best performing training run.

  13. HOTSPOT

    You are a lead data scientist for a project that tracks the health and migration of birds. You create a multi-class image classification deep learning model that uses a set of labeled bird photos collected by experts. You plan to use the model to develop a cross-platform mobile app that predicts the species of bird captured by app users.

    You must test and deploy the trained model as a web service. The deployed model must meet the following requirements:

    – An authenticated connection must not be required for testing.
    – The deployed model must perform with low latency during inferencing.
    – The REST endpoints must be scalable and have the capacity to handle a large number of requests when multiple end users are using the mobile application.

    You need to verify that the web service returns predictions in the expected JSON format when a valid REST request is submitted.

    Which compute resources should you use? To answer, select the appropriate options in the answer area.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q13 160 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q13 160 Answer
    Explanation:

    Box 1: ds-workstation notebook VM
    An authenticated connection must not be required for testing.
    On a Microsoft Azure virtual machine (VM), including a Data Science Virtual Machine (DSVM), you create local user accounts while provisioning the VM. Users then authenticate to the VM by using these credentials.

    Box 2: gpu-compute cluster
    Image classification is well suited for GPU compute clusters

  14. You create a deep learning model for image recognition on Azure Machine Learning service using GPU-based training.

    You must deploy the model to a context that allows for real-time GPU-based inferencing.

    You need to configure compute resources for model inferencing.

    Which compute type should you use?

    • Azure Container Instance
    • Azure Kubernetes Service
    • Field Programmable Gate Array
    • Machine Learning Compute
    Explanation:

    You can use Azure Machine Learning to deploy a GPU-enabled model as a web service. Deploying a model on Azure Kubernetes Service (AKS) is one option. The AKS cluster provides a GPU resource that is used by the model for inference.

    Inference, or model scoring, is the phase where the deployed model is used to make predictions. Using GPUs instead of CPUs offers performance advantages on highly parallelizable computation.
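
    A minimal deployment sketch for this scenario; the registered model, environment, entry script, and GPU-enabled AKS target are all assumed to exist already:

    from azureml.core.model import Model, InferenceConfig
    from azureml.core.webservice import AksWebservice

    inference_config = InferenceConfig(entry_script='score.py', environment=env)

    # Request a GPU for the scoring containers on the AKS cluster.
    deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=4, gpu_cores=1)

    service = Model.deploy(workspace=ws,
                           name='gpu-image-recognition',
                           models=[model],
                           inference_config=inference_config,
                           deployment_config=deployment_config,
                           deployment_target=aks_target)
    service.wait_for_deployment(show_output=True)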

  15. You create a batch inference pipeline by using the Azure ML SDK. You run the pipeline by using the following code:

    from azureml.pipeline.core import Pipeline
    from azureml.core.experiment import Experiment
    pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
    pipeline_run = Experiment(ws, 'batch_pipeline').submit(pipeline)

    You need to monitor the progress of the pipeline execution.

    What are two possible ways to achieve this goal? Each correct answer presents a complete solution.

    NOTE: Each correct selection is worth one point.

    • DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q15 161
    • Use the Inference Clusters tab in Machine Learning Studio.
    • Use the Activity log in the Azure portal for the Machine Learning workspace.
    • Run the following code in a notebook:

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q15 162
    • Run the following code and monitor the console output from the PipelineRun object:

      DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q15 163
    Explanation:

    A batch inference job can take a long time to finish. This example monitors progress by using a Jupyter widget. You can also manage the job’s progress by using:
    – Azure Machine Learning Studio.
    – Console output from the PipelineRun object.

    from azureml.widgets import RunDetails
    RunDetails(pipeline_run).show()

    pipeline_run.wait_for_completion(show_output=True)

  16. You train and register a model in your Azure Machine Learning workspace.

    You must publish a pipeline that enables client applications to use the model for batch inferencing. You must use a pipeline with a single ParallelRunStep step that runs a Python inferencing script to get predictions from the input data.

    You need to create the inferencing script for the ParallelRunStep pipeline step.

    Which two functions should you include? Each correct answer presents part of the solution.

    NOTE: Each correct selection is worth one point.

    • run(mini_batch)
    • main()
    • batch()
    • init()
    • score(mini_batch)
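
    For context, a skeleton of such an inferencing script might look like the sketch below (the model name and the form of the results are assumptions):

    import joblib
    from azureml.core.model import Model

    def init():
        # Runs once per worker process: load the registered model into a global.
        global model
        model_path = Model.get_model_path('my-registered-model')   # hypothetical model name
        model = joblib.load(model_path)

    def run(mini_batch):
        # Runs once per mini-batch (a list of file paths or a pandas DataFrame,
        # depending on the input dataset); return one result per input item.
        results = []
        for item in mini_batch:
            results.append(f"processed: {item}")
        return results
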
  17. You deploy a model as an Azure Machine Learning real-time web service using the following code.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q17 164

    The deployment fails.

    You need to troubleshoot the deployment failure by determining the actions that were performed during deployment and identifying the specific action that failed.

    Which code segment should you run?

    • service.get_logs()
    • service.state
    • service.serialize()
    • service.update_deployment_state()
    Explanation:

    You can print out detailed Docker engine log messages from the service object. You can view the log for ACI, AKS, and Local deployments. The following example demonstrates how to print the logs.

    # if you already have the service object handy
    print(service.get_logs())

    # if you only know the name of the service (note there might be multiple services with the same name but different version number)
    print(ws.webservices['mysvc'].get_logs())

  18. HOTSPOT

    You deploy a model in Azure Container Instance.

    You must use the Azure Machine Learning SDK to call the model API.

    You need to invoke the deployed model using native SDK classes and methods.

    How should you complete the command? To answer, select the appropriate options in the answer areas.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q18 165 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 08 Q18 165 Answer
    Explanation:

    Box 1: from azureml.core.webservice import Webservice
    The following code shows how to use the SDK to update the model, environment, and entry script for a web service to Azure Container Instances:

    from azureml.core import Environment
    from azureml.core.webservice import Webservice
    from azureml.core.model import Model, InferenceConfig

    Box 2: predictions = service.run(input_json)

    Example: The following code demonstrates sending data to the service:
    import json

    test_sample = json.dumps({'data': [
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
    ]})

    test_sample = bytes(test_sample, encoding='utf8')

    prediction = service.run(input_data=test_sample)
    print(prediction)

  19. You create a multi-class image classification deep learning model.

    You train the model by using PyTorch version 1.2.

    You need to ensure that the correct version of PyTorch can be identified for the inferencing environment when the model is deployed.

    What should you do?

    • Save the model locally as a .pt file, and deploy the model as a local web service.
    • Deploy the model on computer that is configured to use the default Azure Machine Learning conda environment.
    • Register the model with a .pt file extension and the default version property.
    • Register the model, specifying the model_framework and model_framework_version properties.
    Explanation:
    framework_version: The PyTorch version to be used for executing training code.
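
    A registration sketch along those lines; the workspace variable, model name, and local path are placeholders:

    from azureml.core import Model

    model = Model.register(workspace=ws,
                           model_name='image-classifier',          # hypothetical name
                           model_path='./outputs/model.pt',        # locally saved PyTorch model
                           model_framework=Model.Framework.PYTORCH,
                           model_framework_version='1.2')
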
  20. You train a machine learning model.

    You must deploy the model as a real-time inference service for testing. The service requires low CPU utilization and less than 48 MB of RAM. The compute target for the deployed service must initialize automatically while minimizing cost and administrative overhead.

    Which compute target should you use?

    • Azure Container Instance (ACI)
    • attached Azure Databricks cluster
    • Azure Kubernetes Service (AKS) inference cluster
    • Azure Machine Learning compute cluster
    Explanation:

    Azure Container Instances (ACI) are suitable only for small models less than 1 GB in size.
    Use it for low-scale CPU-based workloads that require less than 48 GB of RAM.

    Note: Microsoft recommends using single-node Azure Kubernetes Service (AKS) clusters for dev-test of larger models.
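
    A lightweight ACI configuration for such a test deployment might be sketched as follows (the resource values are illustrative):

    from azureml.core.webservice import AciWebservice

    # Small CPU/memory footprint, no authentication, and the container initializes on deployment.
    aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=False)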