DP-100 : Designing and Implementing a Data Science Solution on Azure : Part 05
-
DRAG DROP
You have a dataset that contains over 150 features. You use the dataset to train a Support Vector Machine (SVM) binary classifier.
You need to use the Permutation Feature Importance module in Azure Machine Learning Studio to compute a set of feature importance scores for the dataset.
In which order should you perform the actions? To answer, move all actions from the list of actions to the answer area and arrange them in the correct order.
Explanation:
Step 1: Add a Two-Class Support Vector Machine module to initialize the SVM classifier.
Step 2: Add a dataset to the experiment.
Step 3: Add a Split Data module to create training and test datasets.
To generate a set of feature scores requires that you have an already trained model, as well as a test dataset.
Step 4: Add a Permutation Feature Importance module and connect it to the trained model and test dataset.
Step 5: Set the Metric for measuring performance property to Classification – Accuracy and then run the experiment.
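The same workflow can be sketched outside Studio with scikit-learn's permutation_importance function; this is an illustrative analogue under assumed toy data, not the Studio module itself:
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy dataset with 150 features: train an SVM, then score each feature
# by permuting it and measuring the drop in accuracy on the test set.
X, y = make_classification(n_samples=500, n_features=150, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = SVC().fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, scoring='accuracy', n_repeats=5, random_state=0)
print(result.importances_mean[:10])  # importance score for the first 10 features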
-
HOTSPOT
You are using the Hyperdrive feature in Azure Machine Learning to train a model.
You configure the Hyperdrive experiment by running the following code:
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
Explanation:
Box 1: Yes
In random sampling, hyperparameter values are randomly selected from the defined search space. Random sampling allows the search space to include both discrete and continuous hyperparameters.
Box 2: Yes
learning_rate has a normal distribution with a mean value of 10 and a standard deviation of 3.
Box 3: No
keep_probability has a uniform distribution with a minimum value of 0.05 and a maximum value of 0.1.
Box 4: No
number_of_hidden_layers takes on one of the values [3, 4, 5].
-
You are performing a filter-based feature selection for a dataset to build a multi-class classifier by using Azure Machine Learning Studio.
The dataset contains categorical features that are highly correlated to the output label column.
You need to select the appropriate feature scoring statistical method to identify the key predictors.
Which method should you use?
- Kendall correlation
- Spearman correlation
- Chi-squared
- Pearson correlation
Explanation:
Pearson's correlation statistic, or Pearson's correlation coefficient, is also known in statistical models as the r value. For any two variables, it returns a value that indicates the strength of the correlation.
Pearson’s correlation coefficient is the test statistics that measures the statistical relationship, or association, between two continuous variables. It is known as the best method of measuring the association between variables of interest because it is based on the method of covariance. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship.
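As a quick aside, the r value can be illustrated with scipy on two toy variables (not part of the Studio module itself):
import numpy as np
from scipy.stats import pearsonr

# Two toy variables with a strong positive linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])
r, p_value = pearsonr(x, y)
print(r)  # close to +1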
Incorrect Answers:
C: The two-way chi-squared test is a statistical method that measures how close expected values are to actual results.
-
HOTSPOT
You create a binary classification model to predict whether a person has a disease.
You need to detect possible classification errors.
Which error type should you choose for each description? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Explanation:
Box 1: True Positive
A true positive is an outcome where the model correctly predicts the positive class.
Box 2: True Negative
A true negative is an outcome where the model correctly predicts the negative class.
Box 3: False Positive
A false positive is an outcome where the model incorrectly predicts the positive class.
Box 4: False Negative
A false negative is an outcome where the model incorrectly predicts the negative class.
Note: Let's make the following definitions:
"Wolf" is a positive class.
"No wolf" is a negative class.
We can summarize our "wolf-prediction" model using a 2×2 confusion matrix that depicts all four possible outcomes:
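The four outcomes can be counted with scikit-learn's confusion_matrix; a minimal sketch on toy predictions:
from sklearn.metrics import confusion_matrix

# 1 = has disease (positive class), 0 = no disease (negative class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 3 true positives, 3 true negatives, 1 false positive, 1 false negative
-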
HOTSPOT
You are using the Azure Machine Learning Service to automate hyperparameter exploration of your neural network classification model.
You must define the hyperparameter space to automatically tune hyperparameters using random sampling according to following requirements:
– The learning rate must be selected from a normal distribution with a mean value of 10 and a standard deviation of 3.
– Batch size must be one of 16, 32, or 64.
– Keep probability must be a value selected from a uniform distribution over the range 0.05 to 0.1.
You need to use the param_sampling method of the Python API for the Azure Machine Learning Service.
How should you complete the code segment? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Explanation:
Box 1: normal(10, 3)
Box 2: choice(16, 32, 64)
Box 3: uniform(0.05, 0.1)
In random sampling, hyperparameter values are randomly selected from the defined search space. Random sampling allows the search space to include both discrete and continuous hyperparameters.
Example:
from azureml.train.hyperdrive import RandomParameterSampling, choice, normal, uniform

param_sampling = RandomParameterSampling({
    "learning_rate": normal(10, 3),
    "keep_probability": uniform(0.05, 0.1),
    "batch_size": choice(16, 32, 64)
})
-
You plan to use automated machine learning to train a regression model. The data includes features that have missing values, as well as categorical features with few distinct values.
You need to configure automated machine learning to automatically impute missing values and encode categorical features as part of the training task.
Which parameter and value pair should you use in the AutoMLConfig class?
-
featurization = 'auto'
-
enable_voting_ensemble = True
-
task = 'classification'
-
exclude_nan_labels = True
-
enable_tf = True
Explanation:
featurization: str or FeaturizationConfig
Values: 'auto' / 'off' / FeaturizationConfig
Indicator for whether the featurization step should be done automatically or not, or whether customized featurization should be used. Column types are automatically detected. Based on the detected column type, preprocessing/featurization is done as follows:
Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values.
Numeric: Impute missing values, cluster distance, weight of evidence.
DateTime: Several features such as day, seconds, minutes, hours, etc.
Text: Bag of words, pre-trained word embedding, text target encoding.
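A minimal sketch of the corresponding configuration; the dataset and label column name are placeholders:
from azureml.train.automl import AutoMLConfig

# featurization='auto' turns on automatic imputation of missing values
# and encoding of categorical features during training.
automl_config = AutoMLConfig(
    task='regression',
    training_data=train_dataset,   # placeholder TabularDataset
    label_column_name='target',    # placeholder label column
    featurization='auto'
)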
-
DRAG DROP
You create a training pipeline using the Azure Machine Learning designer. You upload a CSV file that contains the data from which you want to train your model.
You need to use the designer to create a pipeline that includes steps to perform the following tasks:
– Select the training features using the pandas filter method.
– Train a model based on the naive_bayes.GaussianNB algorithm.
– Return only the Scored Labels column by using the query SELECT [Scored Labels] FROM t1;
Which modules should you use? To answer, drag the appropriate modules to the appropriate locations. Each module name may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Explanation:
Box 1: Two-Class Neural Network
The Two-Class Neural Network creates a binary classifier using a neural network algorithm.
Train a model based on the naive_bayes.GaussianNB algorithm.
Box 2: Execute Python Script
Select the training features using the pandas filter method.
Box 3: Select Columns in Dataset
Return only the Scored Labels column by using the query SELECT [Scored Labels] FROM t1;
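For the Execute Python Script step, a minimal sketch of the module body; azureml_main is the designer's entry point, and the column names are placeholders:
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Select the training features using the pandas filter method.
    features = dataframe1.filter(items=['age', 'income', 'home_owner'])
    return features,
-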
You are building a regression model for estimating the number of calls during an event.
You need to determine whether the feature values achieve the conditions to build a Poisson regression model.
Which two conditions must the feature set contain? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
- The label data must be a negative value.
- The label data must be whole numbers.
- The label data must be non-discrete.
- The label data must be a positive value.
- The label data can be positive or negative.
Explanation:
Poisson regression is intended for use in regression models that are used to predict numeric values, typically counts. Therefore, you should use this module to create your regression model only if the values you are trying to predict fit the following conditions:
– The response variable has a Poisson distribution.
– Counts cannot be negative. The method will fail outright if you attempt to use it with negative labels.
– A Poisson distribution is a discrete distribution; therefore, it is not meaningful to use this method with non-whole numbers.
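The two conditions can be checked quickly in code; a toy sketch with numpy on a placeholder label vector:
import numpy as np

labels = np.array([0, 3, 1, 7, 2])  # call counts per event
print(np.all(labels >= 0))                # True: no negative values
print(np.all(labels == np.floor(labels)))  # True: whole numbers only
-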
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are creating a new experiment in Azure Machine Learning Studio.
One class has a much smaller number of observations than the other classes in the training set.
You need to select an appropriate data sampling strategy to compensate for the class imbalance.
Solution: You use the Principal Components Analysis (PCA) sampling mode.
Does the solution meet the goal?
- Yes
- No
Explanation:
Instead use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode.
Note: SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.
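Outside Studio, the same idea can be sketched with the imbalanced-learn library (an analogue of the SMOTE module, assuming imbalanced-learn is installed):
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A toy imbalanced dataset: roughly 95% majority class, 5% minority.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
print(Counter(y))

# SMOTE synthesizes new minority-class examples rather than duplicating them.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes now balanced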
Incorrect Answers:
The Principal Component Analysis module in Azure Machine Learning Studio (classic) is used to reduce the dimensionality of your training data. The module analyzes your data and creates a reduced feature set that captures all the information contained in the dataset, but in a smaller number of features.
-
You are performing feature engineering on a dataset.
You must add a feature named CityName and populate the column value with the text London.
You need to add the new feature to the dataset.
Which Azure Machine Learning Studio module should you use?
- Edit Metadata
- Filter Based Feature Selection
- Execute Python Script
- Latent Dirichlet Allocation
Explanation:
Typical metadata changes might include marking columns as features.
-
You are evaluating a completed binary classification machine learning model.
You need to use the precision as the evaluation metric.
Which visualization should you use?
- violin plot
- Gradient descent
- Scatter plot
- Receiver Operating Characteristic (ROC) curve
Explanation:
Receiver operating characteristic (or ROC) is a plot of the correctly classified labels vs. the incorrectly classified labels for a particular model.
Incorrect Answers:
A: A violin plot is a visual that traditionally combines a box plot and a kernel density plot.
B: Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
C: A scatter plot graphs the actual values in your data against the values predicted by the model. The scatter plot displays the actual values along the X-axis, and displays the predicted values along the Y-axis. It also displays a line that illustrates the perfect prediction, where the predicted value exactly matches the actual value.
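A minimal sketch of plotting an ROC curve for a fitted binary classifier, assuming a recent scikit-learn and toy data:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Plot true positive rate against false positive rate on the test set.
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.show()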
-
You are solving a classification task.
You must evaluate your model on a limited data sample by using k-fold cross-validation. You start by configuring a k parameter as the number of splits.
You need to configure the k parameter for the cross-validation.
Which value should you use?
-
k=1
-
k=10
-
k=0.5
-
k=0.9
Explanation:
Leave One Out (LOO) cross-validation
Setting K = n (the number of observations) yields n-fold cross-validation and is called leave-one-out cross-validation (LOO), a special case of the K-fold approach.
LOO CV is sometimes useful but typically doesn't shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance.
This is why the usual choice is K = 5 or 10. It provides a good compromise for the bias-variance tradeoff.
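A minimal sketch of 10-fold cross-validation with scikit-learn, using a built-in toy dataset:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# cv=10 splits the sample into 10 folds, training on 9 and testing on 1, ten times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean())  # average accuracy across the 10 folds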
-
HOTSPOT
You have a dataset created for multiclass classification tasks that contains a normalized numerical feature set with 10,000 data points and 150 features.
You use 75 percent of the data points for training and 25 percent for testing. You are using the scikit-learn machine learning library in Python. You use X to denote the feature set and Y to denote class labels.
You create the following Python data frames:
You need to apply the Principal Component Analysis (PCA) method to reduce the dimensionality of the feature set to 10 features in both training and testing sets.
How should you complete the code segment? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Explanation:
Box 1: PCA(n_components=10)
Need to reduce the dimensionality of the feature set to 10 features in both training and testing sets.
Example:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # 2 dimensions
principalComponents = pca.fit_transform(x)
Box 2: pca
fit_transform(X[, y]) fits the model with X and applies the dimensionality reduction on X.
Box 3: transform(x_test)
transform(X) applies dimensionality reduction to X.
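Putting the three boxes together, a sketch of the completed segment; x_train and x_test are the data frames from the question:
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
x_train_reduced = pca.fit_transform(x_train)  # fit on training features, then project them
x_test_reduced = pca.transform(x_test)        # apply the same projection to the test features
-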
HOTSPOT
You have a feature set containing the following numerical features: X, Y, and Z.
The Pearson correlation coefficient (r-value) of the X, Y, and Z features is shown in the following image:
Use the drop-down menus to select the answer choice that answers each question based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.
Explanation:
Box 1: 0.859122
Box 2: a positive linear relationship
+1 indicates a strong positive linear relationship.
-1 indicates a strong negative linear correlation.
0 denotes no linear relationship between the two variables.
-
DRAG DROP
You plan to explore demographic data for home ownership in various cities. The data is in a CSV file with the following format:
age,city,income,home_owner
21,Chicago,50000,0
35,Seattle,120000,1
23,Seattle,65000,0
45,Seattle,130000,1
18,Chicago,48000,0
You need to run an experiment in your Azure Machine Learning workspace to explore the data and log the results. The experiment must log the following information:
– the number of observations in the dataset
– a box plot of income by home_owner
– a dictionary containing the city names and the average income for each city
You need to use the appropriate logging methods of the experiment's run object to log the required information.
How should you complete the code? To answer, drag the appropriate code segments to the correct locations. Each code segment may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Explanation:
Box 1: log
The number of observations in the dataset.
run.log(name, value, description='')
Scalar values: Log a numerical or string value to the run with the given name. Logging a metric to a run causes that metric to be stored in the run record in the experiment. You can log the same metric multiple times within a run, the result being considered a vector of that metric.
Example: run.log("accuracy", 0.95)
Box 2: log_image
A box plot of income by home_owner.
log_image: Log an image to the run record. Use log_image to log a .PNG image file or a matplotlib plot to the run. These images will be visible and comparable in the run record.
Example: run.log_image("ROC", plot=plt)
Box 3: log_table
A dictionary containing the city names and the average income for each city.
log_table: Log a dictionary object to the run with the given name.
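A sketch of the completed script under these requirements; the workspace, experiment name, and file name are placeholders:
import matplotlib.pyplot as plt
import pandas as pd
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()
run = Experiment(workspace=ws, name='home-ownership').start_logging()

data = pd.read_csv('demographics.csv')

# Scalar metric: the number of observations in the dataset.
run.log('observations', len(data))

# Image metric: a box plot of income by home_owner.
data.boxplot(column='income', by='home_owner')
run.log_image('income_by_home_owner', plot=plt)

# Table metric: city names and the average income for each city.
avg = data.groupby('city')['income'].mean()
run.log_table('average_income', {'city': list(avg.index), 'income': list(avg.values)})

run.complete()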
-
You use the Azure Machine Learning service to create a tabular dataset named training_data. You plan to use this dataset in a training script.
You create a variable that references the dataset using the following code:
training_ds = workspace.datasets.get("training_data")
You define an estimator to run the script.
You need to set the correct property of the estimator to ensure that your script can access the training_data dataset.
Which property should you set?
-
environment_definition = {"training_data":training_ds}
-
inputs = [training_ds.as_named_input('training_ds')]
-
script_params = {"--training_ds":training_ds}
-
source_directory = training_ds
Explanation:
Example:
# Get the training dataset
diabetes_ds = ws.datasets.get("Diabetes Dataset")

# Create an estimator that uses the remote compute
hyper_estimator = SKLearn(source_directory=experiment_folder,
                          inputs=[diabetes_ds.as_named_input('diabetes')],  # Pass the dataset as an input
                          compute_target=cpu_cluster,
                          conda_packages=['pandas', 'ipykernel', 'matplotlib'],
                          pip_packages=['azureml-sdk', 'argparse', 'pyarrow'],
                          entry_script='diabetes_training.py')
-
You register a file dataset named csv_folder that references a folder. The folder includes multiple comma-separated values (CSV) files in an Azure storage blob container.
You plan to use the following code to run a script that loads data from the file dataset. You create and instantiate the following variables:
You have the following code:
You need to pass the dataset to ensure that the script can read the files it references.
Which code segment should you insert to replace the code comment?
-
inputs=[file_dataset.as_named_input('training_files')],
-
inputs=[file_dataset.as_named_input('training_files').as_mount()],
-
inputs=[file_dataset.as_named_input('training_files').to_pandas_dataframe()],
-
script_params={'--training_files': file_dataset},
Explanation:
Example:
from azureml.train.estimator import Estimator

script_params = {
    # to mount files referenced by mnist dataset
    '--data-folder': mnist_file_dataset.as_named_input('mnist_opendataset').as_mount(),
    '--regularization': 0.5
}

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                environment_definition=env,
                entry_script='train.py')
-
You are creating a new Azure Machine Learning pipeline using the designer.
The pipeline must train a model using data in a comma-separated values (CSV) file that is published on a website. You have not created a dataset for this file.
You need to ingest the data from the CSV file into the designer pipeline using the minimal administrative effort.
Which module should you add to the pipeline in Designer?
- Convert to CSV
- Enter Data Manually
- Import Data
- Dataset
Explanation:
The preferred way to provide data to a pipeline is a Dataset object. The Dataset object points to data that lives in or is accessible from a datastore or at a Web URL. The Dataset class is abstract, so you will create an instance of either a FileDataset (referring to one or more files) or a TabularDataset that's created from one or more files with delimited columns of data.
Example:
from azureml.core import Dataset

iris_tabular_dataset = Dataset.Tabular.from_delimited_files([(def_blob_store, 'train-dataset/iris.csv')])
-
You define a datastore named ml-data for an Azure Storage blob container. In the container, you have a folder named train that contains a file named data.csv.
You plan to train a model from this file by using the Azure Machine Learning SDK to run an experiment on local compute.
You define a DataReference object by running the following code:
You need to load the training data.
Which code segment should you use?
Explanation:
Example:
data_folder = args.data_folder
# Load Train and Test data
train_data = pd.read_csv(os.path.join(data_folder, 'data.csv'))
-
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You create an Azure Machine Learning service datastore in a workspace. The datastore contains the following files:
– /data/2018/Q1.csv
– /data/2018/Q2.csv
– /data/2018/Q3.csv
– /data/2018/Q4.csv
– /data/2019/Q1.csv
All files store data in the following format:
id,f1,f2,I
1,1,2,0
2,1,1,1
3,2,1,0
4,2,2,1
You run the following code:
You need to create a dataset named training_data and load the data from all files into a single data frame by using the following code:
Solution: Run the following code:
Does the solution meet the goal?
- Yes
- No
Explanation:
Define paths with two file paths instead.
Use Dataset.Tabular.from_delimited_files as the data isn't cleansed.
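A sketch of an approach that would meet the goal, assuming datastore references the registered datastore:
from azureml.core import Dataset

# One path per folder; the wildcard picks up every quarterly CSV file.
paths = [(datastore, 'data/2018/*.csv'), (datastore, 'data/2019/*.csv')]
training_data = Dataset.Tabular.from_delimited_files(path=paths)
data_frame = training_data.to_pandas_dataframe()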