DP-100 : Designing and Implementing a Data Science Solution on Azure : Part 02

  1. You plan to provision an Azure Machine Learning Basic edition workspace for a data science project.

    You need to identify the tasks you will be able to perform in the workspace.

    Which three tasks will you be able to perform? Each correct answer presents a complete solution.

    NOTE: Each correct selection is worth one point.

    • Create a Compute Instance and use it to run code in Jupyter notebooks.
    • Create an Azure Kubernetes Service (AKS) inference cluster.
    • Use the designer to train a model by dragging and dropping pre-defined modules.
    • Create a tabular dataset that supports versioning.
    • Use the Automated Machine Learning user interface to train a model.

    Explanation:

    C, E: The designer and the Automated Machine Learning user interface are included in the Enterprise edition only; they are not available in the Basic edition.

  2. HOTSPOT

    A coworker registers a datastore in a Machine Learning services workspace by using the following code:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q02 007

    You need to write code to access the datastore from a notebook.

    How should you complete the code segment? To answer, select the appropriate options in the answer area.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q02 008 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q02 008 Answer
    Explanation:

    Box 1: Datastore
    To get a specific datastore registered in the current workspace, use the get() static method on the Datastore class:
    # Get a named datastore from the current workspace
    datastore = Datastore.get(ws, datastore_name='your datastore name')

    Box 2: ws

    Box 3: demo_datastore

  3. A set of CSV files contains sales records. All the CSV files have the same data schema.

    Each CSV file contains the sales record for a particular month and has the filename sales.csv. Each file is stored in a folder that indicates the month and year when the data was recorded. The folders are in an Azure blob container for which a datastore has been defined in an Azure Machine Learning workspace. The folders are organized in a parent folder named sales to create the following hierarchical structure:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q03 009

    At the end of each month, a new folder with that month’s sales file is added to the sales folder.

    You plan to use the sales data to train a machine learning model based on the following requirements:

    – You must define a dataset that loads all of the sales data to date into a structure that can be easily converted to a dataframe.
    – You must be able to create experiments that use only data that was created before a specific previous month, ignoring any data that was added after that month.
    – You must register the minimum number of datasets possible.

    You need to register the sales data as a dataset in Azure Machine Learning service workspace.

    What should you do?

    • Create a tabular dataset that references the datastore and explicitly specifies each ‘sales/mm-yyyy/sales.csv’ file every month. Register the dataset with the name sales_dataset each month, replacing the existing dataset and specifying a tag named month indicating the month and year it was registered. Use this dataset for all experiments.
    • Create a tabular dataset that references the datastore and specifies the path ‘sales/*/sales.csv’, register the dataset with the name sales_dataset and a tag named month indicating the month and year it was registered, and use this dataset for all experiments.
    • Create a new tabular dataset that references the datastore and explicitly specifies each ‘sales/mm-yyyy/sales.csv’ file every month. Register the dataset with the name sales_dataset_MM-YYYY each month with appropriate MM and YYYY values for the month and year. Use the appropriate month-specific dataset for experiments.
    • Create a tabular dataset that references the datastore and explicitly specifies each ‘sales/mm-yyyy/sales.csv’ file. Register the dataset with the name sales_dataset each month as a new version and with a tag named month indicating the month and year it was registered. Use this dataset for all experiments, identifying the version to be used based on the month tag as necessary.
    Explanation:

    Specify the path.

    Example:
    The following code gets the existing workspace and the desired datastore by name, then passes the datastore and file locations to the path parameter to create a new TabularDataset, weather_ds.

    from azureml.core import Workspace, Datastore, Dataset

    datastore_name = 'your datastore name'

    # get existing workspace
    workspace = Workspace.from_config()

    # retrieve an existing datastore in the workspace by name
    datastore = Datastore.get(workspace, datastore_name)

    # create a TabularDataset from 3 file paths in datastore
    datastore_paths = [(datastore, 'weather/2018/11.csv'),
                       (datastore, 'weather/2018/12.csv'),
                       (datastore, 'weather/2019/*.csv')]

    weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)

  4. DRAG DROP

    An organization uses Azure Machine Learning service and wants to expand their use of machine learning.

    You have the following compute environments. The organization does not want to create another compute environment.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q04 010

    You need to determine which compute environment to use for the following scenarios.

    Which compute types should you use? To answer, drag the appropriate compute environments to the correct scenarios. Each compute environment may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q04 011 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q04 011 Answer
    Explanation:

    Box 1: nb_server

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q04 012

    Box 2: mlc_cluster
    With Azure Machine Learning, you can train your model on a variety of resources or environments, collectively referred to as compute targets. A compute target can be a local machine or a cloud resource, such as an Azure Machine Learning Compute, Azure HDInsight or a remote virtual machine.

  5. HOTSPOT

    You create an Azure Machine Learning compute target named ComputeOne by using the STANDARD_D1 virtual machine image.

    ComputeOne is currently idle and has zero active nodes.

    You define a Python variable named ws that references the Azure Machine Learning workspace. You run the following Python code:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q05 013

    For each of the following statements, select Yes if the statement is true. Otherwise, select No.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q05 014 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q05 014 Answer
    Explanation:

    Box 1: Yes
    ComputeTargetException class: An exception related to failures when creating, interacting with, or configuring a compute target. This exception is commonly raised for failures attaching a compute target, missing headers, and unsupported configuration values.

    create(workspace, name, provisioning_configuration)
    Provision a Compute object by specifying a compute type and related configuration.

    This method creates a new compute target rather than attaching an existing one.

    Box 2: Yes

    Box 3: No
    The line before print('Step1') will fail.

  6. HOTSPOT

    You are developing a deep learning model by using TensorFlow. You plan to run the model training workload on an Azure Machine Learning Compute Instance.

    You must use CUDA-based model training.

    You need to provision the Compute Instance.

    Which two virtual machines sizes can you use? To answer, select the appropriate virtual machine sizes in the answer area.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q06 015 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q06 015 Answer
    Explanation:
    CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation.
  7. DRAG DROP

    You are analyzing a raw dataset that requires cleaning.

    You must perform transformations and manipulations by using Azure Machine Learning Studio.

    You need to identify the correct modules to perform the transformations.

    Which modules should you choose? To answer, drag the appropriate modules to the correct scenarios. Each module may be used once, more than once, or not at all.
    You may need to drag the split bar between panes or scroll to view content.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q07 016 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q07 016 Answer
    Explanation:

    Box 1: Clean Missing Data

    Box 2: SMOTE
    Use the SMOTE module in Azure Machine Learning Studio to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.

    Box 3: Convert to Indicator Values
    Use the Convert to Indicator Values module in Azure Machine Learning Studio. The purpose of this module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model.

    Box 4: Remove Duplicate Rows

  8. Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

    After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

    You are using Azure Machine Learning Studio to perform feature engineering on a dataset.

    You need to normalize values to produce a feature column grouped into bins.

    Solution: Apply an Entropy Minimum Description Length (MDL) binning mode.

    Does the solution meet the goal?

    • Yes
    • No
    Explanation:
    Entropy MDL binning mode: This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column. It then returns the bin number associated with each row of your data in a column named <colname> quantized.
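    As a rough illustration of binning in general (not the Entropy MDL edge-selection itself, which learns the edges from the target column), the following standard-library sketch maps values to bin indices given a fixed, hypothetical set of bin edges:

```python
import bisect

# Hypothetical bin edges; a real Entropy MDL pass would choose these
# edges to best predict the target column.
edges = [10.0, 20.0, 30.0]

def bin_index(value, edges):
    # Number of edges at or below the value = index of the bin it falls in.
    return bisect.bisect_right(edges, value)

bins = [bin_index(v, edges) for v in [5.0, 10.0, 25.0, 42.0]]
# bins is [0, 1, 2, 3]
```

    The returned indices play the role of the <colname> quantized column described above.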
  9. HOTSPOT

    You are preparing to use the Azure ML SDK to run an experiment and need to create compute. You run the following code:

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q09 017

    For each of the following statements, select Yes if the statement is true. Otherwise, select No.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q09 018 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q09 018 Answer

    Explanation:

    Box 1: No
    If a compute cluster with that name already exists, it will be used.

    Box 2: Yes
    The wait_for_completion method waits for the current provisioning operation to finish on the cluster.

    Box 3: Yes
    Low Priority VMs use Azure’s excess capacity and are thus cheaper but risk your run being pre-empted.

    Box 4: No
    Need to use training_compute.delete() to deprovision and delete the AmlCompute target.

  10. Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

    After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

    You are a data scientist using Azure Machine Learning Studio.

    You need to normalize values to produce an output column into bins to predict a target column.

    Solution: Apply a Quantiles normalization with a QuantileIndex normalization.

    Does the solution meet the goal?

    • Yes
    • No
    Explanation:
    Instead, use the Entropy MDL binning mode, which bins with respect to a target column.
  11. Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

    After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

    You are creating a new experiment in Azure Machine Learning Studio.

    One class has a much smaller number of observations than the other classes in the training set.

    You need to select an appropriate data sampling strategy to compensate for the class imbalance.

    Solution: You use the Scale and Reduce sampling mode.

    Does the solution meet the goal?

    • Yes
    • No
    Explanation:

    Instead use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode.

    Note: SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.

    Incorrect Answers:
    Common data tasks for the Scale and Reduce sampling mode include clipping, binning, and normalizing numerical values.
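    The oversampling idea behind SMOTE can be sketched in plain Python. This toy version interpolates between two randomly chosen minority samples (a real SMOTE implementation interpolates toward one of the sample's k-nearest minority-class neighbours); the data points are hypothetical:

```python
import random

def smote_sketch(minority, n_new, seed=0):
    """Synthesize n_new points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = rng.choice(minority)  # real SMOTE: a k-nearest neighbour of a
        t = rng.random()          # interpolation factor in [0, 1)
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
new_points = smote_sketch(minority, n_new=2)
```

    Because each synthetic point lies between two real minority samples, the new cases are plausible rather than exact duplicates.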

  12. You are analyzing a dataset by using Azure Machine Learning Studio.

    You need to generate a statistical summary that contains the p-value and the unique count for each feature column.

    Which two modules can you use? Each correct answer presents a complete solution.

    NOTE: Each correct selection is worth one point.

    • Compute Linear Correlation
    • Export Count Table
    • Execute Python Script
    • Convert to Indicator Values
    • Summarize Data
    Explanation:

    The Export Count Table module is provided for backward compatibility with experiments that use the Build Count Table (deprecated) and Count Featurizer (deprecated) modules.

    E: Summarize Data statistics are useful when you want to understand the characteristics of the complete dataset. For example, you might need to know:
    – How many missing values are there in each column?
    – How many unique values are there in a feature column?
    – What is the mean and standard deviation for each column?

    The module calculates these scores for each column and returns a row of summary statistics for each variable (data column) provided as input.

    Incorrect Answers:
    A: The Compute Linear Correlation module in Azure Machine Learning Studio is used to compute a set of Pearson correlation coefficients for each possible pair of variables in the input dataset.

    C: With Python, you can perform tasks that aren’t currently supported by existing Studio modules such as:
    Visualizing data using matplotlib
    Using Python libraries to enumerate datasets and models in your workspace
    Reading, loading, and manipulating data from sources not supported by the Import Data module

    D: The purpose of the Convert to Indicator Values module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model.
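    The per-column statistics that Summarize Data reports can be sketched with the standard library; the column values here are hypothetical:

```python
import statistics

# Toy column with one missing value (None).
column = [4.0, 8.0, 8.0, None, 15.0]
values = [v for v in column if v is not None]

summary = {
    "missing_count": sum(1 for v in column if v is None),  # 1
    "unique_count": len(set(values)),                      # 3
    "mean": statistics.mean(values),                       # 8.75
    "std_dev": statistics.stdev(values),
}
```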

  13. Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

    After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

    You are analyzing a numerical dataset which contains missing values in several columns.

    You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set.

    You need to analyze a full dataset to include all values.

    Solution: Use the Last Observation Carried Forward (LOCF) method to impute the missing data points.

    Does the solution meet the goal?

    • Yes
    • No
    Explanation:

    Instead use the Multiple Imputation by Chained Equations (MICE) method.

    Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as “Multivariate Imputation using Chained Equations” or “Multiple Imputation by Chained Equations”. With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.

    Note: Last observation carried forward (LOCF) is a method of imputing missing data in longitudinal studies. If a person drops out of a study before it ends, then his or her last observed score on the dependent variable is used for all subsequent (i.e., missing) observation points. LOCF is used to maintain the sample size and to reduce the bias caused by the attrition of participants in a study.
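    LOCF itself is straightforward to sketch in plain Python (hypothetical series; None marks a missing observation), which also shows why it only suits ordered, longitudinal data:

```python
def locf(series):
    """Fill each missing value with the last observed value, if any."""
    filled, last = [], None
    for v in series:
        if v is None and last is not None:
            v = last          # carry the last observation forward
        if v is not None:
            last = v
        filled.append(v)
    return filled

filled = locf([3.0, None, None, 7.0, None])
# filled is [3.0, 3.0, 3.0, 7.0, 7.0]
```

    Note that a missing value before any observation cannot be filled, and nothing about the other columns informs the imputation, unlike MICE.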

  14. HOTSPOT

    You are creating a machine learning model in Python. The provided dataset contains several numerical columns and one text column. The text column represents a product’s category. The product category will always be one of the following:

    – Bikes
    – Cars
    – Vans
    – Boats

    You are building a regression model using the scikit-learn Python package.

    You need to transform the text data to be compatible with the scikit-learn Python package.

    How should you complete the code segment? To answer, select the appropriate options in the answer area.

    NOTE: Each correct selection is worth one point.

    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q14 019 Question
    DP-100 Designing and Implementing a Data Science Solution on Azure Part 02 Q14 019 Answer
    Explanation:

    Box 1: pandas as df
    Pandas takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a data frame, which looks very similar to a table in statistical software (think Excel or SPSS, for example).

    Box 2: transpose[ProductCategoryMapping]
    Reshape the data from the pandas Series to columns.
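    Although the hotspot uses a pandas-based mapping, the underlying transformation is indicator (one-hot) encoding of the fixed category vocabulary; a minimal plain-Python sketch:

```python
# The four categories are fixed by the question; encoding them as binary
# indicator columns makes the text column usable by scikit-learn models.
CATEGORIES = ["Bikes", "Cars", "Vans", "Boats"]

def to_indicator(category, categories=CATEGORIES):
    """Return one binary indicator per known category."""
    return [1 if category == c else 0 for c in categories]

rows = [to_indicator(c) for c in ["Cars", "Boats"]]
# rows is [[0, 1, 0, 0], [0, 0, 0, 1]]
```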

  15. You plan to deliver a hands-on workshop to several students. The workshop will focus on creating data visualizations using Python. Each student will use a device that has internet access.

    Student devices are not configured for Python development. Students do not have administrator access to install software on their devices. Azure subscriptions are not available for students.

    You need to ensure that students can run Python-based data visualization code.

    Which Azure tool should you use?

    • Anaconda Data Science Platform
    • Azure BatchAI
    • Azure Notebooks
    • Azure Machine Learning Service
  16. Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

    After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

    You are analyzing a numerical dataset which contains missing values in several columns.

    You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set.

    You need to analyze a full dataset to include all values.

    Solution: Replace each missing value using the Multiple Imputation by Chained Equations (MICE) method.

    Does the solution meet the goal?

    • Yes
    • No
    Explanation:

    Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as “Multivariate Imputation using Chained Equations” or “Multiple Imputation by Chained Equations”. With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.

    Note: Multivariate imputation by chained equations (MICE), sometimes called “fully conditional specification” or “sequential regression multiple imputation” has emerged in the statistical literature as one principled method of addressing missing data. Creating multiple imputations, as opposed to single imputations, accounts for the statistical uncertainty in the imputations. In addition, the chained equations approach is very flexible and can handle variables of varying types (e.g., continuous or binary) as well as complexities such as bounds or survey skip patterns.
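    A drastically simplified, single-pass sketch of the chained-equations idea, using hypothetical data and one incomplete column (real MICE iterates over every incomplete variable and draws multiple imputations):

```python
def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, None, 8.0, 10.0]   # incomplete column (y = 2x underneath)

# Model the incomplete column conditionally on the complete one,
# then fill each missing value from that model.
known = [(a, b) for a, b in zip(x, y) if b is not None]
slope, intercept = fit_line([a for a, _ in known], [b for _, b in known])
y_filled = [b if b is not None else slope * a + intercept
            for a, b in zip(x, y)]
# y_filled is [2.0, 4.0, 6.0, 8.0, 10.0]
```

    The dimensionality of the feature set is preserved: no column is dropped, and the filled value is informed by the other variables.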

  17. Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

    After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

    You are analyzing a numerical dataset which contains missing values in several columns.

    You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set.

    You need to analyze a full dataset to include all values.

    Solution: Remove the entire column that contains the missing data point.

    Does the solution meet the goal?

    • Yes
    • No
    Explanation:
    Use the Multiple Imputation by Chained Equations (MICE) method.
  18. You are creating a new experiment in Azure Machine Learning Studio. You have a small dataset that has missing values in many columns. The data does not require the application of predictors for each column. You plan to use the Clean Missing Data.

    You need to select a data cleaning method.

    Which method should you use?

    • Replace using Probabilistic PCA
    • Normalization
    • Synthetic Minority Oversampling Technique (SMOTE)
    • Replace using MICE
    Explanation:
    Replace using Probabilistic PCA: Compared to other options, such as Multiple Imputation using Chained Equations (MICE), this option has the advantage of not requiring the application of predictors for each column. Instead, it approximates the covariance for the full dataset. Therefore, it might offer better performance for datasets that have missing values in many columns.
  19. You use Azure Machine Learning Studio to build a machine learning experiment.

    You need to divide data into two distinct datasets.

    Which module should you use?

    • Split Data
    • Load Trained Model
    • Assign Data to Clusters
    • Group Data into Bins
    Explanation:
    The Split Data module divides a dataset into two distinct sets, for example to separate the data used for training from the data used for testing. You can configure the split fraction, randomization, and stratification.
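    A train/test split like the one Split Data performs can be sketched in plain Python (the fraction and seed here are illustrative):

```python
import random

def split_rows(rows, fraction, seed=42):
    """Shuffle rows, then cut them into two distinct datasets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_rows(list(range(10)), fraction=0.7)
# 7 rows in train, 3 in test; every row lands in exactly one set
```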
  20. You are a lead data scientist for a project that tracks the health and migration of birds. You create a multi-class image classification deep learning model that uses a set of labeled bird photographs collected by experts.

    You have 100,000 photographs of birds. All photographs use the JPG format and are stored in an Azure blob container in an Azure subscription.

    You need to access the bird photograph files in the Azure blob container from the Azure Machine Learning service workspace that will be used for deep learning model training. You must minimize data movement.

    What should you do?

    • Create an Azure Data Lake store and move the bird photographs to the store.
    • Create an Azure Cosmos DB database and attach the Azure Blob containing bird photographs storage to the database.
    • Create and register a dataset by using TabularDataset class that references the Azure blob storage containing bird photographs.
    • Register the Azure blob storage containing the bird photographs as a datastore in Azure Machine Learning service.
    • Copy the bird photographs to the blob datastore that was created with your Azure Machine Learning service workspace.
    Explanation:
    We recommend creating a datastore for an Azure Blob container. When you create a workspace, an Azure blob container and an Azure file share are automatically registered to the workspace.